### What changes were proposed in this pull request?
Upgrade Apache Parquet to 1.12.1
### Why are the changes needed?
Parquet 1.12.1 contains the following bug fixes:
- PARQUET-2064: Make Range public accessible in RowRanges
- PARQUET-2022: ZstdDecompressorStream should close `zstdInputStream`
- PARQUET-2052: Integer overflow when writing huge binary using dictionary encoding
- PARQUET-1633: Fix integer overflow
- PARQUET-2054: fix TCP leaking when calling ParquetFileWriter.appendFile
- PARQUET-2072: Do Not Determine Both Min/Max for Binary Stats
- PARQUET-2073: Fix estimate remaining row count in ColumnWriteStoreBase
- PARQUET-2078: Failed to read parquet file after writing with the same
In particular, PARQUET-2078 is a blocker for the upcoming Apache Spark 3.2.0 release.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing tests + a new test for the issue in SPARK-36696
Closes #33969 from sunchao/upgrade-parquet-12.1.
Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: DB Tsai <d_tsai@apple.com>
### What changes were proposed in this pull request?
For query
```
select arrays_overlap(array(cast('nan' as double), 1d), array(cast('nan' as double)))
```
This returns [false], but it should return [true].
This issue is caused by `scala.mutable.HashSet` being unable to handle `Double.NaN` and `Float.NaN`.
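The underlying pitfall can be shown outside Spark as well; here is a minimal Python analogue (CPython semantics, not the Scala collection itself):

```python
# IEEE 754 defines NaN != NaN, so equality-based hash-set lookups fail
# when the probe NaN is a different object from the stored one.
a = float("nan")
b = float("nan")  # a distinct NaN object

print(a == b)   # False: NaN never compares equal to anything, even itself
s = {a}
print(b in s)   # False: equality-based lookup misses the stored NaN
print(a in s)   # True: Python short-circuits on object identity
```

The Scala fix has to work around the same property of NaN equality.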
### Why are the changes needed?
Fix bug
### Does this PR introduce _any_ user-facing change?
Yes, `arrays_overlap` now treats `NaN` values as equal.
### How was this patch tested?
Added UT
Closes #34006 from AngersZhuuuu/SPARK-36755.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
According to https://github.com/apache/spark/pull/33955#discussion_r708570515 use normalized NaN
### Why are the changes needed?
Use normalized NaN for duplicated NaN value
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing UTs
Closes #34003 from AngersZhuuuu/SPARK-36702-FOLLOWUP.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
octet_length: calculate the byte length of strings
bit_length: calculate the bit length of strings
Those two string-related functions are implemented only in Spark SQL, not in the Scala, Python, and R APIs.
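For illustration, a minimal Python sketch of what these two functions compute for a multi-byte string (pure Python, not the Spark APIs added by this PR):

```python
# Byte length vs bit length vs character length for a multi-byte string.
def octet_length(s: str) -> int:
    return len(s.encode("utf-8"))   # number of UTF-8 bytes

def bit_length(s: str) -> int:
    return 8 * octet_length(s)      # 8 bits per byte

s = "café"                  # 'é' takes 2 bytes in UTF-8
print(len(s))               # 4 characters
print(octet_length(s))      # 5 bytes
print(bit_length(s))        # 40 bits
```

This is why the functions matter for multi-byte character users: character count and byte count diverge.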
### Why are the changes needed?
Those functions would be useful for users of multi-byte characters, who mainly work with Scala, Python, or R.
### Does this PR introduce _any_ user-facing change?
Yes. Users can call octet_length/bit_length APIs on Scala(Dataframe), Python, and R.
### How was this patch tested?
unit tests
Closes #33992 from yoda-mon/add-bit-octet-length.
Authored-by: Leona Yoda <yodal@oss.nttdata.com>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
### What changes were proposed in this pull request?
This PR fixes a perf issue in `SchemaPruning` when a struct has many fields (e.g. >10K fields).
The root cause is that `SchemaPruning.sortLeftFieldsByRight` does an N * M search:
```
val filteredRightFieldNames = rightStruct.fieldNames
.filter(name => leftStruct.fieldNames.exists(resolver(_, name)))
```
To fix this issue, this PR proposes to use a `HashMap` so lookups run in constant time.
This PR also adds `case _ if left == right => left` to the method as a short circuit.
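A minimal sketch of the idea in Python (the names and the lower-casing stand-in for the resolver are illustrative, not the actual Catalyst code):

```python
# Instead of scanning left field names for every right field (O(N*M)),
# build a hash set keyed by the normalized name once and probe it in O(1).
def filter_right_by_left(left_names, right_names):
    left_set = {name.lower() for name in left_names}     # built once: O(N)
    # one O(1) probe per right field: O(M) total
    return [name for name in right_names if name.lower() in left_set]

print(filter_right_by_left(["a", "B"], ["b", "c", "A"]))  # ['b', 'A']
```

With 50K fields on each side, this turns ~2.5 billion comparisons into ~100K hash probes.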
### Why are the changes needed?
To fix a perf issue.
### Does this PR introduce _any_ user-facing change?
No. The logic should be identical.
### How was this patch tested?
I confirmed that the following micro benchmark finishes within a few seconds.
```
import org.apache.spark.sql.catalyst.expressions.SchemaPruning
import org.apache.spark.sql.types._
var struct1 = new StructType()
(1 to 50000).foreach { i =>
struct1 = struct1.add(new StructField(i + "", IntegerType))
}
var struct2 = new StructType()
(50001 to 100000).foreach { i =>
struct2 = struct2.add(new StructField(i + "", IntegerType))
}
SchemaPruning.sortLeftFieldsByRight(struct1, struct2)
SchemaPruning.sortLeftFieldsByRight(struct2, struct2)
```
The correctness should be checked by existing tests.
Closes #33981 from sarutak/improve-schemapruning-performance.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
SPARK-36456 changed to use `JavaUtils.closeQuietly` instead of `IOUtils.closeQuietly`, but the two methods differ slightly in default behavior: both swallow the `IOException`, but the former logs it as ERROR while the latter does not log by default.
The `Apache commons-io` community decided to retain the `IOUtils.closeQuietly` method in the [new version](75f20dca72/src/main/java/org/apache/commons/io/IOUtils.java (L465-L467)) and removed the deprecated annotation; the change has been released in version 2.11.0.
So this PR upgrades `Apache commons-io` to 2.11.0 and reverts the change of SPARK-36456 to maintain the original behavior (no error log).
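The behavioral difference can be sketched in Python with hypothetical helpers (these are analogues, not the actual Java APIs):

```python
import logging

# JavaUtils-style: swallow the exception, but log it as ERROR.
def close_quietly_logging(closeable, log=logging.getLogger(__name__)):
    try:
        closeable.close()
    except IOError as e:
        log.error("error closing resource", exc_info=e)

# IOUtils-style (default): swallow the exception without logging.
def close_quietly_silent(closeable):
    try:
        closeable.close()
    except IOError:
        pass

class FailingCloseable:
    def close(self):
        raise IOError("disk gone")

close_quietly_silent(FailingCloseable())   # no exception, no log line
close_quietly_logging(FailingCloseable())  # no exception, one ERROR log line
```

The PR restores the silent variant, which was the behavior before SPARK-36456.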
### Why are the changes needed?
1. Upgrade `Apache commons-io` to 2.11.0 to use the non-deprecated `closeQuietly` API; other changes related to `Apache commons-io` are detailed in [commons-io/changes-report](https://commons.apache.org/proper/commons-io/changes-report.html#a2.11.0)
2. Revert the change of SPARK-36737 to maintain the original `IOUtils.closeQuietly` API behavior (no error log).
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Pass the Jenkins or GitHub Action
Closes #33977 from LuciferYang/upgrade-commons-io.
Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
### What changes were proposed in this pull request?
For query
```
select array_union(array(cast('nan' as double), cast('nan' as double)), array())
```
This returns [NaN, NaN], but it should return [NaN].
This issue is caused by `OpenHashSet` being unable to handle `Double.NaN` and `Float.NaN` either.
In this PR we add a wrapper for `OpenHashSet` that can handle `null`, `Double.NaN`, and `Float.NaN` together.
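A minimal Python sketch of the wrapper idea (the class and helper names are illustrative, not the actual Spark code): normalize every NaN to one canonical NaN before it reaches the underlying set, so duplicates collapse into a single entry.

```python
import math

CANONICAL_NAN = float("nan")   # one shared NaN object

def normalize(v):
    # Map every NaN to the same object so identity-based lookups succeed;
    # None and all other values pass through unchanged.
    if isinstance(v, float) and math.isnan(v):
        return CANONICAL_NAN
    return v

class NaNAwareSet:
    """Hash set that deduplicates NaN (and accepts None)."""
    def __init__(self):
        self._set = set()
    def add(self, v):
        self._set.add(normalize(v))
    def __contains__(self, v):
        return normalize(v) in self._set
    def __len__(self):
        return len(self._set)

s = NaNAwareSet()
s.add(float("nan"))
s.add(float("nan"))
s.add(None)
print(len(s))  # 2: the two distinct NaN objects collapsed into one entry
```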
### Why are the changes needed?
Fix bug
### Does this PR introduce _any_ user-facing change?
`array_union` no longer returns duplicated `NaN` values.
### How was this patch tested?
Added UT
Closes #33955 from AngersZhuuuu/SPARK-36702-WrapOpenHashSet.
Lead-authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Co-authored-by: AngersZhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Fix an `InferFiltersFromGenerate` bug: `InferFiltersFromGenerate` should not infer a filter for a generate when its children contain an expression that is an instance of `org.apache.spark.sql.catalyst.expressions.UserDefinedExpression`.
Before this pr, the following case will throw an exception.
```scala
spark.udf.register("vec", (i: Int) => (0 until i).toArray)
sql("select explode(vec(8)) as c1").show
```
```
Once strategy's idempotence is broken for batch Infer Filters
GlobalLimit 21 GlobalLimit 21
+- LocalLimit 21 +- LocalLimit 21
+- Project [cast(c1#3 as string) AS c1#12] +- Project [cast(c1#3 as string) AS c1#12]
+- Generate explode(vec(8)), false, [c1#3] +- Generate explode(vec(8)), false, [c1#3]
+- Filter ((size(vec(8), true) > 0) AND isnotnull(vec(8))) +- Filter ((size(vec(8), true) > 0) AND isnotnull(vec(8)))
! +- OneRowRelation +- Filter ((size(vec(8), true) > 0) AND isnotnull(vec(8)))
! +- OneRowRelation
java.lang.RuntimeException:
Once strategy's idempotence is broken for batch Infer Filters
GlobalLimit 21 GlobalLimit 21
+- LocalLimit 21 +- LocalLimit 21
+- Project [cast(c1#3 as string) AS c1#12] +- Project [cast(c1#3 as string) AS c1#12]
+- Generate explode(vec(8)), false, [c1#3] +- Generate explode(vec(8)), false, [c1#3]
+- Filter ((size(vec(8), true) > 0) AND isnotnull(vec(8))) +- Filter ((size(vec(8), true) > 0) AND isnotnull(vec(8)))
! +- OneRowRelation +- Filter ((size(vec(8), true) > 0) AND isnotnull(vec(8)))
! +- OneRowRelation
at org.apache.spark.sql.errors.QueryExecutionErrors$.onceStrategyIdempotenceIsBrokenForBatchError(QueryExecutionErrors.scala:1200)
at org.apache.spark.sql.catalyst.rules.RuleExecutor.checkBatchIdempotence(RuleExecutor.scala:168)
at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1(RuleExecutor.scala:254)
at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1$adapted(RuleExecutor.scala:200)
at scala.collection.immutable.List.foreach(List.scala:431)
at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:200)
at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$executeAndTrack$1(RuleExecutor.scala:179)
at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:88)
at org.apache.spark.sql.catalyst.rules.RuleExecutor.executeAndTrack(RuleExecutor.scala:179)
at org.apache.spark.sql.execution.QueryExecution.$anonfun$optimizedPlan$1(QueryExecution.scala:138)
at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111)
at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:196)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
at org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:196)
at org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:134)
at org.apache.spark.sql.execution.QueryExecution.optimizedPlan(QueryExecution.scala:130)
at org.apache.spark.sql.execution.QueryExecution.assertOptimized(QueryExecution.scala:148)
at org.apache.spark.sql.execution.QueryExecution.$anonfun$executedPlan$1(QueryExecution.scala:166)
at org.apache.spark.sql.execution.QueryExecution.withCteMap(QueryExecution.scala:73)
at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:163)
at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:163)
at org.apache.spark.sql.execution.QueryExecution.simpleString(QueryExecution.scala:214)
at org.apache.spark.sql.execution.QueryExecution.org$apache$spark$sql$execution$QueryExecution$$explainString(QueryExecution.scala:259)
at org.apache.spark.sql.execution.QueryExecution.explainString(QueryExecution.scala:228)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:98)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3731)
at org.apache.spark.sql.Dataset.head(Dataset.scala:2755)
at org.apache.spark.sql.Dataset.take(Dataset.scala:2962)
at org.apache.spark.sql.Dataset.getRows(Dataset.scala:288)
at org.apache.spark.sql.Dataset.showString(Dataset.scala:327)
at org.apache.spark.sql.Dataset.show(Dataset.scala:807)
```
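A minimal sketch of the guard this PR adds, in Python (the tiny expression classes are illustrative, not Catalyst's): walk the expression tree and skip filter inference if any node is a user-defined expression.

```python
class Expr:
    """Toy expression node with child expressions."""
    def __init__(self, *children):
        self.children = children

class UserDefinedExpr(Expr):
    """Stand-in for UserDefinedExpression."""

def contains_udf(expr):
    # Recursively check the whole subtree for a user-defined expression.
    if isinstance(expr, UserDefinedExpr):
        return True
    return any(contains_udf(c) for c in expr.children)

def should_infer_filters(generator_expr):
    # The guard: never infer filters over expressions containing a UDF,
    # otherwise the inferred filter is re-inferred on every optimizer run
    # and breaks the once-strategy idempotence check.
    return not contains_udf(generator_expr)

explode_of_udf = Expr(UserDefinedExpr())   # like explode(vec(8))
explode_of_col = Expr(Expr())              # like explode(some_column)
print(should_infer_filters(explode_of_udf))  # False
print(should_infer_filters(explode_of_col))  # True
```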
### Does this PR introduce _any_ user-facing change?
No, only bug fix.
### How was this patch tested?
Unit test.
Closes #33956 from cfmcgrady/SPARK-36715.
Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
As [reported on `dev@spark.apache.org`](https://lists.apache.org/thread.html/r84cff66217de438f1389899e6d6891b573780159cd45463acf3657aa%40%3Cdev.spark.apache.org%3E), the published POMs when building with Scala 2.13 have the `scala-parallel-collections` dependency only in the `scala-2.13` profile of the POM.
### What changes were proposed in this pull request?
This PR suggests to work around this by un-commenting the `scala-parallel-collections` dependency when switching to 2.13 using the `change-scala-version.sh` script.
I included an upgrade to scala-parallel-collections version 1.0.3; the changes compared to 0.2.0 are minor:
- removed OSGi metadata
- renamed some internal inner classes
- added `Automatic-Module-Name`
### Why are the changes needed?
According to the posts, this solves issues for developers that write unit tests for their applications.
Stephen Coy suggested using the https://www.mojohaus.org/flatten-maven-plugin. While this sounds like a more principled solution, it is possibly too risky to do at this specific point in time?
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Locally
Closes #33948 from lrytz/parCollDep.
Authored-by: Lukas Rytz <lukas.rytz@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
In the PR, I propose to support a case-insensitive variant of the `LIKE (ALL | ANY | SOME)` expression - `ILIKE`. In this way, Spark's users can match strings against one or more patterns in a case-insensitive manner. For example:
```sql
spark-sql> create table ilike_example(subject varchar(20));
spark-sql> insert into ilike_example values
> ('jane doe'),
> ('Jane Doe'),
> ('JANE DOE'),
> ('John Doe'),
> ('John Smith');
spark-sql> select *
> from ilike_example
> where subject ilike any ('jane%', '%SMITH')
> order by subject;
JANE DOE
Jane Doe
John Smith
jane doe
```
The syntax of `ILIKE` is similar to `LIKE`:
```
str NOT? ILIKE (ANY | SOME | ALL) (pattern+)
```
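For illustration, the intended semantics can be sketched in Python by translating a SQL LIKE pattern (`%` → `.*`, `_` → `.`) to a regex and matching case-insensitively. This is a behavioral sketch, not Spark's implementation.

```python
import re

def like_to_regex(pattern: str) -> str:
    # Translate SQL LIKE metacharacters to a regex; escape everything else.
    out = []
    for ch in pattern:
        if ch == "%":
            out.append(".*")
        elif ch == "_":
            out.append(".")
        else:
            out.append(re.escape(ch))
    return "^" + "".join(out) + "$"

def ilike(s: str, pattern: str) -> bool:
    return re.match(like_to_regex(pattern), s, re.IGNORECASE) is not None

def ilike_any(s, patterns):
    return any(ilike(s, p) for p in patterns)

def ilike_all(s, patterns):
    return all(ilike(s, p) for p in patterns)

rows = ["jane doe", "Jane Doe", "JANE DOE", "John Doe", "John Smith"]
print([r for r in rows if ilike_any(r, ["jane%", "%SMITH"])])
# ['jane doe', 'Jane Doe', 'JANE DOE', 'John Smith']
```

This reproduces the result set of the SQL example above (modulo ordering).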
### Why are the changes needed?
1. To improve user experience with Spark SQL. No need to use `lower(col_name)` in where clauses.
2. To make migration from other popular DBMSs to Spark SQL easier. The DBMSs below support `ilike` in SQL:
- [Snowflake](https://docs.snowflake.com/en/sql-reference/functions/ilike.html#ilike)
- [PostgreSQL](https://www.postgresql.org/docs/12/functions-matching.html)
- [CockroachDB](https://www.cockroachlabs.com/docs/stable/functions-and-operators.html)
### Does this PR introduce _any_ user-facing change?
No, it doesn't. The PR **extends** existing APIs.
### How was this patch tested?
1. By running the expression examples via:
```
$ build/sbt "sql/test:testOnly org.apache.spark.sql.expressions.ExpressionInfoSuite"
```
2. Added new test to test parsing of `ILIKE`:
```
$ build/sbt "test:testOnly *.ExpressionParserSuite"
```
3. Via existing test suites:
```
$ build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite -- -z ilike-any.sql"
$ build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite -- -z ilike-all.sql"
```
Closes #33966 from MaxGekk/ilike-any.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR proposes to support `timestamp_ntz` as the type of the time column for `SessionWindow`, like `TimeWindow` does.
### Why are the changes needed?
For better usability.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
New test.
Closes #33965 from sarutak/session-window-ntz.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?
Fixed wrong documentation for the `cot` function.
### Why are the changes needed?
[Doc](https://spark.apache.org/docs/latest/api/sql/index.html#cot) says `1/java.lang.Math.cot` but it should be `1/java.lang.Math.tan`.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Manual check.
Closes #33978 from yutoacts/SPARK-36738.
Authored-by: Yuto Akutsu <yuto.akutsu@nttdata.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
- move the rule `OptimizeSkewedJoin` from stage optimization phase to stage preparation phase.
- run the rule `EnsureRequirements` one more time after the `OptimizeSkewedJoin` rule in the stage preparation phase.
- add `SkewJoinAwareCost` to support estimating the skewed join cost
- add a new config to decide whether to force-optimize skewed joins
- in `OptimizeSkewedJoin`, we generate 2 physical plans, one with skew join optimization and one without. Then we use the cost evaluator w.r.t. the force-skew-join flag and pick the plan with lower cost.
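A minimal Python sketch of the plan-selection step (the toy cost model and force-flag handling are illustrative stand-ins for `SkewJoinAwareCost` and the new config, not Spark's code):

```python
def choose_plan(original, skew_optimized, cost_fn, force_skew_join):
    # Force mode: prefer the skew-optimized plan even if it adds a shuffle.
    if force_skew_join:
        return skew_optimized
    # Otherwise pick whichever plan the cost model says is cheaper
    # (ties go to the original plan).
    return min((original, skew_optimized), key=cost_fn)

# Toy cost model: a skewed join dominates the cost of one extra shuffle.
def cost(plan):
    return plan["shuffles"] + (10 if plan["skewed_join"] else 0)

original = {"shuffles": 2, "skewed_join": True}
optimized = {"shuffles": 3, "skewed_join": False}
print(choose_plan(original, optimized, cost, force_skew_join=False)["shuffles"])  # 3
```

The toy numbers mirror the PR's rationale: one extra shuffle is cheaper than leaving the join skewed.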
### Why are the changes needed?
In general, a skewed join has more impact on performance than one more shuffle, so it makes sense to force skewed-join optimization even if it introduces an extra shuffle.
A common case:
```
HashAggregate
  SortMergeJoin
    Sort
      Exchange
    Sort
      Exchange
```
and after this PR, the plan looks like:
```
HashAggregate
  Exchange
    SortMergeJoin (isSkew=true)
      Sort
        Exchange
      Sort
        Exchange
```
Note that the newly introduced shuffle can also be optimized by AQE.
### Does this PR introduce _any_ user-facing change?
Yes, a new config.
### How was this patch tested?
* Add new test
* pass existing test `SPARK-30524: Do not optimize skew join if introduce additional shuffle`
* pass existing test `SPARK-33551: Do not use custom shuffle reader for repartition`
Closes #32816 from ulysses-you/support-extra-shuffle.
Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR aims to ensure that HiveThriftServer2Suites (e.g. `thriftserver.UISeleniumSuite`) stop Thrift JDBC server on exit using shutdown hook.
### Why are the changes needed?
Normally, HiveThriftServer2Suites stop the Thrift JDBC server via the `afterAll` method.
But if they are killed by a signal (e.g. Ctrl-C), the Thrift JDBC server will remain running.
```
$ jps
2792969 SparkSubmit
```
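The mechanism can be sketched in Python with `atexit` standing in for a JVM shutdown hook (illustrative only; the actual change registers a shutdown hook on the JVM side):

```python
import atexit

class FakeThriftServer:
    """Toy stand-in for the Thrift JDBC server."""
    def __init__(self):
        self.running = True
    def stop(self):
        self.running = False

server = FakeThriftServer()
# Register the stop so it runs when the process exits, even if the
# normal afterAll-style teardown never gets a chance to execute.
atexit.register(server.stop)
```

In the JVM the analogous API is `Runtime.getRuntime().addShutdownHook(...)`, which also fires on SIGINT (Ctrl-C), whereas a plain SIGKILL bypasses hooks in both runtimes.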
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Killed `thriftserver.UISeleniumSuite` with Ctrl-C and confirmed via jps that no Thrift JDBC server remains.
Closes #33967 from sarutak/stop-thrift-on-exit.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
Co-Authored-By: DB Tsai <d_tsai@apple.com>
Co-Authored-By: Huaxin Gao <huaxin_gao@apple.com>
### What changes were proposed in this pull request?
Add DSV2 Filters and use these in V2 codepath.
### Why are the changes needed?
The motivation of adding DSV2 filters:
1. The values in V1 filters are Scala types. When translating a catalyst `Expression` to V1 filters, we have to call `convertToScala` to convert from the Catalyst types used internally in rows to standard Scala types, and later convert the Scala types back to Catalyst types. This is very inefficient. In V2 filters, we use `Expression` for filter values, so the conversions from Catalyst types to Scala types and back are avoided.
2. Improve nested column filter support.
3. Make the filters work better with the rest of the DSV2 APIs.
### Does this PR introduce _any_ user-facing change?
Yes, the new V2 filters are added.
### How was this patch tested?
new test
Closes #33803 from huaxingao/filter.
Lead-authored-by: Huaxin Gao <huaxin_gao@apple.com>
Co-authored-by: DB Tsai <d_tsai@apple.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
### What changes were proposed in this pull request?
Remove the redundant `_ERROR` suffix in `error-classes.json`.
### Why are the changes needed?
Clean up error classes to reduce clutter
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing tests
Closes #33944 from dgd-contributor/SPARK-36687.
Authored-by: dgd-contributor <dgd_contributor@viettel.com.vn>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This patch proposes to add a few LZ4 wrapper classes for Parquet LZ4 compression output that use the Hadoop LZ4 codec.
### Why are the changes needed?
Currently we use Hadoop 3.3.1's shaded client libraries. lz4 is a provided dependency in Hadoop Common 3.3.1 for `Lz4Codec`, but it isn't excluded from relocation in these libraries. So to use lz4 as a Parquet codec, we hit the exception below even if we include lz4 as a dependency.
```
[info] Cause: java.lang.NoClassDefFoundError: org/apache/hadoop/shaded/net/jpountz/lz4/LZ4Factory
[info] at org.apache.hadoop.io.compress.lz4.Lz4Compressor.<init>(Lz4Compressor.java:66)
[info] at org.apache.hadoop.io.compress.Lz4Codec.createCompressor(Lz4Codec.java:119)
[info] at org.apache.hadoop.io.compress.CodecPool.getCompressor(CodecPool.java:152)
[info] at org.apache.hadoop.io.compress.CodecPool.getCompressor(CodecPool.java:168)
```
Until the issue is fixed in a new Hadoop release, we can add a few wrapper classes for the LZ4 codec.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Modified test.
Closes #33940 from viirya/lz4-wrappers.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
### What changes were proposed in this pull request?
In the PR, I propose to support a case-insensitive variant of the `like` expression - `ilike`. In this way, Spark's users can match strings against a single pattern in a case-insensitive manner. For example:
```sql
spark-sql> create table ilike_ex(subject varchar(20));
spark-sql> insert into ilike_ex values
> ('John Dddoe'),
> ('Joe Doe'),
> ('John_down'),
> ('Joe down'),
> (null);
spark-sql> select *
> from ilike_ex
> where subject ilike '%j%h%do%'
> order by 1;
John Dddoe
John_down
```
The syntax of `ilike` is similar to `like`:
```
str ILIKE pattern [ESCAPE escape]
```
#### Implementation details
`ilike` is implemented as a runtime replaceable expression to `Like(Lower(left), Lower(right), escapeChar)`. Such replacement is acceptable because `ilike`/`like` recognise only `_` and `%` as special characters but not special character classes.
**Note:** The PR aims to support `ilike` in SQL only. Others APIs can be updated separately on demand.
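The equivalence behind the rewrite can be sketched in Python (a behavioral sketch, not the Catalyst implementation): lower-casing both sides is safe precisely because `%` and `_` are unchanged by lower-casing.

```python
import re

def like(s: str, pattern: str) -> bool:
    # SQL LIKE: '%' matches any sequence, '_' matches any single character.
    regex = "^" + "".join(
        ".*" if c == "%" else "." if c == "_" else re.escape(c)
        for c in pattern
    ) + "$"
    return re.match(regex, s) is not None

def ilike(s: str, pattern: str) -> bool:
    # The rewrite described above: ILIKE(str, pat) == LIKE(lower(str), lower(pat)).
    return like(s.lower(), pattern.lower())

print(ilike("John Dddoe", "%j%h%do%"))  # True
print(ilike("Joe Doe", "%j%h%do%"))     # False
```

This matches the SQL example above: `John Dddoe` and `John_down` satisfy `'%j%h%do%'` while `Joe Doe` does not.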
### Why are the changes needed?
1. To improve user experience with Spark SQL. No need to use `lower(col_name)` in where clauses.
2. To make migration from other popular DBMSs to Spark SQL easier. The DBMSs below support `ilike` in SQL:
- [Snowflake](https://docs.snowflake.com/en/sql-reference/functions/ilike.html#ilike)
- [Redshift](https://docs.aws.amazon.com/redshift/latest/dg/r_patternmatching_condition_like.html)
- [PostgreSQL](https://www.postgresql.org/docs/12/functions-matching.html)
- [ClickHouse](https://clickhouse.tech/docs/en/sql-reference/functions/string-search-functions/#ilike)
- [Vertica](https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/LanguageElements/Predicates/LIKE-predicate.htm)
- [Impala](https://docs.cloudera.com/documentation/enterprise/6/6.3/topics/impala_operators.html#ilike)
### Does this PR introduce _any_ user-facing change?
No, it doesn't. The PR **extends** existing APIs.
### How was this patch tested?
1. By running the expression examples via:
```
$ build/sbt "sql/test:testOnly org.apache.spark.sql.expressions.ExpressionInfoSuite"
```
2. Added new test:
```
$ build/sbt "test:testOnly *.RegexpExpressionsSuite"
```
3. Via existing test suites:
```
$ build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite -- -z regexp-functions.sql"
$ build/sbt "test:testOnly *SQLKeywordSuite"
$ build/sbt "sql/testOnly *ExpressionsSchemaSuite"
```
Closes #33919 from MaxGekk/ilike-single-pattern.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Fix `SimplifyConditionalsInPredicate` to be null-safe.
Reproducible example:
```
import org.apache.spark.sql.types.{StructField, BooleanType, StructType}
import org.apache.spark.sql.Row
val schema = List(
StructField("b", BooleanType, true)
)
val data = Seq(
Row(true),
Row(false),
Row(null)
)
val df = spark.createDataFrame(
spark.sparkContext.parallelize(data),
StructType(schema)
)
// cartesian product of true / false / null
val df2 = df.select(col("b") as "cond").crossJoin(df.select(col("b") as "falseVal"))
df2.createOrReplaceTempView("df2")
spark.sql("SELECT * FROM df2 WHERE IF(cond, FALSE, falseVal)").show()
// actual:
// +-----+--------+
// | cond|falseVal|
// +-----+--------+
// |false| true|
// +-----+--------+
spark.sql("SET spark.sql.optimizer.excludedRules=org.apache.spark.sql.catalyst.optimizer.SimplifyConditionalsInPredicate")
spark.sql("SELECT * FROM df2 WHERE IF(cond, FALSE, falseVal)").show()
// expected:
// +-----+--------+
// | cond|falseVal|
// +-----+--------+
// |false| true|
// | null| true|
// +-----+--------+
```
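The root cause can be modeled with SQL three-valued logic in Python (`None` stands for SQL `NULL`). The null-unsafe rewrite shown here (`NOT cond AND falseVal`) is an illustrative simplification, not necessarily the exact rule the optimizer applied:

```python
def sql_not(a):
    return None if a is None else not a

def sql_and(a, b):
    # Three-valued AND: FALSE dominates, then NULL, then TRUE.
    if a is False or b is False:
        return False
    if a is None or b is None:
        return None
    return True

def if_expr(cond, true_val, false_val):
    # SQL IF: a NULL condition takes the else branch.
    return true_val if cond is True else false_val

rows = [(c, f) for c in (True, False, None) for f in (True, False, None)]

correct = [r for r in rows if if_expr(r[0], False, r[1]) is True]
rewritten = [r for r in rows if sql_and(sql_not(r[0]), r[1]) is True]

print(correct)    # [(False, True), (None, True)]
print(rewritten)  # [(False, True)] -- the NULL-cond row is lost
```

This reproduces the missing `(null, true)` row in the expected output above: `IF(NULL, FALSE, TRUE)` is `TRUE`, but `NOT NULL AND TRUE` is `NULL` and gets filtered out.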
### Why are the changes needed?
The bug is a regression that leads to incorrect results.
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
existing tests
Closes #33928 from hypercubestart/fix-SimplifyConditionalsInPredicate.
Authored-by: Andrew Liu <andrewlliu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR adds:
- the support of `TimestampNTZType` in pandas API on Spark.
- the support of Py4J handling of the `spark.sql.timestampType` configuration
### Why are the changes needed?
To complete `TimestampNTZ` support.
In more details:
- ([#33876](https://github.com/apache/spark/pull/33876)) For `TimestampNTZType` in Spark SQL at PySpark, we can successfully ser/de `TimestampNTZType` instances to naive `datetime` (see also https://docs.python.org/3/library/datetime.html#aware-and-naive-objects). How to interpret a naive `datetime` is up to the program, e.g., as a local time or a UTC time. Although some Python built-in APIs assume naive `datetime` is local time in general (see also https://docs.python.org/3/library/datetime.html#datetime.datetime.utcfromtimestamp):
> Because naive datetime objects are treated by many datetime methods as local times ...
semantically it is legitimate to assume:
- that naive `datetime` is mapped to `TimestampNTZType` (unknown timezone).
- if you want to handle them as if in a local timezone, this interpretation matches `TimestampType` (local time)
- ([#33875](https://github.com/apache/spark/pull/33875)) For `TimestampNTZType` in Arrow, they provide the same semantic (see also https://github.com/apache/arrow/blob/master/format/Schema.fbs#L240-L278):
- `Timestamp(..., timezone=sparkLocalTimezone)` -> `TimestampType`
- `Timestamp(..., timezone=null)` -> `TimestampNTZType`
- (this PR) For `TimestampNTZType` in pandas API on Spark, it follows the Python side in general - pandas implements APIs based on an assumption about the time (e.g., a naive `datetime` is a local time or a UTC time).
One example is that pandas allows converting these naive `datetime` values as if they are in UTC by default:
```python
>>> pd.Series(datetime.datetime(1970, 1, 1)).astype("int")
0 0
```
whereas in Spark:
```python
>>> spark.createDataFrame([[datetime.datetime(1970, 1, 1, 0, 0, 0)]]).selectExpr("CAST(_1 as BIGINT)").show()
+------+
| _1|
+------+
|-32400|
+------+
>>> spark.createDataFrame([[datetime.datetime(1970, 1, 1, 0, 0, 0, tzinfo=datetime.timezone.utc)]]).selectExpr("CAST(_1 as BIGINT)").show()
+---+
| _1|
+---+
| 0|
+---+
```
In contrast, some APIs like `pandas.Timestamp.fromtimestamp` assume they are local times:
```python
>>> pd.Timestamp.fromtimestamp(pd.Series(datetime(1970, 1, 1, 0, 0, 0)).astype("int").iloc[0])
Timestamp('1970-01-01 09:00:00')
```
For native Python, users can decide how to interpret a native `datetime`, so it's fine. The problem is that pandas API on Spark would require two implementations of the same pandas behavior for `TimestampType` and `TimestampNTZType` respectively, which might be non-trivial overhead and work.
As far as I know, pandas API on Spark has not yet implemented such ambiguous APIs so they are left as future work.
### Does this PR introduce _any_ user-facing change?
Yes, now pandas API on Spark can handle `TimestampNTZType`.
```python
import datetime
spark.createDataFrame([(datetime.datetime.now(),)], schema="dt timestamp_ntz").to_pandas_on_spark()
```
```
dt
0 2021-08-31 19:58:55.024410
```
This PR also adds the support of Py4J handling with `spark.sql.timestampType` configuration:
```python
>>> lit(datetime.datetime.now())
Column<'TIMESTAMP '2021-09-03 19:34:03.949998''>
```
```python
>>> spark.conf.set("spark.sql.timestampType", "TIMESTAMP_NTZ")
>>> lit(datetime.datetime.now())
Column<'TIMESTAMP_NTZ '2021-09-03 19:34:24.864632''>
```
### How was this patch tested?
Unit tests were added.
Closes #33877 from HyukjinKwon/SPARK-36625.
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Currently, we use `FieldReference` as the aggregate column type; it should be `NamedReference` instead.
### Why are the changes needed?
`FieldReference` is a private class; we should use `NamedReference` instead.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
existing tests
Closes #33927 from huaxingao/agg_followup.
Authored-by: Huaxin Gao <huaxin_gao@apple.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
SPARK-26346 upgraded Parquet-related modules from 1.10.1 to 1.11.1, and `parquet-jackson 1.11.1` uses `com.fasterxml.jackson` instead of `org.codehaus.jackson`.
So there are warning logs like
```
17:12:17.605 WARN org.apache.hadoop.fs.FileSystem: Cannot load filesystem
java.util.ServiceConfigurationError: org.apache.hadoop.fs.FileSystem: Provider org.apache.hadoop.hdfs.web.WebHdfsFileSystem could not be instantiated
...
Caused by: java.lang.ClassNotFoundException: org.codehaus.jackson.map.ObjectMapper
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
...
```
when testing the `sql/core` module with the `hadoop-2.7` profile.
This PR adds test dependencies related to `org.codehaus.jackson` to the `sql/core` module when the `hadoop-2.7` profile is activated.
### Why are the changes needed?
Clean up test warning logs that shouldn't exist.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
- Pass GA or Jenkins Tests.
- Manual test `mvn clean test -pl sql/core -am -DwildcardSuites=none -Phadoop-2.7`
**Before**
No tests failed, but there were warning logs as follows:
```
[INFO] Running test.org.apache.spark.sql.JavaBeanDeserializationSuite
22:42:45.211 WARN org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
22:42:46.827 WARN org.apache.hadoop.fs.FileSystem: Cannot load filesystem
java.util.ServiceConfigurationError: org.apache.hadoop.fs.FileSystem: Provider org.apache.hadoop.hdfs.web.WebHdfsFileSystem could not be instantiated
at java.util.ServiceLoader.fail(ServiceLoader.java:232)
at java.util.ServiceLoader.access$100(ServiceLoader.java:185)
at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:384)
at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:404)
at java.util.ServiceLoader$1.next(ServiceLoader.java:480)
at org.apache.hadoop.fs.FileSystem.loadFileSystems(FileSystem.java:2631)
at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2650)
at org.apache.hadoop.fs.FsUrlStreamHandlerFactory.<init>(FsUrlStreamHandlerFactory.java:62)
at org.apache.spark.sql.internal.SharedState$.liftedTree1$1(SharedState.scala:181)
at org.apache.spark.sql.internal.SharedState$.org$apache$spark$sql$internal$SharedState$$setFsUrlStreamHandlerFactory(SharedState.scala:180)
at org.apache.spark.sql.internal.SharedState.<init>(SharedState.scala:54)
at org.apache.spark.sql.SparkSession.$anonfun$sharedState$1(SparkSession.scala:135)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.SparkSession.sharedState$lzycompute(SparkSession.scala:135)
at org.apache.spark.sql.SparkSession.sharedState(SparkSession.scala:134)
at org.apache.spark.sql.internal.BaseSessionStateBuilder.build(BaseSessionStateBuilder.scala:335)
at org.apache.spark.sql.test.TestSparkSession.sessionState$lzycompute(TestSQLContext.scala:42)
at org.apache.spark.sql.test.TestSparkSession.sessionState(TestSQLContext.scala:41)
at org.apache.spark.sql.SparkSession.$anonfun$new$3(SparkSession.scala:109)
at scala.Option.map(Option.scala:230)
at org.apache.spark.sql.SparkSession.$anonfun$new$1(SparkSession.scala:109)
at org.apache.spark.sql.internal.SQLConf$.get(SQLConf.scala:194)
at org.apache.spark.sql.types.DataType.sameType(DataType.scala:97)
at org.apache.spark.sql.catalyst.analysis.TypeCoercion$.$anonfun$haveSameType$1(TypeCoercion.scala:291)
at org.apache.spark.sql.catalyst.analysis.TypeCoercion$.$anonfun$haveSameType$1$adapted(TypeCoercion.scala:291)
at scala.collection.LinearSeqOptimized.forall(LinearSeqOptimized.scala:85)
at scala.collection.LinearSeqOptimized.forall$(LinearSeqOptimized.scala:82)
at scala.collection.immutable.List.forall(List.scala:89)
at org.apache.spark.sql.catalyst.analysis.TypeCoercion$.haveSameType(TypeCoercion.scala:291)
at org.apache.spark.sql.catalyst.expressions.ComplexTypeMergingExpression.dataTypeCheck(Expression.scala:1074)
at org.apache.spark.sql.catalyst.expressions.ComplexTypeMergingExpression.dataTypeCheck$(Expression.scala:1069)
at org.apache.spark.sql.catalyst.expressions.If.dataTypeCheck(conditionalExpressions.scala:37)
at org.apache.spark.sql.catalyst.expressions.ComplexTypeMergingExpression.org$apache$spark$sql$catalyst$expressions$ComplexTypeMergingExpression$$internalDataType(Expression.scala:1080)
at org.apache.spark.sql.catalyst.expressions.ComplexTypeMergingExpression.org$apache$spark$sql$catalyst$expressions$ComplexTypeMergingExpression$$internalDataType$(Expression.scala:1079)
at org.apache.spark.sql.catalyst.expressions.If.org$apache$spark$sql$catalyst$expressions$ComplexTypeMergingExpression$$internalDataType$lzycompute(conditionalExpressions.scala:37)
at org.apache.spark.sql.catalyst.expressions.If.org$apache$spark$sql$catalyst$expressions$ComplexTypeMergingExpression$$internalDataType(conditionalExpressions.scala:37)
at org.apache.spark.sql.catalyst.expressions.ComplexTypeMergingExpression.dataType(Expression.scala:1084)
at org.apache.spark.sql.catalyst.expressions.ComplexTypeMergingExpression.dataType$(Expression.scala:1084)
at org.apache.spark.sql.catalyst.expressions.If.dataType(conditionalExpressions.scala:37)
at org.apache.spark.sql.catalyst.expressions.objects.MapObjects.$anonfun$dataType$4(objects.scala:815)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.catalyst.expressions.objects.MapObjects.dataType(objects.scala:815)
at org.apache.spark.sql.catalyst.expressions.CreateNamedStruct.$anonfun$dataType$9(complexTypeCreator.scala:416)
at scala.collection.immutable.List.map(List.scala:290)
at org.apache.spark.sql.catalyst.expressions.CreateNamedStruct.dataType$lzycompute(complexTypeCreator.scala:410)
at org.apache.spark.sql.catalyst.expressions.CreateNamedStruct.dataType(complexTypeCreator.scala:409)
at org.apache.spark.sql.catalyst.expressions.CreateNamedStruct.dataType(complexTypeCreator.scala:398)
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.isSerializedAsStruct(ExpressionEncoder.scala:309)
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.isSerializedAsStructForTopLevel(ExpressionEncoder.scala:319)
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.<init>(ExpressionEncoder.scala:248)
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.javaBean(ExpressionEncoder.scala:75)
at org.apache.spark.sql.Encoders$.bean(Encoders.scala:154)
at org.apache.spark.sql.Encoders.bean(Encoders.scala)
at test.org.apache.spark.sql.JavaBeanDeserializationSuite.testBeanWithArrayFieldDeserialization(JavaBeanDeserializationSuite.java:75)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
at org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100)
at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103)
at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63)
at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331)
at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79)
at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329)
at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66)
at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293)
at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
at org.junit.runners.ParentRunner.run(ParentRunner.java:413)
at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:364)
at org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:272)
at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:237)
at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:158)
at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:428)
at org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:162)
at org.apache.maven.surefire.booter.ForkedBooter.run(ForkedBooter.java:562)
at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:548)
Caused by: java.lang.NoClassDefFoundError: org/codehaus/jackson/map/ObjectMapper
at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.<clinit>(WebHdfsFileSystem.java:129)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at java.lang.Class.newInstance(Class.java:442)
at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:380)
... 81 more
Caused by: java.lang.ClassNotFoundException: org.codehaus.jackson.map.ObjectMapper
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:419)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
... 88 more
```
**After**
There are no more warning logs like the above.
Closes#33926 from LuciferYang/SPARK-36684.
Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Currently only DataSource V1 is supported in the `CreateTempViewUsing` command. This PR refactors `DataFrameReader` to reuse the code for creating a DataFrame from a DataSource V2.
### Why are the changes needed?
Improve the support of DataSource V2 in this command
### Does this PR introduce _any_ user-facing change?
It does not change the current behavior; it only adds new functionality
### How was this patch tested?
Unit testing
Closes#33922 from planga82/feature/spark35803_crateview_datasourceV2.
Authored-by: Pablo Langa <soypab@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
To make it easy to use/modify the semantic annotation, we want a shorter API to update the metadata in a dataframe. Currently we have `df.withColumn("col1", col("col1").alias("col1", metadata=metadata))` to update the metadata without changing the column name, which is too verbose. We want a syntax sugar API `df.withMetadata("col1", metadata=metadata)` to achieve the same functionality.
### Why are the changes needed?
A bit of background for the frequency of the update: We are working on inferring the semantic data types and use them in AutoML and store the semantic annotation in the metadata. So in many cases, we will suggest the user update the metadata to correct the wrong inference or add the annotation for weak inference.
### Does this PR introduce _any_ user-facing change?
Yes.
A syntax sugar API `df.withMetadata("col1", metadata=metadata)` that achieves the same functionality as `df.withColumn("col1", col("col1").alias("col1", metadata=metadata))`.
### How was this patch tested?
A unit test in DataFrameSuite.scala.
Closes#33853 from liangz1/withMetadata.
Authored-by: Liang Zhang <liang.zhang@databricks.com>
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
### What changes were proposed in this pull request?
This PR filters out `ExtractValue`s that contain any aggregation function in the `NestedColumnAliasing` rule, to prevent cases where aggregations are pushed down into projections.
### Why are the changes needed?
To handle a corner/missed case in `NestedColumnAliasing` that can cause users to encounter a runtime exception.
Consider the following schema:
```
root
|-- a: struct (nullable = true)
| |-- c: struct (nullable = true)
| | |-- e: string (nullable = true)
| |-- d: integer (nullable = true)
|-- b: string (nullable = true)
```
and the query:
`SELECT MAX(a).c.e FROM (SELECT a, b FROM test_aggregates) GROUP BY b`
Executing the query before this PR will result in the error:
```
java.lang.UnsupportedOperationException: Cannot generate code for expression: max(input[0, struct<c:struct<e:string>,d:int>, true])
at org.apache.spark.sql.errors.QueryExecutionErrors$.cannotGenerateCodeForExpressionError(QueryExecutionErrors.scala:83)
at org.apache.spark.sql.catalyst.expressions.Unevaluable.doGenCode(Expression.scala:312)
at org.apache.spark.sql.catalyst.expressions.Unevaluable.doGenCode$(Expression.scala:311)
at org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression.doGenCode(interfaces.scala:99)
...
```
The optimised plan before this PR is:
```
'Aggregate [b#1], [_extract_e#5 AS max(a).c.e#3]
+- 'Project [max(a#0).c.e AS _extract_e#5, b#1]
+- Relation default.test_aggregates[a#0,b#1] parquet
```
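The guard this PR adds can be sketched with a toy expression tree in Python (illustrative only — the real rule works on Catalyst expressions, not tuples):

```python
# A toy model of the rule's guard: before aliasing a nested field access,
# walk the expression tree and skip candidates that contain an aggregate
# function, so an access chain like max(a).c.e is never pushed down into
# a Project below the Aggregate.
def contains_aggregate(expr):
    kind, *children = expr
    if kind == "agg":
        return True
    return any(contains_aggregate(c) for c in children)

# max(a).c.e modeled as nested tuples: (node_kind, children...)
candidate = ("get_field", ("get_field", ("agg", ("attr",))))
print(contains_aggregate(candidate))                 # True  -> excluded from aliasing
print(contains_aggregate(("get_field", ("attr",))))  # False -> eligible
```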
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
A new unit test in `NestedColumnAliasingSuite`. The test consists of the repro mentioned earlier.
The produced optimized plan is checked for equivalency with a plan of the form:
```
Aggregate [b#452], [max(a#451).c.e AS max('a)[c][e]#456]
+- LocalRelation <empty>, [a#451, b#452]
```
Closes#33921 from vicennial/spark-36677.
Authored-by: Venkata Sai Akhil Gudesa <venkata.gudesa@databricks.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
### What changes were proposed in this pull request?
This patch mainly proposes to add some e2e test cases in Spark for codec used by main datasources.
### Why are the changes needed?
We found there is no e2e test cases available for main datasources like Parquet, Orc. It makes developers harder to identify possible bugs early. We should add such tests in Spark.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Added tests.
Closes#33912 from viirya/SPARK-36670.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
Fix regression in AQEShuffleReadExec when used in conjunction with Spark plugins with custom partitioning.
Signed-off-by: Andy Grove <andygrove73@gmail.com>
### What changes were proposed in this pull request?
Return `UnknownPartitioning` rather than throw an exception in `AQEShuffleReadExec`.
### Why are the changes needed?
The [RAPIDS Accelerator for Apache Spark](https://github.com/NVIDIA/spark-rapids) replaces `AQEShuffleReadExec` with a custom operator that runs on the GPU. Due to changes in [SPARK-36315](dd80457ffb), Spark now throws an exception if the shuffle exchange does not have recognized partitioning, and this happens before the postStageOptimizer rules so there is no opportunity to replace this operator now.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
I am still in the process of testing this change. I will update the PR in the next few days with status.
Closes#33910 from andygrove/SPARK-36666.
Authored-by: Andy Grove <andygrove73@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR proposes to ensure StateStoreProvider instances are properly closed for each test in StateStoreSuite/RocksDBStateStoreSuite.
### Why are the changes needed?
While this doesn't break the test, it is bad practice and may cause nasty problems in the future.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing UTs
Closes#33916 from HeartSaVioR/SPARK-36667.
Authored-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
### What changes were proposed in this pull request?
This PR aims to support `ScriptTransformation` for `timestamp_ntz`.
In the current master, it doesn't work.
```
spark.sql("SELECT transform(col1) USING 'cat' AS (col1 timestamp_ntz) FROM VALUES timestamp_ntz'2021-09-06 20:19:13' t").show(false)
21/09/06 22:03:55 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
org.apache.spark.SparkException: SparkScriptTransformation without serde does not support TimestampNTZType$ as output data type
at org.apache.spark.sql.errors.QueryExecutionErrors$.outputDataTypeUnsupportedByNodeWithoutSerdeError(QueryExecutionErrors.scala:1740)
at org.apache.spark.sql.execution.BaseScriptTransformationExec.$anonfun$outputFieldWriters$1(BaseScriptTransformationExec.scala:245)
at scala.collection.immutable.List.map(List.scala:293)
at org.apache.spark.sql.execution.BaseScriptTransformationExec.org$apache$spark$sql$execution$BaseScriptTransformationExec$$outputFieldWriters(BaseScriptTransformationExec.scala:194)
at org.apache.spark.sql.execution.BaseScriptTransformationExec.org$apache$spark$sql$execution$BaseScriptTransformationExec$$outputFieldWriters$(BaseScriptTransformationExec.scala:194)
at org.apache.spark.sql.execution.SparkScriptTransformationExec.org$apache$spark$sql$execution$BaseScriptTransformationExec$$outputFieldWriters$lzycompute(SparkScriptTransformationExec.scala:38)
at org.apache.spark.sql.execution.SparkScriptTransformationExec.org$apache$spark$sql$execution$BaseScriptTransformationExec$$outputFieldWriters(SparkScriptTransformationExec.scala:38)
at org.apache.spark.sql.execution.BaseScriptTransformationExec$$anon$1.$anonfun$processRowWithoutSerde$1(BaseScriptTransformationExec.scala:121)
at org.apache.spark.sql.execution.BaseScriptTransformationExec$$anon$1.next(BaseScriptTransformationExec.scala:162)
at org.apache.spark.sql.execution.BaseScriptTransformationExec$$anon$1.next(BaseScriptTransformationExec.scala:113)
```
### Why are the changes needed?
For better usability.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
New test.
Closes#33920 from sarutak/script-transformation-timestamp-ntz.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?
Add cotangent support to DataFrame operations (e.g. `df.select(cot($"col"))`).
### Why are the changes needed?
Cotangent has been supported by Spark SQL since 2.3.0, but it cannot be called via DataFrame operations.
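For reference, cotangent is simply `cos(x) / sin(x)` (equivalently `1 / tan(x)`); a minimal sketch of the math the DataFrame function exposes, applied to a plain float for illustration:

```python
import math

# Cotangent: cos(x) / sin(x), i.e. 1 / tan(x). This mirrors the semantics
# of Spark SQL's `cot`, applied elementwise to a column there.
def cot(x: float) -> float:
    return math.cos(x) / math.sin(x)

print(round(cot(math.pi / 4), 10))  # 1.0
```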
### Does this PR introduce _any_ user-facing change?
Yes, users can now call the cotangent function via DataFrame operations.
### How was this patch tested?
unit tests.
Closes#33906 from yutoacts/SPARK-36660.
Lead-authored-by: Yuto Akutsu <yuto.akutsu@nttdata.com>
Co-authored-by: Yuto Akutsu <87687356+yutoacts@users.noreply.github.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
The `Table.capabilities` method returns a `java.util.Set` of the `TableCapability` enumeration type, which is currently implemented using `java.util.HashSet`. Such a Set can be replaced with `java.util.EnumSet`, because `EnumSet` implementations can be much more efficient compared to other sets.
### Why are the changes needed?
Use more appropriate data structures.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- Pass GA or Jenkins Tests.
- Add a new benchmark to compare `create` and `contains` operation between `EnumSet` and `HashSet`
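The performance argument can be sketched outside the JVM: `EnumSet` stores membership as a bit vector over enum ordinals, so `create`/`contains` avoid hashing entirely. A rough Python analogue using `IntFlag` (the enum member names below are made up, not Spark's actual `TableCapability` values):

```python
from enum import IntFlag, auto

# Hypothetical stand-in for an enumeration like TableCapability.
class Capability(IntFlag):
    BATCH_READ = auto()
    BATCH_WRITE = auto()
    STREAMING_READ = auto()

# EnumSet-like membership: one bit per enum constant, so "contains"
# is a constant-time bitwise AND instead of a hash lookup.
caps = Capability.BATCH_READ | Capability.BATCH_WRITE
print(bool(caps & Capability.BATCH_READ))      # True
print(bool(caps & Capability.STREAMING_READ))  # False
```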
Closes#33867 from LuciferYang/SPARK-36613.
Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
This PR removes redundant `asInstanceOf` casts in Spark code.
### Why are the changes needed?
Code simplification
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- Pass GA or Jenkins Tests.
Closes#33852 from LuciferYang/cleanup-asInstanceof.
Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
This PR adds additional information to the ERROR log when SparkConf is modified while `spark.sql.legacy.setCommandRejectsSparkCoreConfs` is set.
### Why are the changes needed?
Right now, `spark.sql.legacy.setCommandRejectsSparkCoreConfs` is set to true by default in Spark 3.x versions in order to avoid changing Spark confs. But the error message leaves it unclear whether Spark confs can be modified in Spark 3.x or not.
### Does this PR introduce _any_ user-facing change?
Yes, a trivial change to the error messages is included.
### How was this patch tested?
New Test added - SPARK-36643: Show migration guide when attempting SparkConf
Closes#33894 from senthh/1st_Sept_2021.
Lead-authored-by: Senthil Kumar <senthh@gmail.com>
Co-authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This PR fixes an issue that `sequence` builtin function causes `ArrayIndexOutOfBoundsException` if the arguments are under the condition of `start == stop && step < 0`.
This is an example.
```
SELECT sequence(timestamp'2021-08-31', timestamp'2021-08-31', -INTERVAL 1 month);
21/09/02 04:14:42 ERROR SparkSQLDriver: Failed in [SELECT sequence(timestamp'2021-08-31', timestamp'2021-08-31', -INTERVAL 1 month)]
java.lang.ArrayIndexOutOfBoundsException: 1
```
Actually, this example succeeded before SPARK-31980 (#28819) was merged.
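The intended semantics can be sketched with a simplified integer model (not Spark's timestamp/interval implementation): when `start == stop`, the result has exactly one element regardless of the step's sign, instead of indexing past the end of the output array.

```python
def sequence(start: int, stop: int, step: int):
    # start == stop: one element, whatever the sign of step.
    if start == stop:
        return [start]
    if step == 0 or (step > 0) != (stop > start):
        raise ValueError("step is zero or points away from stop")
    out, v = [], start
    while (v <= stop) if step > 0 else (v >= stop):
        out.append(v)
        v += step
    return out

print(sequence(5, 5, -1))  # [5], instead of an ArrayIndexOutOfBoundsException
print(sequence(1, 7, 2))   # [1, 3, 5, 7]
```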
### Why are the changes needed?
Bug fix.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
New tests.
Closes#33895 from sarutak/fix-sequence-issue.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
### What changes were proposed in this pull request?
Promote spark.sql.execution.topKSortFallbackThreshold to a user-facing config
### Why are the changes needed?
`spark.sql.execution.topKSortFallbackThreshold` is currently an internal config hidden from users, with `Integer.MAX_VALUE - 15` as its default. In many real-world cases, if K is very big, there would be performance issues.
It's better to leave this choice to users.
### Does this PR introduce _any_ user-facing change?
spark.sql.execution.topKSortFallbackThreshold is now user-facing
### How was this patch tested?
passing GA
Closes#33904 from yaooqinn/SPARK-36659.
Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Kent Yao <yao@apache.org>
### What changes were proposed in this pull request?
This PR proposes to improve `DataSourceStrategy` to be able to push down boolean column filters. Currently boolean column filters do not get pushed down and may cause unnecessary IO.
### Why are the changes needed?
The following query does not push down the filter in the current implementation
```
SELECT * FROM t WHERE boolean_field
```
although the following query pushes down the filter as expected.
```
SELECT * FROM t WHERE boolean_field = true
```
This is because the Physical Planner (`DataSourceStrategy`) currently only pushes down limited expression patterns like `EqualTo`.
It is fair for Spark SQL users to expect `boolean_field` performs the same as `boolean_field = true`.
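The gist of the change can be modeled in a few lines of Python (class names loosely mirror Spark's `sources.Filter` hierarchy but are illustrative, not the real API):

```python
from dataclasses import dataclass

@dataclass
class ColumnRef:
    name: str
    dtype: str

@dataclass
class EqualTo:           # stand-in for a sources.EqualTo filter
    attribute: str
    value: object

def translate_predicate(expr):
    # A bare boolean attribute is treated as `attr = true`,
    # so it can be pushed down like an explicit EqualTo.
    if isinstance(expr, ColumnRef) and expr.dtype == "boolean":
        return EqualTo(expr.name, True)
    return None  # unsupported: stays as a post-scan filter

print(translate_predicate(ColumnRef("boolean_field", "boolean")))
```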
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Added unit tests
```
build/sbt "core/testOnly *DataSourceStrategySuite -- -z SPARK-36644"
```
Closes#33898 from kazuyukitanimura/SPARK-36644.
Authored-by: Kazuyuki Tanimura <ktanimura@apple.com>
Signed-off-by: DB Tsai <d_tsai@apple.com>
### What changes were proposed in this pull request?
Currently `DynamicJoinSelection` has two features: 1. demote broadcast hash join, and 2. promote shuffled hash join. Both are achieved by adding a join hint to the query plan, and both only work for equi-joins. However, [the rule currently matches on the `Join` operator](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/DynamicJoinSelection.scala#L71), so it would add hints for non-equi joins by mistake (see the added test query in `JoinHintSuite.scala` for an example).
This PR fixes `DynamicJoinSelection` to only apply to equi-joins, and improves `checkHintNonEquiJoin` to check that we do not add `PREFER_SHUFFLE_HASH` for non-equi joins.
### Why are the changes needed?
Improve the logic of codebase to be better.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Added unit test in `JoinHintSuite.scala`.
Closes#33899 from c21/aqe-test.
Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
Currently in `V2ScanRelationPushDown`, we push the filters (partition filters + data filters) to the file source, then pass all of them as post-scan filters to the v2 Scan; later, in `PruneFileSourcePartitions`, we separate partition filters and data filters and set them, as `Expression`s, on the file source.
Changes in this PR:
When we push filters to file sources in `V2ScanRelationPushDown`, since we already have the information about the partition columns, we want to separate partition filters and data filters there.
The benefit of doing this:
- we can handle all the filter related work for v2 file source at one place instead of two (`V2ScanRelationPushDown` and `PruneFileSourcePartitions`), so the code will be cleaner and easier to maintain.
- we actually have to separate partition filters and data filters at `V2ScanRelationPushDown`, otherwise, there is no way to find out which filters are partition filters, and we can't push down aggregate for parquet even if we only have partition filter.
- By separating the filters early at `V2ScanRelationPushDown`, we only need to check the data filters to find out which ones need to be converted to data source filters (e.g. Parquet predicates, ORC predicates) and pushed down to the file source; right now we are checking all the filters (both partition filters and data filters).
- Similarly, we can only pass data filters as post scan filters to v2 Scan, because partition filters are used for partition pruning only, no need to pass them as post scan filters.
In order to do this, we will have the following changes
- add `pushFilters` in file source v2. In this method:
- push both Expression partition filter and Expression data filter to file source. Have to use Expression filters because we need these for partition pruning.
- data filters are used for filter push down. If the file source needs to push down data filters, it translates them from `Expression` to `sources.Filter`, and then decides which filters to push down.
- partition filters are used for partition pruning.
- file source v2 no longer needs to implement `SupportsPushDownFilters`, because when separating the two types of filters we have already set them on the file data sources; using `SupportsPushDownFilters` to set the filters again would be redundant.
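The separation step described above can be sketched as follows (a toy model — real Spark works on Catalyst `Expression`s, and the column names are made up):

```python
def split_filters(filters, partition_cols):
    # A filter is a partition filter iff every column it references
    # is a partition column; everything else is a data filter.
    part, data = [], []
    for f in filters:
        cols = f["references"]
        (part if cols and cols <= set(partition_cols) else data).append(f)
    return part, data

filters = [
    {"expr": "p_date = '2021-09-01'", "references": {"p_date"}},
    {"expr": "amount > 100", "references": {"amount"}},
]
part, data = split_filters(filters, ["p_date"])
print([f["expr"] for f in part])  # only the partition filter
print([f["expr"] for f in data])  # only the data filter
```

Only `data` would then be considered for translation to `sources.Filter` and for post-scan filtering; `part` is used solely for partition pruning.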
### Why are the changes needed?
see section one
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing tests
Closes#33650 from huaxingao/partition_filter.
Authored-by: Huaxin Gao <huaxin_gao@apple.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
### What changes were proposed in this pull request?
This PR aims to update comments in `gen-sql-config-docs.py`.
### Why are the changes needed?
To make it up to date according to Spark version 3.2.0 release.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
N/A.
Closes#33902 from williamhyun/fixtool.
Authored-by: William Hyun <william@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
The two cases below of using an undefined window specification should produce a proper error message.
1. Using an undefined window specification with a window function
```
SELECT nth_value(employee_name, 2) OVER w second_highest_salary
FROM basic_pays;
```
The original error message is
```
Window function nth_value(employee_name#x, 2, false) requires an OVER clause.
```
This is confusing: the query uses a window specification `w`, but `w` is simply not defined.
Now the error message is
```
Window specification w is not defined in the WINDOW clause.
```
2. Using an undefined window specification with an aggregate function
```
SELECT SUM(salary) OVER w sum_salary
FROM basic_pays;
```
The original error message is
```
Error in query: unresolved operator 'Aggregate [unresolvedwindowexpression(sum(salary#2), WindowSpecReference(w)) AS sum_salary#34]
+- SubqueryAlias spark_catalog.default.basic_pays
+- HiveTableRelation [`default`.`employees`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [name#0, dept#1, salary#2, age#3], Partition Cols: []]
```
In this case, when converting the global Aggregate, we should skip `UnresolvedWindowExpression`.
Now the error message is
```
Window specification w is not defined in the WINDOW clause.
```
### Why are the changes needed?
Provide proper error messages
### Does this PR introduce _any_ user-facing change?
Yes, error messages are improved as described above
### How was this patch tested?
Added UT
Closes#33892 from AngersZhuuuu/SPARK-36637.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR adds a test for SPARK-36400 (#33743).
### Why are the changes needed?
SPARK-36512 (#33741) was fixed so we can add this test now.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
New test.
Closes#33885 from sarutak/add-reduction-test.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
### What changes were proposed in this pull request?
This PR proposes to implement `TimestampNTZType` support in PySpark's `SparkSession.createDataFrame`, `DataFrame.toPandas`, Python UDFs, and pandas UDFs with and without Arrow.
### Why are the changes needed?
To complete `TimestampNTZType` support.
### Does this PR introduce _any_ user-facing change?
Yes.
- Users now can use `TimestampNTZType` type in `SparkSession.createDataFrame`, `DataFrame.toPandas`, Python UDFs, and pandas UDFs with and without Arrow.
- If `spark.sql.timestampType` is configured to `TIMESTAMP_NTZ`, PySpark will infer the `datetime` without timezone as `TimestampNTZType`. If it has a timezone, it will be inferred as `TimestampType` in `SparkSession.createDataFrame`.
- If `TimestampType` and `TimestampNTZType` conflict during merging inferred schema, `TimestampType` has a higher precedence.
- If the type is `TimestampNTZType`, treat this internally as an unknown timezone, and compute w/ UTC (same as JVM side), and avoid localization externally.
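The inference rule in the second bullet can be sketched like this (the type names are plain strings standing in for PySpark's `TimestampType`/`TimestampNTZType`):

```python
from datetime import datetime, timezone

def infer_timestamp_type(value: datetime, prefer_ntz: bool) -> str:
    # A naive datetime maps to TIMESTAMP_NTZ only when
    # spark.sql.timestampType is configured to TIMESTAMP_NTZ;
    # an aware datetime always maps to TIMESTAMP.
    if value.tzinfo is None and prefer_ntz:
        return "TimestampNTZType"
    return "TimestampType"

print(infer_timestamp_type(datetime(2021, 9, 6, 20, 19), prefer_ntz=True))
print(infer_timestamp_type(datetime(2021, 9, 6, tzinfo=timezone.utc), prefer_ntz=True))
```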
### How was this patch tested?
Manually tested and unittests were added.
Closes#33876 from HyukjinKwon/SPARK-36626.
Lead-authored-by: Hyukjin Kwon <gurwls223@apache.org>
Co-authored-by: Dominik Gehl <dog@open.ch>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR proposes to add `BooleanType` support to the `UnwrapCastInBinaryComparison` optimizer, which currently supports `NumericType` only.
The main idea is to treat `BooleanType` as a 1-bit integer so that we can utilize all optimizations already defined in `UnwrapCastInBinaryComparison`.
This work is an extension of SPARK-24994 and SPARK-32858
### Why are the changes needed?
Current implementation of Spark without this PR cannot properly optimize the filter for the following case
```
SELECT * FROM t WHERE boolean_field = 2
```
The above query creates a filter of `cast(boolean_field, int) = 2`. The cast prevents the filter from being pushed down. In contrast, this PR folds the filter to `false` and returns early, as there cannot be any matching rows anyway (empty result).
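Treating the boolean as a 1-bit integer, the unwrap logic for equality can be sketched as (illustrative, not the optimizer's actual code):

```python
def unwrap_bool_eq(literal: int):
    # Unwraps cast(boolean_field, int) = literal, where a boolean
    # can only ever be 0 or 1 when widened to an integer.
    if literal == 0:
        return ("EqualTo", False)  # boolean_field = false
    if literal == 1:
        return ("EqualTo", True)   # boolean_field = true
    return ("Literal", False)      # out of the 0..1 range: always false

print(unwrap_bool_eq(2))  # ('Literal', False) -> empty result, no scan needed
```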
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Passed existing tests
```
build/sbt "catalyst/test"
build/sbt "sql/test"
```
Added unit tests
```
build/sbt "catalyst/testOnly *UnwrapCastInBinaryComparisonSuite -- -z SPARK-36607"
build/sbt "sql/testOnly *UnwrapCastInComparisonEndToEndSuite -- -z SPARK-36607"
```
Closes#33865 from kazuyukitanimura/SPARK-36607.
Authored-by: Kazuyuki Tanimura <ktanimura@apple.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This change creates a new type of trigger, Trigger.AvailableNow, for streaming queries. It is like Trigger.Once, which processes all available data and then stops the query, but with better scalability, since data can be processed in multiple batches instead of one.
To achieve this, this change proposes a new interface `SupportsTriggerAvailableNow`, which is an extension of `SupportsAdmissionControl`. It has one method, `prepareForTriggerAvailableNow`, which will be called at the beginning of streaming queries with Trigger.AvailableNow, to let the source record the offset for the current latest data at the time (a.k.a. the target offset for the query). The source should then behave as if there is no new data coming in after the beginning of the query, i.e., the source will not return an offset higher than the target offset when `latestOffset` is called.
This change also updates `FileStreamSource` to be an implementation of `SupportsTriggerAvailableNow`.
For other sources that do not implement `SupportsTriggerAvailableNow`, this change also provides a new class `FakeLatestOffsetSupportsTriggerAvailableNow`, which wraps such sources and makes them support Trigger.AvailableNow by overriding their `latestOffset` method to always return the latest offset at the beginning of the query.
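The wrapper's behavior can be modeled minimally (the class and method names here are illustrative Python, not Spark's actual API):

```python
class AvailableNowWrapper:
    """Snapshot the source's latest offset at query start and never
    report anything beyond it, so the query drains that data in
    batches and then stops."""
    def __init__(self, source):
        self.source = source
        self.target = source.latest_offset()  # target offset for the query

    def latest_offset(self):
        return min(self.source.latest_offset(), self.target)

class StubSource:
    def __init__(self):
        self.offset = 10
    def latest_offset(self):
        return self.offset

src = StubSource()
wrapped = AvailableNowWrapper(src)
src.offset = 25                  # new data arrives after query start
print(wrapped.latest_offset())   # still 10: capped at the snapshot
```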
### Why are the changes needed?
Currently, streaming queries with Trigger.Once will always load all of the available data in a single batch. Because of this, the amount of data a query can process is limited, or the Spark driver will run out of memory.
### Does this PR introduce _any_ user-facing change?
Users will be able to use Trigger.AvailableNow (to process all available data then stop the streaming query) with this change.
### How was this patch tested?
Added unit tests.
Closes#33763 from bozhang2820/new-trigger.
Authored-by: Bo Zhang <bo.zhang@databricks.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
### What changes were proposed in this pull request?
This PR proposes to add the support of `TimestampNTZType` in Arrow APIs.
Now, Arrow can write `TimestampNTZType` as Timestamp with `null` timezone in Arrow.
### Why are the changes needed?
To complete the support of `TimestampNTZType` in Apache Spark.
### Does this PR introduce _any_ user-facing change?
Yes, the Arrow APIs (`ArrowColumnVector`) can now write `TimestampNTZType`
### How was this patch tested?
Unit tests were added.
Closes#33875 from HyukjinKwon/SPARK-36608-arrow.
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR proposes to fix bugs around prefix-scan for both HDFS backed state store and RocksDB state store.
> HDFS backed state store
We did a "shallow copy" when copying the prefix map, which causes the values of the prefix map (mutable Sets) to be the same instances across multiple versions. This PR fixes it by creating a new mutable Set and copying the elements over.
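The hazard can be reproduced in plain Scala (an illustration of the shallow-copy problem, not the state store code itself):

```scala
import scala.collection.mutable

val v1 = mutable.HashMap("prefix" -> mutable.Set("k1"))

// Shallow copy: a new map, but the same mutable Set instances are shared.
val shallow = v1.clone()
shallow("prefix") += "k2"                 // also visible through v1!

// The fix: allocate a fresh Set per entry when copying the prefix map.
val deep = mutable.HashMap(v1.toSeq.map { case (k, s) => k -> s.clone() }: _*)
deep("prefix") += "k3"                    // v1 is unaffected

println(v1("prefix").contains("k2"))      // true  — the shallow copy leaked the write
println(v1("prefix").contains("k3"))      // false — the deep copy isolated it
```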
> RocksDB state store
Prefix-scan iterators are only closed on RocksDB.rollback(), which is only called in RocksDBStateStore.abort().
While `RocksDBStateStore.abort()` is called for streaming session window (since it has separate physical plans for read and write), other stateful operators, which have a single read-write physical plan, call either commit or abort and do not close the iterators on commit. These unclosed iterators can be "reused" and produce incorrect outputs.
This PR ensures that prefix-scan iterators are also reset when loading RocksDB, which was previously done only on rollback.
### Why are the changes needed?
Please refer to the section above for an explanation of the bugs and their fixes.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Modified a UT so that it fails without this PR and passes with it.
Closes#33870 from HeartSaVioR/SPARK-36619.
Authored-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?
Remove unused listener in HiveThriftServer2AppStatusStore.
### Why are the changes needed?
`HiveThriftServer2AppStatusStore` provides a view of a KVStore with methods that make it easy to query SQL-specific state.
`HiveThriftServer2Listener` serves no purpose in this class and should not appear in it, so we can remove it.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Closes#33866 from timothy65535/master.
Authored-by: timothy65535 <timothy65535@163.com>
Signed-off-by: Kent Yao <yao@apache.org>
### What changes were proposed in this pull request?
Spark SQL includes a data source that can read data from other databases using JDBC.
Spark also supports the case-insensitive option `pushDownPredicate`.
According to http://spark.apache.org/docs/latest/sql-data-sources-jdbc.html, if `pushDownPredicate` is set to false, no filter will be pushed down to the JDBC data source, and thus all filters will be handled by Spark.
However, filters were still being pushed down to the JDBC data source.
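A toy Scala model (hypothetical helper names, not Spark's implementation) of the guard the fix enforces: when the case-insensitive option is false, no filters may be handed to the JDBC source, so Spark must evaluate them all itself.

```scala
final case class Filter(expr: String)

// Decide which filters may be pushed to the source, honouring the
// case-insensitive pushDownPredicate option (enabled by default).
def selectPushedFilters(options: Map[String, String], filters: Seq[Filter]): Seq[Filter] = {
  val push = options
    .collectFirst { case (k, v) if k.equalsIgnoreCase("pushDownPredicate") => v.toBoolean }
    .getOrElse(true)
  if (push) filters else Seq.empty    // this guard is what the bug was missing
}

val filters = Seq(Filter("age > 30"))
println(selectPushedFilters(Map("PUSHDOWNPREDICATE" -> "false"), filters))  // List()
println(selectPushedFilters(Map.empty, filters))                            // List(Filter(age > 30))
```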
### Why are the changes needed?
Fixes a bug where `pushDownPredicate=false` failed to prevent filters from being pushed down to the JDBC data source.
### Does this PR introduce _any_ user-facing change?
No. The output of queries will not change.
### How was this patch tested?
Jenkins test.
Closes#33822 from beliefer/SPARK-36574.
Authored-by: gengjiaan <gengjiaan@360.cn>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
The `try_add` function allows the following inputs:
- number, number
- date, number
- date, interval
- timestamp, interval
- interval, interval
And, the `try_divide` function allows the following inputs:
- number, number
- interval, number
However, in the current code there are only examples and tests for the (number, number) inputs. We should enhance the docs so that users know these functions can be used for datetime and interval operations too.
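As background, the `try_` functions return NULL instead of failing. A plain Scala analogue (not Spark's implementation, and covering only the numeric overflow case, not the datetime/interval inputs discussed above) is:

```scala
// Model try_add's null-on-overflow semantics with Option:
// a SQL NULL result corresponds to None here.
def tryAdd(a: Long, b: Long): Option[Long] =
  try Some(Math.addExact(a, b))
  catch { case _: ArithmeticException => None }

println(tryAdd(1L, 2L))               // Some(3)
println(tryAdd(Long.MaxValue, 1L))    // None — overflow yields NULL in SQL
```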
### Why are the changes needed?
Improve documentation and tests.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
New UT
Also built the docs for a preview:
![image](https://user-images.githubusercontent.com/1097932/131212897-8aea14c8-a882-4e12-94e2-f56bde7c0367.png)
Closes#33861 from gengliangwang/enhanceTryDoc.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR makes a minor change in the file SaveAsHiveFile.scala:
1. Dropping `getParent` from the following part of the code:
```
if (extURI.getScheme == "viewfs") {
  getExtTmpPathRelTo(path.getParent, hadoopConf, stagingDir)
```
### Why are the changes needed?
Hive creates `.staging` directories inside the `/db/table` location, but Spark SQL creates `.staging` directories inside the `/db/` location when Hadoop federation (viewfs) is used. For other filesystems such as HDFS, Spark works as expected (creating `.staging` inside the `/db/table/` location).
In Hive:
```
beeline
> use dicedb;
> insert into table part_test partition (j=1) values (1);
...
INFO : Loading data to table dicedb.part_test partition (j=1) from **viewfs://cloudera/user/daisuke/dicedb/part_test/j=1/.hive-staging_hive_2021-07-19_13-04-44_989_6775328876605030677-1/-ext-10000**
```
But Spark's behaviour:
```
spark-sql> use dicedb;
spark-sql> insert into table part_test partition (j=2) values (2);
21/07/19 13:07:37 INFO FileUtils: Creating directory if it doesn't exist: **viewfs://cloudera/user/daisuke/dicedb/.hive-staging_hive_2021-07-19_13-07-37_317_5083528872437596950-1**
...
```
The reason this change is required: if Spark SQL is allowed to create the `.staging` directory inside the `/db/` location, we end up with security issues, because permission on the `viewfs:///db/` location would have to be granted to all users who submit Spark jobs.
After this change, Spark SQL creates `.staging` inside `/db/table/`, similar to Hive, as shown below:
```
spark-sql> use dicedb;
21/07/28 00:22:47 INFO SparkSQLCLIDriver: Time taken: 0.929 seconds
spark-sql> insert into table part_test partition (j=8) values (8);
21/07/28 00:23:25 INFO HiveMetaStoreClient: Closed a connection to metastore, current connections: 1
21/07/28 00:23:26 INFO FileUtils: Creating directory if it doesn't exist: **viewfs://cloudera/user/daisuke/dicedb/part_test/.hive-staging_hive_2021-07-28_00-23-26_109_4548714524589026450-1**
```
The reason this issue occurs in Spark SQL but not in Hive:
In Hive, the `/db/table/tmp` directory structure is passed as the path, so `path.getParent` returns `/db/table/`. In Spark, only `/db/table` is passed, so using `path.getParent` is not appropriate for Hadoop federation (viewfs).
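A minimal illustration (plain `java.nio.file`, not Spark's Hadoop `Path` code) of why taking the parent of the table location lands the staging directory at the database level, using the paths from the logs above:

```scala
import java.nio.file.Paths

// Hive passes /db/table/tmp, so getParent still points at the table.
val hivePath  = Paths.get("/user/daisuke/dicedb/part_test/tmp")
// Spark passes /db/table directly, so getParent climbs to the database.
val sparkPath = Paths.get("/user/daisuke/dicedb/part_test")

println(hivePath.getParent)    // /user/daisuke/dicedb/part_test — staging under the table
println(sparkPath.getParent)   // /user/daisuke/dicedb           — staging under the database
```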
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Tested manually by creating hive-sql.jar
Closes#33577 from senthh/viewfs-792392.
Authored-by: senthilkumarb <senthilkumarb@cloudera.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>