Commit graph

11773 commits

Author SHA1 Message Date
Chao Sun a927b0836b [SPARK-36726] Upgrade Parquet to 1.12.1
### What changes were proposed in this pull request?

Upgrade Apache Parquet to 1.12.1

### Why are the changes needed?

Parquet 1.12.1 contains the following bug fixes:
- PARQUET-2064: Make Range public accessible in RowRanges
- PARQUET-2022: ZstdDecompressorStream should close `zstdInputStream`
- PARQUET-2052: Integer overflow when writing huge binary using dictionary encoding
- PARQUET-1633: Fix integer overflow
- PARQUET-2054: fix TCP leaking when calling ParquetFileWriter.appendFile
- PARQUET-2072: Do Not Determine Both Min/Max for Binary Stats
- PARQUET-2073: Fix estimate remaining row count in ColumnWriteStoreBase
- PARQUET-2078: Failed to read parquet file after writing with the same

In particular, PARQUET-2078 is a blocker for the upcoming Apache Spark 3.2.0 release.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing tests + a new test for the issue in SPARK-36696

Closes #33969 from sunchao/upgrade-parquet-12.1.

Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: DB Tsai <d_tsai@apple.com>
2021-09-15 19:17:34 +00:00
Angerszhuuuu b665782f0d [SPARK-36755][SQL] ArraysOverlap should handle duplicated Double.NaN and Float.NaN
### What changes were proposed in this pull request?
For query
```
select arrays_overlap(array(cast('nan' as double), 1d), array(cast('nan' as double)))
```
This returns [false], but it should return [true].
This issue is caused by the fact that `scala.mutable.HashSet` can't handle `Double.NaN` and `Float.NaN`.

### Why are the changes needed?
Fix bug

### Does this PR introduce _any_ user-facing change?
Yes. Previously `arrays_overlap` didn't treat equal `NaN` values as overlapping; now it does.

### How was this patch tested?
Added UT

Closes #34006 from AngersZhuuuu/SPARK-36755.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-09-15 22:31:46 +08:00
Angerszhuuuu 638085953f [SPARK-36702][SQL][FOLLOWUP] ArrayUnion handle duplicated Double.NaN and Float.NaN
### What changes were proposed in this pull request?
According to https://github.com/apache/spark/pull/33955#discussion_r708570515, use normalized NaN.

### Why are the changes needed?
Use normalized NaN for duplicated NaN value

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing UT

Closes #34003 from AngersZhuuuu/SPARK-36702-FOLLOWUP.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-09-15 22:04:09 +08:00
Leona Yoda 0666f5c003 [SPARK-36751][SQL][PYTHON][R] Add bit/octet_length APIs to Scala, Python and R
### What changes were proposed in this pull request?

octet_length: calculate the byte length of strings
bit_length: calculate the bit length of strings
These two string-related functions are only implemented in Spark SQL, not in the Scala, Python, and R APIs.

### Why are the changes needed?

These functions would be useful for users who work with multi-byte characters, mainly in Scala, Python, or R.

### Does this PR introduce _any_ user-facing change?

Yes. Users can call the octet_length/bit_length APIs in Scala (DataFrame), Python, and R, as shown in the sketch below.
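For illustration, a minimal Scala sketch (assuming the new functions are exposed as `org.apache.spark.sql.functions.octet_length`/`bit_length` and a running SparkSession named `spark`):
```scala
import org.apache.spark.sql.functions.{bit_length, octet_length}
import spark.implicits._

// "São" is 3 characters but 4 bytes in UTF-8, so octet_length differs from the character length.
Seq("cat", "São").toDF("s")
  .select($"s", octet_length($"s"), bit_length($"s"))
  .show()
// Expected: cat -> 3 bytes / 24 bits, São -> 4 bytes / 32 bits
```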

### How was this patch tested?

unit tests

Closes #33992 from yoda-mon/add-bit-octet-length.

Authored-by: Leona Yoda <yodal@oss.nttdata.com>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
2021-09-15 16:27:13 +09:00
Kousuke Saruta e43b9e8520 [SPARK-36733][SQL] Fix a perf issue in SchemaPruning when a struct has many fields
### What changes were proposed in this pull request?

This PR fixes a perf issue in `SchemaPruning` when a struct has many fields (e.g. >10K fields).
The root cause is that `SchemaPruning.sortLeftFieldsByRight` performs an O(N * M) search.
```
 val filteredRightFieldNames = rightStruct.fieldNames
    .filter(name => leftStruct.fieldNames.exists(resolver(_, name)))
```

To fix this issue, this PR proposes to use a `HashMap` so that each lookup takes expected constant time.
This PR also adds `case _ if left == right => left` to the method as a short-circuit. A rough sketch of the idea is shown below.
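A minimal sketch of the idea (a hypothetical helper, not the exact patch; assumes case-insensitive matching by lower-casing names):
```scala
import java.util.Locale
import org.apache.spark.sql.types.{StructField, StructType}

// Index the left fields by normalized name once, so matching each right field is an
// expected O(1) lookup instead of a scan over all left fields (O(N * M) overall).
def sortLeftFieldsByRightSketch(left: StructType, right: StructType): Seq[StructField] = {
  if (left == right) return left.fields.toSeq  // short-circuit, as the PR adds
  val leftByName: Map[String, StructField] =
    left.fields.map(f => f.name.toLowerCase(Locale.ROOT) -> f).toMap
  right.fields.toSeq.flatMap(f => leftByName.get(f.name.toLowerCase(Locale.ROOT)))
}
```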

### Why are the changes needed?

To fix a perf issue.

### Does this PR introduce _any_ user-facing change?

No. The logic should be identical.

### How was this patch tested?

I confirmed that the following micro benchmark finishes within a few seconds.
```
import org.apache.spark.sql.catalyst.expressions.SchemaPruning
import org.apache.spark.sql.types._

var struct1 = new StructType()
(1 to 50000).foreach { i =>
  struct1 = struct1.add(new StructField(i + "", IntegerType))
}

var struct2 = new StructType()
(50001 to 100000).foreach { i =>
  struct2 = struct2.add(new StructField(i + "", IntegerType))
}

SchemaPruning.sortLeftFieldsByRight(struct1, struct2)
SchemaPruning.sortLeftFieldsByRight(struct2, struct2)
```

The correctness should be checked by existing tests.

Closes #33981 from sarutak/improve-schemapruning-performance.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-09-15 10:33:58 +09:00
yangjie01 119ddd7e95 [SPARK-36737][BUILD][CORE][SQL][SS] Upgrade Apache commons-io to 2.11.0 and revert change of SPARK-36456
### What changes were proposed in this pull request?
SPARK-36456 changed the code to use `JavaUtils.closeQuietly` instead of `IOUtils.closeQuietly`, but the two methods differ slightly in default behavior: both swallow IOException, but the former logs it at ERROR level while the latter doesn't log by default.

The Apache commons-io community decided to retain the `IOUtils.closeQuietly` method in the [new version](75f20dca72/src/main/java/org/apache/commons/io/IOUtils.java (L465-L467)) and removed the deprecated annotation; the change has been released in version 2.11.0.

So this PR upgrades `Apache commons-io` to 2.11.0 and reverts the change of SPARK-36456 to maintain the original behavior (don't print an error log).

### Why are the changes needed?

1. Upgrade `Apache commons-io` to 2.11.0 to use the non-deprecated `closeQuietly` API; other changes related to `Apache commons-io` are detailed in [commons-io/changes-report](https://commons.apache.org/proper/commons-io/changes-report.html#a2.11.0)

2. Revert the change of SPARK-36456 to maintain the original `IOUtils.closeQuietly` API behavior (don't print an error log).

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass the Jenkins or GitHub Actions tests

Closes #33977 from LuciferYang/upgrade-commons-io.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
2021-09-14 21:16:58 +09:00
Angerszhuuuu f71f37755d [SPARK-36702][SQL] ArrayUnion handle duplicated Double.NaN and Float.NaN
### What changes were proposed in this pull request?
For query
```
select array_union(array(cast('nan' as double), cast('nan' as double)), array())
```
This returns [NaN, NaN], but it should return [NaN].
This issue is caused by the fact that `OpenHashSet` can't handle `Double.NaN` and `Float.NaN` either.
In this PR we add a wrapper for `OpenHashSet` that can handle `null`, `Double.NaN`, and `Float.NaN` together, roughly along the lines of the sketch below.
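A rough illustration of the idea (a hypothetical wrapper, not the exact class added in this PR):
```scala
import scala.collection.mutable

// Store canonical bit patterns so every NaN maps to a single entry, and track null with a flag.
class NaNAwareDoubleSet {
  // java.lang.Double.doubleToLongBits collapses all NaN bit patterns into one canonical value.
  private val bits = mutable.HashSet.empty[Long]
  private var hasNull = false

  def add(d: java.lang.Double): Unit =
    if (d == null) hasNull = true else bits += java.lang.Double.doubleToLongBits(d.doubleValue)

  def contains(d: java.lang.Double): Boolean =
    if (d == null) hasNull else bits.contains(java.lang.Double.doubleToLongBits(d.doubleValue))
}
```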

### Why are the changes needed?
Fix bug

### Does this PR introduce _any_ user-facing change?
`ArrayUnion` won't return duplicated `NaN` values anymore.

### How was this patch tested?
Added UT

Closes #33955 from AngersZhuuuu/SPARK-36702-WrapOpenHashSet.

Lead-authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Co-authored-by: AngersZhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-09-14 18:25:47 +08:00
Fu Chen 52c5ff20ca [SPARK-36715][SQL] InferFiltersFromGenerate should not infer filter for udf
### What changes were proposed in this pull request?

Fix an `InferFiltersFromGenerate` bug: the rule should not infer a filter for a `Generate` when its children contain an expression that is an instance of `org.apache.spark.sql.catalyst.expressions.UserDefinedExpression`.
Before this PR, the following case throws an exception.

```scala
spark.udf.register("vec", (i: Int) => (0 until i).toArray)
sql("select explode(vec(8)) as c1").show
```

```
Once strategy's idempotence is broken for batch Infer Filters
 GlobalLimit 21                                                        GlobalLimit 21
 +- LocalLimit 21                                                      +- LocalLimit 21
    +- Project [cast(c1#3 as string) AS c1#12]                            +- Project [cast(c1#3 as string) AS c1#12]
       +- Generate explode(vec(8)), false, [c1#3]                            +- Generate explode(vec(8)), false, [c1#3]
          +- Filter ((size(vec(8), true) > 0) AND isnotnull(vec(8)))            +- Filter ((size(vec(8), true) > 0) AND isnotnull(vec(8)))
!            +- OneRowRelation                                                     +- Filter ((size(vec(8), true) > 0) AND isnotnull(vec(8)))
!                                                                                     +- OneRowRelation

java.lang.RuntimeException:
Once strategy's idempotence is broken for batch Infer Filters
 GlobalLimit 21                                                        GlobalLimit 21
 +- LocalLimit 21                                                      +- LocalLimit 21
    +- Project [cast(c1#3 as string) AS c1#12]                            +- Project [cast(c1#3 as string) AS c1#12]
       +- Generate explode(vec(8)), false, [c1#3]                            +- Generate explode(vec(8)), false, [c1#3]
          +- Filter ((size(vec(8), true) > 0) AND isnotnull(vec(8)))            +- Filter ((size(vec(8), true) > 0) AND isnotnull(vec(8)))
!            +- OneRowRelation                                                     +- Filter ((size(vec(8), true) > 0) AND isnotnull(vec(8)))
!                                                                                     +- OneRowRelation

	at org.apache.spark.sql.errors.QueryExecutionErrors$.onceStrategyIdempotenceIsBrokenForBatchError(QueryExecutionErrors.scala:1200)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor.checkBatchIdempotence(RuleExecutor.scala:168)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1(RuleExecutor.scala:254)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1$adapted(RuleExecutor.scala:200)
	at scala.collection.immutable.List.foreach(List.scala:431)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:200)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$executeAndTrack$1(RuleExecutor.scala:179)
	at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:88)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor.executeAndTrack(RuleExecutor.scala:179)
	at org.apache.spark.sql.execution.QueryExecution.$anonfun$optimizedPlan$1(QueryExecution.scala:138)
	at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111)
	at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:196)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
	at org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:196)
	at org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:134)
	at org.apache.spark.sql.execution.QueryExecution.optimizedPlan(QueryExecution.scala:130)
	at org.apache.spark.sql.execution.QueryExecution.assertOptimized(QueryExecution.scala:148)
	at org.apache.spark.sql.execution.QueryExecution.$anonfun$executedPlan$1(QueryExecution.scala:166)
	at org.apache.spark.sql.execution.QueryExecution.withCteMap(QueryExecution.scala:73)
	at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:163)
	at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:163)
	at org.apache.spark.sql.execution.QueryExecution.simpleString(QueryExecution.scala:214)
	at org.apache.spark.sql.execution.QueryExecution.org$apache$spark$sql$execution$QueryExecution$$explainString(QueryExecution.scala:259)
	at org.apache.spark.sql.execution.QueryExecution.explainString(QueryExecution.scala:228)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:98)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
	at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3731)
	at org.apache.spark.sql.Dataset.head(Dataset.scala:2755)
	at org.apache.spark.sql.Dataset.take(Dataset.scala:2962)
	at org.apache.spark.sql.Dataset.getRows(Dataset.scala:288)
	at org.apache.spark.sql.Dataset.showString(Dataset.scala:327)
	at org.apache.spark.sql.Dataset.show(Dataset.scala:807)
```
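The fix amounts to a guard of roughly this shape (a hypothetical sketch, not the exact patch):
```scala
import org.apache.spark.sql.catalyst.expressions.{Expression, UserDefinedExpression}

// Skip filter inference when the generator's input contains a user-defined expression,
// so the UDF is not duplicated into an inferred Filter and the rule stays idempotent.
def referencesUserDefinedExpression(e: Expression): Boolean =
  e.find(_.isInstanceOf[UserDefinedExpression]).isDefined
```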

### Does this PR introduce _any_ user-facing change?

No, only bug fix.

### How was this patch tested?

Unit test.

Closes #33956 from cfmcgrady/SPARK-36715.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-09-14 09:26:11 +09:00
Lukas Rytz 1a62e6a2c1 [SPARK-36712][BUILD] Make scala-parallel-collections in 2.13 POM a direct dependency (not in maven profile)
As [reported on `dev@spark.apache.org`](https://lists.apache.org/thread.html/r84cff66217de438f1389899e6d6891b573780159cd45463acf3657aa%40%3Cdev.spark.apache.org%3E), the published POMs when building with Scala 2.13 have the `scala-parallel-collections` dependency only in the `scala-2.13` profile of the pom.

### What changes were proposed in this pull request?

This PR proposes to work around this by un-commenting the `scala-parallel-collections` dependency when switching to 2.13 using the `change-scala-version.sh` script.

I included an upgrade to scala-parallel-collections version 1.0.3; the changes compared to 0.2.0 are minor:
  - removed OSGi metadata
  - renamed some internal inner classes
  - added `Automatic-Module-Name`

### Why are the changes needed?

According to the posts, this solves issues for developers that write unit tests for their applications.

Stephen Coy suggested to use the https://www.mojohaus.org/flatten-maven-plugin. While this sounds like a more principled solution, it is possibly too risky to do at this specific point in time?

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Locally

Closes #33948 from lrytz/parCollDep.

Authored-by: Lukas Rytz <lukas.rytz@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-09-13 11:06:50 -05:00
Max Gekk bd62ad9982 [SPARK-36736][SQL] Support ILIKE (ALL | ANY | SOME) - case insensitive LIKE
### What changes were proposed in this pull request?
In the PR, I propose to support a case-insensitive variant of the `LIKE (ALL | ANY | SOME)` expression - `ILIKE`. In this way, Spark's users can match strings against one or more patterns in a case-insensitive manner. For example:
```sql
spark-sql> create table ilike_example(subject varchar(20));
spark-sql> insert into ilike_example values
         > ('jane doe'),
         > ('Jane Doe'),
         > ('JANE DOE'),
         > ('John Doe'),
         > ('John Smith');
spark-sql> select *
         > from ilike_example
         > where subject ilike any ('jane%', '%SMITH')
         > order by subject;
JANE DOE
Jane Doe
John Smith
jane doe
```

The syntax of `ILIKE` is similar to `LIKE`:
```
str NOT? ILIKE (ANY | SOME | ALL) (pattern+)
```

### Why are the changes needed?
1. To improve user experience with Spark SQL. No need to use `lower(col_name)` in where clauses.
2. To make migration from other popular DBMSs to Spark SQL easier. The DBMSs below support `ilike` in SQL:
    - [Snowflake](https://docs.snowflake.com/en/sql-reference/functions/ilike.html#ilike)
    - [PostgreSQL](https://www.postgresql.org/docs/12/functions-matching.html)
    - [CockroachDB](https://www.cockroachlabs.com/docs/stable/functions-and-operators.html)

### Does this PR introduce _any_ user-facing change?
No, it doesn't. The PR **extends** existing APIs.

### How was this patch tested?
1. By running of expression examples via:
```
$ build/sbt "sql/test:testOnly org.apache.spark.sql.expressions.ExpressionInfoSuite"
```
2. Added new test to test parsing of `ILIKE`:
```
$ build/sbt "test:testOnly *.ExpressionParserSuite"
```
3. Via existing test suites:
```
$ build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite -- -z ilike-any.sql"
$ build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite -- -z ilike-all.sql"
```

Closes #33966 from MaxGekk/ilike-any.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-09-13 22:51:49 +08:00
Kousuke Saruta e858cd568a [SPARK-36724][SQL] Support timestamp_ntz as a type of time column for SessionWindow
### What changes were proposed in this pull request?

This PR proposes to support `timestamp_ntz` as the type of the time column for `SessionWindow`, like `TimeWindow` already does.
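For illustration, a minimal sketch of the expected usage (assuming a running SparkSession `spark`; the column names are placeholders):
```scala
import org.apache.spark.sql.functions.{count, session_window}

// Group events that arrive within 5 minutes of each other into one session,
// keyed on a timestamp_ntz event-time column.
val events = spark.sql(
  "SELECT timestamp_ntz'2021-09-13 10:00:00' AS event_time, 'user1' AS user")
events
  .groupBy(session_window(events("event_time"), "5 minutes"), events("user"))
  .agg(count("*"))
  .show(false)
```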

### Why are the changes needed?

For better usability.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New test.

Closes #33965 from sarutak/session-window-ntz.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-09-13 21:47:43 +08:00
Yuto Akutsu 3747cfdb40 [SPARK-36738][SQL][DOC] Fixed the wrong documentation on Cot API
### What changes were proposed in this pull request?

Fixed wrong documentation on Cot API

### Why are the changes needed?

[Doc](https://spark.apache.org/docs/latest/api/sql/index.html#cot) says `1/java.lang.Math.cot` but it should be `1/java.lang.Math.tan`.
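A quick sanity check of the identity (Scala, assuming a running `spark` session):
```scala
// cot(x) = cos(x) / sin(x) = 1 / tan(x); java.lang.Math has tan but no cot,
// hence the corrected wording "1/java.lang.Math.tan".
val x = 1.0
println(1.0 / math.tan(x))              // 0.6420926159343306
spark.sql("SELECT cot(1)").show(false)  // same value
```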

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual check.

Closes #33978 from yutoacts/SPARK-36738.

Authored-by: Yuto Akutsu <yuto.akutsu@nttdata.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-09-13 21:51:29 +09:00
ulysses-you 4a6b2b9fc8 [SPARK-33832][SQL] Support optimize skewed join even if introduce extra shuffle
### What changes were proposed in this pull request?

- move the rule `OptimizeSkewedJoin` from stage optimization phase to stage preparation phase.
- run the rule `EnsureRequirements` one more time after the `OptimizeSkewedJoin` rule in the stage preparation phase.
- add `SkewJoinAwareCost` to support estimate skewed join cost
- add new config to decide if force optimize skewed join
- in `OptimizeSkewedJoin`, we generate 2 physical plans, one with skew join optimization and one without. Then we use the cost evaluator w.r.t. the force-skew-join flag and pick the plan with lower cost.

### Why are the changes needed?

In general, a skewed join has more impact on performance than one more shuffle. It makes sense to force-optimize skewed joins even if doing so introduces an extra shuffle.

A common case:
```
HashAggregate
  SortMergeJoin
    Sort
      Exchange
    Sort
      Exchange
```
and after this PR, the plan looks like:
```
HashAggregate
  Exchange
    SortMergeJoin (isSkew=true)
      Sort
        Exchange
      Sort
        Exchange
```

Note that the newly introduced shuffle can also be optimized by AQE.

### Does this PR introduce _any_ user-facing change?

Yes, a new config.
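For illustration, assuming the new flag is exposed as `spark.sql.adaptive.forceOptimizeSkewedJoin` (the name is an assumption here), it would be enabled alongside the existing AQE skew-join settings:
```scala
// Let OptimizeSkewedJoin pick the skew-optimized plan even when it needs an extra shuffle.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
spark.conf.set("spark.sql.adaptive.forceOptimizeSkewedJoin", "true")  // assumed config name
```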

### How was this patch tested?

* Add new test
* pass existing test `SPARK-30524: Do not optimize skew join if introduce additional shuffle`
* pass existing test `SPARK-33551: Do not use custom shuffle reader for repartition`

Closes #32816 from ulysses-you/support-extra-shuffle.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-09-13 17:21:27 +08:00
Kousuke Saruta c36d70836d [SPARK-36725][SQL][TESTS] Ensure HiveThriftServer2Suites to stop Thrift JDBC server on exit
### What changes were proposed in this pull request?

This PR aims to ensure that HiveThriftServer2Suites (e.g. `thriftserver.UISeleniumSuite`) stop the Thrift JDBC server on exit using a shutdown hook, roughly as sketched below.
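A minimal sketch of the idea (`hiveServer2` is a placeholder for the server instance started by the suite):
```scala
// Register a JVM shutdown hook in the suite setup so the Thrift JDBC server is stopped
// even if the test JVM is killed (e.g. by Ctrl-C) before afterAll runs.
sys.addShutdownHook {
  if (hiveServer2 != null) hiveServer2.stop()
}
```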

### Why are the changes needed?

Normally, HiveThriftServer2Suites stops Thrift JDBC server via `afterAll` method.
But if they are killed by a signal (e.g. Ctrl-C), the Thrift JDBC server will remain running.
```
$ jps
2792969 SparkSubmit
```
### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Killed `thriftserver.UISeleniumSuite` with Ctrl-C and confirmed with jps that no Thrift JDBC server remains.

Closes #33967 from sarutak/stop-thrift-on-exit.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-09-11 15:54:35 -07:00
Huaxin Gao 1f679ed8e9 [SPARK-36556][SQL] Add DSV2 filters
Co-Authored-By: DB Tsai d_tsai@apple.com
Co-Authored-By: Huaxin Gao huaxin_gao@apple.com

### What changes were proposed in this pull request?
Add DSV2 Filters and use these in V2 codepath.

### Why are the changes needed?
The motivation of adding DSV2 filters:
1. The values in V1 filters are Scala types. When translating catalyst `Expression`s to V1 filters, we have to call `convertToScala` to convert from the Catalyst types used internally in rows to standard Scala types, and later convert the Scala types back to Catalyst types. This is very inefficient. In V2 filters, we use `Expression` for filter values, so the conversions between Catalyst types and Scala types are avoided.
2. Improve nested column filter support.
3. Make the filters work better with the rest of the DSV2 APIs.

### Does this PR introduce _any_ user-facing change?
Yes. The new V2 filters

### How was this patch tested?
new test

Closes #33803 from huaxingao/filter.

Lead-authored-by: Huaxin Gao <huaxin_gao@apple.com>
Co-authored-by: DB Tsai <d_tsai@apple.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-09-11 10:12:21 -07:00
dgd-contributor 711577e238 [SPARK-36687][SQL][CORE] Rename error classes with _ERROR suffix
### What changes were proposed in this pull request?
Remove the redundant `_ERROR` suffix in error-classes.json.

### Why are the changes needed?
Clean up error classes  to reduce clutter

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing tests

Closes #33944 from dgd-contributor/SPARK-36687.

Authored-by: dgd-contributor <dgd_contributor@viettel.com.vn>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-09-10 10:00:28 +09:00
Liang-Chi Hsieh 6bcf330191 [SPARK-36669][SQL] Add Lz4 wrappers for Hadoop Lz4 codec
### What changes were proposed in this pull request?

This patch proposes to add a few LZ4 wrapper classes for Parquet Lz4 compression output that uses Hadoop Lz4 codec.

### Why are the changes needed?

Currently we use Hadoop 3.3.1's shaded client libraries. Lz4 is a provided dependency in Hadoop Common 3.3.1 for Lz4Codec, but it isn't excluded from relocation in these libraries. So to use lz4 as a Parquet codec, we will hit the following exception even if we include lz4 as a dependency.

```
[info]   Cause: java.lang.NoClassDefFoundError: org/apache/hadoop/shaded/net/jpountz/lz4/LZ4Factory
[info]   at org.apache.hadoop.io.compress.lz4.Lz4Compressor.<init>(Lz4Compressor.java:66)
[info]   at org.apache.hadoop.io.compress.Lz4Codec.createCompressor(Lz4Codec.java:119)
[info]   at org.apache.hadoop.io.compress.CodecPool.getCompressor(CodecPool.java:152)
[info]   at org.apache.hadoop.io.compress.CodecPool.getCompressor(CodecPool.java:168)
```

Until the issue is fixed in a new Hadoop release, we can add a few wrapper classes for the Lz4 codec.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Modified test.

Closes #33940 from viirya/lz4-wrappers.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-09-09 09:31:00 -07:00
Max Gekk b74a1ba69f [SPARK-36674][SQL] Support ILIKE - case insensitive LIKE
### What changes were proposed in this pull request?
In the PR, I propose to support a case-insensitive variant of the `like` expression - `ilike`. In this way, Spark's users can match strings against a single pattern in a case-insensitive manner. For example:
```sql
spark-sql> create table ilike_ex(subject varchar(20));
spark-sql> insert into ilike_ex values
         > ('John  Dddoe'),
         > ('Joe   Doe'),
         > ('John_down'),
         > ('Joe down'),
         > (null);
spark-sql> select *
         >     from ilike_ex
         >     where subject ilike '%j%h%do%'
         >     order by 1;
John  Dddoe
John_down
```

The syntax of `ilike` is similar to `like`:
```
str ILIKE pattern[ ESCAPE escape]
```

#### Implementation details
`ilike` is implemented as a runtime replaceable expression to `Like(Lower(left), Lower(right), escapeChar)`. Such replacement is acceptable because `ilike`/`like` recognise only `_` and `%` as special characters but not special character classes.

**Note:** The PR aims to support `ilike` in SQL only. Others APIs can be updated separately on demand.

### Why are the changes needed?
1. To improve user experience with Spark SQL. No need to use `lower(col_name)` in where clauses.
2. To make migration from other popular DBMSs to Spark SQL easier. The DBMSs below support `ilike` in SQL:
    - [Snowflake](https://docs.snowflake.com/en/sql-reference/functions/ilike.html#ilike)
    - [Redshift](https://docs.aws.amazon.com/redshift/latest/dg/r_patternmatching_condition_like.html)
    - [PostgreSQL](https://www.postgresql.org/docs/12/functions-matching.html)
    - [ClickHouse](https://clickhouse.tech/docs/en/sql-reference/functions/string-search-functions/#ilike)
    - [Vertica](https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/LanguageElements/Predicates/LIKE-predicate.htm)
    - [Impala](https://docs.cloudera.com/documentation/enterprise/6/6.3/topics/impala_operators.html#ilike)

### Does this PR introduce _any_ user-facing change?
No, it doesn't. The PR **extends** existing APIs.

### How was this patch tested?
1. By running of expression examples via:
```
$ build/sbt "sql/test:testOnly org.apache.spark.sql.expressions.ExpressionInfoSuite"
```
2. Added new test:
```
$ build/sbt "test:testOnly *.RegexpExpressionsSuite"
```
3. Via existing test suites:
```
$ build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite -- -z regexp-functions.sql"
$ build/sbt "test:testOnly *SQLKeywordSuite"
$ build/sbt "sql/testOnly *ExpressionsSchemaSuite"
```

Closes #33919 from MaxGekk/ilike-single-pattern.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-09-09 11:55:20 +08:00
Andrew Liu 9b633f2075 [SPARK-36686][SQL] Fix SimplifyConditionalsInPredicate to be null-safe
### What changes were proposed in this pull request?

fix SimplifyConditionalsInPredicate to be null-safe

Reproducible:

```
import org.apache.spark.sql.types.{StructField, BooleanType, StructType}
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions.col  // needed for col("b") below

val schema = List(
  StructField("b", BooleanType, true)
)
val data = Seq(
  Row(true),
  Row(false),
  Row(null)
)
val df = spark.createDataFrame(
  spark.sparkContext.parallelize(data),
  StructType(schema)
)

// cartesian product of true / false / null
val df2 = df.select(col("b") as "cond").crossJoin(df.select(col("b") as "falseVal"))
df2.createOrReplaceTempView("df2")

spark.sql("SELECT * FROM df2 WHERE IF(cond, FALSE, falseVal)").show()
// actual:
// +-----+--------+
// | cond|falseVal|
// +-----+--------+
// |false|    true|
// +-----+--------+
spark.sql("SET spark.sql.optimizer.excludedRules=org.apache.spark.sql.catalyst.optimizer.SimplifyConditionalsInPredicate")
spark.sql("SELECT * FROM df2 WHERE IF(cond, FALSE, falseVal)").show()
// expected:
// +-----+--------+
// | cond|falseVal|
// +-----+--------+
// |false|    true|
// | null|    true|
// +-----+--------+
```
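The divergence comes from SQL three-valued logic when `cond` is NULL; a quick check, assuming the rule rewrites `IF(cond, FALSE, v)` into roughly `NOT cond AND v`:
```scala
// For cond = NULL and v = TRUE:
//   IF(NULL, FALSE, TRUE)   -> TRUE  (row is kept by WHERE)
//   (NOT NULL) AND TRUE     -> NULL  (row is dropped by WHERE)
spark.sql(
  "SELECT IF(CAST(NULL AS BOOLEAN), FALSE, TRUE) AS if_form, " +
  "(NOT CAST(NULL AS BOOLEAN)) AND TRUE AS rewritten_form").show()
// if_form = true, rewritten_form = null
```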

### Why are the changes needed?

This is a regression that leads to incorrect results.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

existing tests

Closes #33928 from hypercubestart/fix-SimplifyConditionalsInPredicate.

Authored-by: Andrew Liu <andrewlliu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-09-09 11:32:40 +08:00
Hyukjin Kwon 34f80ef313 [SPARK-36625][SPARK-36661][PYTHON] Support TimestampNTZ in pandas API on Spark
### What changes were proposed in this pull request?

This PR adds:
- the support of `TimestampNTZType` in pandas API on Spark.
- the support of Py4J handling of `spark.sql.timestampType` configuration

### Why are the changes needed?

To complete `TimestampNTZ` support.

In more details:

- ([#33876](https://github.com/apache/spark/pull/33876)) For `TimestampNTZType` in Spark SQL at PySpark, we can successfully ser/de `TimestampNTZType` instances to naive `datetime` (see also https://docs.python.org/3/library/datetime.html#aware-and-naive-objects). How to interpret a naive `datetime` is up to the program, e.g. whether it is a local time or a UTC time. Although some Python built-in APIs assume naive datetimes are local time in general (see also https://docs.python.org/3/library/datetime.html#datetime.datetime.utcfromtimestamp):

    > Because naive datetime objects are treated by many datetime methods as local times ...

  semantically it is legitimate to assume:
    - that naive `datetime` is mapped to `TimestampNTZType` (unknown timezone).
    - if you want to handle them as if they were in a local timezone, this interpretation is matched to `TimestampType` (local time)

- ([#33875](https://github.com/apache/spark/pull/33875)) For `TimestampNTZType` in Arrow, they provide the same semantic (see also https://github.com/apache/arrow/blob/master/format/Schema.fbs#L240-L278):
    - `Timestamp(..., timezone=sparkLocalTimezone)` -> `TimestampType`
    - `Timestamp(..., timezone=null)` -> `TimestampNTZType`

- (this PR) For `TimestampNTZType` in pandas API on Spark, it follows Python side in general - pandas implements APIs based on the assumption of time (e.g., naive `datetime` is a local time or a UTC time).

    One example is that pandas allows to convert these naive `datetime` as if they are in UTC by default:

    ```python
    >>> pd.Series(datetime.datetime(1970, 1, 1)).astype("int")
    0    0
    ```

    whereas in Spark:

    ```python
    >>> spark.createDataFrame([[datetime.datetime(1970, 1, 1, 0, 0, 0)]]).selectExpr("CAST(_1 as BIGINT)").show()
    +------+
    |    _1|
    +------+
    |-32400|
    +------+

    >>> spark.createDataFrame([[datetime.datetime(1970, 1, 1, 0, 0, 0, tzinfo=datetime.timezone.utc)]]).selectExpr("CAST(_1 as BIGINT)").show()
    +---+
    | _1|
    +---+
    |  0|
    +---+
    ```

    In contrast, some APIs like `pandas.Timestamp.fromtimestamp` assume they are local times:

    ```python
    >>> pd.Timestamp.fromtimestamp(pd.Series(datetime(1970, 1, 1, 0, 0, 0)).astype("int").iloc[0])
    Timestamp('1970-01-01 09:00:00')
    ```

    For native Python, users can decide how to interpret naive `datetime`, so it's fine. The problem is that the pandas API on Spark would require two implementations of the same pandas behavior for `TimestampType` and `TimestampNTZType` respectively, which might be non-trivial overhead and work.

    As far as I know, pandas API on Spark has not yet implemented such ambiguous APIs so they are left as future work.

### Does this PR introduce _any_ user-facing change?

Yes, now pandas API on Spark can handle `TimestampNTZType`.

```python
import datetime
spark.createDataFrame([(datetime.datetime.now(),)], schema="dt timestamp_ntz").to_pandas_on_spark()
```

```
                          dt
0 2021-08-31 19:58:55.024410
```

This PR also adds the support of Py4J handling with `spark.sql.timestampType` configuration:

```python
>>> lit(datetime.datetime.now())
Column<'TIMESTAMP '2021-09-03 19:34:03.949998''>
```
```python
>>> spark.conf.set("spark.sql.timestampType", "TIMESTAMP_NTZ")
>>> lit(datetime.datetime.now())
Column<'TIMESTAMP_NTZ '2021-09-03 19:34:24.864632''>
```

### How was this patch tested?

Unittests were added.

Closes #33877 from HyukjinKwon/SPARK-36625.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-09-09 09:57:38 +09:00
Huaxin Gao 23794fb303 [SPARK-34952][SQL][FOLLOWUP] Change column type to be NamedReference
### What changes were proposed in this pull request?
Currently, we use `FieldReference` as the aggregate column type; it should be `NamedReference` instead.

### Why are the changes needed?
`FieldReference` is a private class, so we should use `NamedReference` instead.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
existing tests

Closes #33927 from huaxingao/agg_followup.

Authored-by: Huaxin Gao <huaxin_gao@apple.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-09-08 14:05:44 +08:00
yangjie01 acd9c92fa8 [SPARK-36684][SQL][TESTS] Add Jackson test dependencies to sql/core module at hadoop-2.7 profile
### What changes were proposed in this pull request?
SPARK-26346 upgraded the Parquet-related modules from 1.10.1 to 1.11.1, and `parquet-jackson 1.11.1` uses `com.fasterxml.jackson` instead of `org.codehaus.jackson`.

So, there are warning logs related to

```
17:12:17.605 WARN org.apache.hadoop.fs.FileSystem: Cannot load filesystem
java.util.ServiceConfigurationError: org.apache.hadoop.fs.FileSystem: Provider org.apache.hadoop.hdfs.web.WebHdfsFileSystem could not be instantiated
...
Caused by: java.lang.ClassNotFoundException: org.codehaus.jackson.map.ObjectMapper
	at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
...
```

when testing the `sql/core` module with the `hadoop-2.7` profile.

This PR adds test dependencies related to `org.codehaus.jackson` to the `sql/core` module when the `hadoop-2.7` profile is activated.

### Why are the changes needed?
Clean up test warning logs that shouldn't exist.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?

- Pass GA or Jenkins Tests.
- Manual test `mvn clean test -pl sql/core -am -DwildcardSuites=none -Phadoop-2.7`

**Before**

No test failed, but warning logs as follows:

```
[INFO] Running test.org.apache.spark.sql.JavaBeanDeserializationSuite
22:42:45.211 WARN org.apache.hadoop.util.NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
22:42:46.827 WARN org.apache.hadoop.fs.FileSystem: Cannot load filesystem
java.util.ServiceConfigurationError: org.apache.hadoop.fs.FileSystem: Provider org.apache.hadoop.hdfs.web.WebHdfsFileSystem could not be instantiated
	at java.util.ServiceLoader.fail(ServiceLoader.java:232)
	at java.util.ServiceLoader.access$100(ServiceLoader.java:185)
	at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:384)
	at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:404)
	at java.util.ServiceLoader$1.next(ServiceLoader.java:480)
	at org.apache.hadoop.fs.FileSystem.loadFileSystems(FileSystem.java:2631)
	at org.apache.hadoop.fs.FileSystem.getFileSystemClass(FileSystem.java:2650)
	at org.apache.hadoop.fs.FsUrlStreamHandlerFactory.<init>(FsUrlStreamHandlerFactory.java:62)
	at org.apache.spark.sql.internal.SharedState$.liftedTree1$1(SharedState.scala:181)
	at org.apache.spark.sql.internal.SharedState$.org$apache$spark$sql$internal$SharedState$$setFsUrlStreamHandlerFactory(SharedState.scala:180)
	at org.apache.spark.sql.internal.SharedState.<init>(SharedState.scala:54)
	at org.apache.spark.sql.SparkSession.$anonfun$sharedState$1(SparkSession.scala:135)
	at scala.Option.getOrElse(Option.scala:189)
	at org.apache.spark.sql.SparkSession.sharedState$lzycompute(SparkSession.scala:135)
	at org.apache.spark.sql.SparkSession.sharedState(SparkSession.scala:134)
	at org.apache.spark.sql.internal.BaseSessionStateBuilder.build(BaseSessionStateBuilder.scala:335)
	at org.apache.spark.sql.test.TestSparkSession.sessionState$lzycompute(TestSQLContext.scala:42)
	at org.apache.spark.sql.test.TestSparkSession.sessionState(TestSQLContext.scala:41)
	at org.apache.spark.sql.SparkSession.$anonfun$new$3(SparkSession.scala:109)
	at scala.Option.map(Option.scala:230)
	at org.apache.spark.sql.SparkSession.$anonfun$new$1(SparkSession.scala:109)
	at org.apache.spark.sql.internal.SQLConf$.get(SQLConf.scala:194)
	at org.apache.spark.sql.types.DataType.sameType(DataType.scala:97)
	at org.apache.spark.sql.catalyst.analysis.TypeCoercion$.$anonfun$haveSameType$1(TypeCoercion.scala:291)
	at org.apache.spark.sql.catalyst.analysis.TypeCoercion$.$anonfun$haveSameType$1$adapted(TypeCoercion.scala:291)
	at scala.collection.LinearSeqOptimized.forall(LinearSeqOptimized.scala:85)
	at scala.collection.LinearSeqOptimized.forall$(LinearSeqOptimized.scala:82)
	at scala.collection.immutable.List.forall(List.scala:89)
	at org.apache.spark.sql.catalyst.analysis.TypeCoercion$.haveSameType(TypeCoercion.scala:291)
	at org.apache.spark.sql.catalyst.expressions.ComplexTypeMergingExpression.dataTypeCheck(Expression.scala:1074)
	at org.apache.spark.sql.catalyst.expressions.ComplexTypeMergingExpression.dataTypeCheck$(Expression.scala:1069)
	at org.apache.spark.sql.catalyst.expressions.If.dataTypeCheck(conditionalExpressions.scala:37)
	at org.apache.spark.sql.catalyst.expressions.ComplexTypeMergingExpression.org$apache$spark$sql$catalyst$expressions$ComplexTypeMergingExpression$$internalDataType(Expression.scala:1080)
	at org.apache.spark.sql.catalyst.expressions.ComplexTypeMergingExpression.org$apache$spark$sql$catalyst$expressions$ComplexTypeMergingExpression$$internalDataType$(Expression.scala:1079)
	at org.apache.spark.sql.catalyst.expressions.If.org$apache$spark$sql$catalyst$expressions$ComplexTypeMergingExpression$$internalDataType$lzycompute(conditionalExpressions.scala:37)
	at org.apache.spark.sql.catalyst.expressions.If.org$apache$spark$sql$catalyst$expressions$ComplexTypeMergingExpression$$internalDataType(conditionalExpressions.scala:37)
	at org.apache.spark.sql.catalyst.expressions.ComplexTypeMergingExpression.dataType(Expression.scala:1084)
	at org.apache.spark.sql.catalyst.expressions.ComplexTypeMergingExpression.dataType$(Expression.scala:1084)
	at org.apache.spark.sql.catalyst.expressions.If.dataType(conditionalExpressions.scala:37)
	at org.apache.spark.sql.catalyst.expressions.objects.MapObjects.$anonfun$dataType$4(objects.scala:815)
	at scala.Option.getOrElse(Option.scala:189)
	at org.apache.spark.sql.catalyst.expressions.objects.MapObjects.dataType(objects.scala:815)
	at org.apache.spark.sql.catalyst.expressions.CreateNamedStruct.$anonfun$dataType$9(complexTypeCreator.scala:416)
	at scala.collection.immutable.List.map(List.scala:290)
	at org.apache.spark.sql.catalyst.expressions.CreateNamedStruct.dataType$lzycompute(complexTypeCreator.scala:410)
	at org.apache.spark.sql.catalyst.expressions.CreateNamedStruct.dataType(complexTypeCreator.scala:409)
	at org.apache.spark.sql.catalyst.expressions.CreateNamedStruct.dataType(complexTypeCreator.scala:398)
	at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.isSerializedAsStruct(ExpressionEncoder.scala:309)
	at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.isSerializedAsStructForTopLevel(ExpressionEncoder.scala:319)
	at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.<init>(ExpressionEncoder.scala:248)
	at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.javaBean(ExpressionEncoder.scala:75)
	at org.apache.spark.sql.Encoders$.bean(Encoders.scala:154)
	at org.apache.spark.sql.Encoders.bean(Encoders.scala)
	at test.org.apache.spark.sql.JavaBeanDeserializationSuite.testBeanWithArrayFieldDeserialization(JavaBeanDeserializationSuite.java:75)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
	at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
	at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
	at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
	at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26)
	at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27)
	at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
	at org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100)
	at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366)
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103)
	at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63)
	at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331)
	at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79)
	at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329)
	at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66)
	at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293)
	at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306)
	at org.junit.runners.ParentRunner.run(ParentRunner.java:413)
	at org.apache.maven.surefire.junit4.JUnit4Provider.execute(JUnit4Provider.java:364)
	at org.apache.maven.surefire.junit4.JUnit4Provider.executeWithRerun(JUnit4Provider.java:272)
	at org.apache.maven.surefire.junit4.JUnit4Provider.executeTestSet(JUnit4Provider.java:237)
	at org.apache.maven.surefire.junit4.JUnit4Provider.invoke(JUnit4Provider.java:158)
	at org.apache.maven.surefire.booter.ForkedBooter.runSuitesInProcess(ForkedBooter.java:428)
	at org.apache.maven.surefire.booter.ForkedBooter.execute(ForkedBooter.java:162)
	at org.apache.maven.surefire.booter.ForkedBooter.run(ForkedBooter.java:562)
	at org.apache.maven.surefire.booter.ForkedBooter.main(ForkedBooter.java:548)
Caused by: java.lang.NoClassDefFoundError: org/codehaus/jackson/map/ObjectMapper
	at org.apache.hadoop.hdfs.web.WebHdfsFileSystem.<clinit>(WebHdfsFileSystem.java:129)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
	at java.lang.Class.newInstance(Class.java:442)
	at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:380)
	... 81 more
Caused by: java.lang.ClassNotFoundException: org.codehaus.jackson.map.ObjectMapper
	at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:419)
	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:352)
	... 88 more
```

**After**

There are no more warning logs like above

Closes #33926 from LuciferYang/SPARK-36684.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-09-07 21:40:41 -07:00
Pablo Langa feba05f181 [SPARK-35803][SQL] Support DataSource V2 CreateTempViewUsing
### What changes were proposed in this pull request?

Currently only DataSource V1 is supported in the CreateTempViewUsing command. This PR refactors DataFrameReader to reuse the code for the creation of a DataFrame from a DataSource V2.
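For context, this is the kind of statement the command serves (a sketch; the format and path are placeholders):
```scala
// CREATE TEMPORARY VIEW ... USING resolves the named data source; after this PR the
// DataFrame-creation code path can also be reused for DataSource V2 implementations.
spark.sql(
  """CREATE TEMPORARY VIEW people
    |USING json
    |OPTIONS (path '/tmp/people.json')""".stripMargin)
spark.sql("SELECT * FROM people").show()
```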

### Why are the changes needed?

Improve the support of DataSource V2 in this command.

### Does this PR introduce _any_ user-facing change?

It does not change the current behavior; it only adds new functionality.

### How was this patch tested?

Unit testing

Closes #33922 from planga82/feature/spark35803_crateview_datasourceV2.

Authored-by: Pablo Langa <soypab@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-09-08 12:16:29 +08:00
Liang Zhang cb30683b65 [SPARK-36642][SQL] Add df.withMetadata: a syntax sugar to update the metadata of a dataframe
### What changes were proposed in this pull request?

To make it easy to use/modify the semantic annotation, we want to have a shorter API to update the metadata in a dataframe. Currently we have `df.withColumn("col1", col("col1").alias("col1", metadata=metadata))` to update the metadata without changing the column name, and this is too verbose. We want to have a syntax sugar API `df.withMetadata("col1", metadata=metadata)` to achieve the same functionality, as sketched below.
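A sketch of the two forms side by side in Scala (assuming `df` has a column `col1` and the new method has the shape `Dataset.withMetadata(columnName, metadata)`):
```scala
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.MetadataBuilder

val meta = new MetadataBuilder().putString("semantic_type", "categorical").build()

// Existing, verbose form: re-alias the column just to attach metadata.
val verbose = df.withColumn("col1", col("col1").as("col1", meta))

// Proposed sugar.
val concise = df.withMetadata("col1", meta)
```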

### Why are the changes needed?

A bit of background for the frequency of the update: We are working on inferring the semantic data types and use them in AutoML and store the semantic annotation in the metadata. So in many cases, we will suggest the user update the metadata to correct the wrong inference or add the annotation for weak inference.

### Does this PR introduce _any_ user-facing change?

Yes.
A syntax sugar API `df.withMetadata("col1", metadata=metadata)` to achieve the same functionality as `df.withColumn("col1", col("col1").alias("col1", metadata=metadata))`.

### How was this patch tested?

A unit test in DataFrameSuite.scala.

Closes #33853 from liangz1/withMetadata.

Authored-by: Liang Zhang <liang.zhang@databricks.com>
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
2021-09-08 09:35:18 +08:00
Venkata Sai Akhil Gudesa 2ed6e7bc5d [SPARK-36677][SQL] NestedColumnAliasing should not push down aggregate functions into projections
### What changes were proposed in this pull request?

This PR filters out `ExtractValue`s that contain any aggregation function in the `NestedColumnAliasing` rule, to prevent cases where aggregations are pushed down into projections.

### Why are the changes needed?

To handle a corner/missed case in `NestedColumnAliasing` that can cause users to encounter a runtime exception.

Consider the following schema:
```
root
 |-- a: struct (nullable = true)
 |    |-- c: struct (nullable = true)
 |    |    |-- e: string (nullable = true)
 |    |-- d: integer (nullable = true)
 |-- b: string (nullable = true)
```
and the query:
`SELECT MAX(a).c.e FROM (SELECT a, b FROM test_aggregates) GROUP BY b`

Executing the query before this PR will result in the error:
```
java.lang.UnsupportedOperationException: Cannot generate code for expression: max(input[0, struct<c:struct<e:string>,d:int>, true])
  at org.apache.spark.sql.errors.QueryExecutionErrors$.cannotGenerateCodeForExpressionError(QueryExecutionErrors.scala:83)
  at org.apache.spark.sql.catalyst.expressions.Unevaluable.doGenCode(Expression.scala:312)
  at org.apache.spark.sql.catalyst.expressions.Unevaluable.doGenCode$(Expression.scala:311)
  at org.apache.spark.sql.catalyst.expressions.aggregate.AggregateExpression.doGenCode(interfaces.scala:99)
...
```
The optimised plan before this PR is:

```
'Aggregate [b#1], [_extract_e#5 AS max(a).c.e#3]
+- 'Project [max(a#0).c.e AS _extract_e#5, b#1]
   +- Relation default.test_aggregates[a#0,b#1] parquet
```

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

A new unit test in `NestedColumnAliasingSuite`. The test consists of the repro mentioned earlier.
The produced optimized plan is checked for equivalency with a plan of the form:
```
 Aggregate [b#452], [max(a#451).c.e AS max('a)[c][e]#456]
+- LocalRelation <empty>, [a#451, b#452]
```

Closes #33921 from vicennial/spark-36677.

Authored-by: Venkata Sai Akhil Gudesa <venkata.gudesa@databricks.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-09-07 18:15:48 -07:00
Liang-Chi Hsieh 5a0ae694d0 [SPARK-36670][SQL][TEST] Add FileSourceCodecSuite
### What changes were proposed in this pull request?

This patch mainly proposes to add some e2e test cases in Spark for codec used by main datasources.

### Why are the changes needed?

We found there are no e2e test cases available for main data sources like Parquet and Orc. This makes it harder for developers to identify possible bugs early. We should add such tests to Spark.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Added tests.

Closes #33912 from viirya/SPARK-36670.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-09-07 16:53:11 -07:00
Andy Grove f78d8394dc [SPARK-36666][SQL] Fix regression in AQEShuffleReadExec
Fix regression in AQEShuffleReadExec when used in conjunction with Spark plugins with custom partitioning.

Signed-off-by: Andy Grove <andygrove73@gmail.com>

### What changes were proposed in this pull request?

Return `UnknownPartitioning` rather than throw an exception in `AQEShuffleReadExec`.

### Why are the changes needed?

The [RAPIDS Accelerator for Apache Spark](https://github.com/NVIDIA/spark-rapids) replaces `AQEShuffleReadExec` with a custom operator that runs on the GPU. Due to changes in [SPARK-36315](dd80457ffb), Spark now throws an exception if the shuffle exchange does not have recognized partitioning, and this happens before the postStageOptimizer rules so there is no opportunity to replace this operator now.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

I am still in the process of testing this change. I will update the PR in the next few days with status.

Closes #33910 from andygrove/SPARK-36666.

Authored-by: Andy Grove <andygrove73@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-09-07 13:49:45 -07:00
Jungtaek Lim 093c2080fe [SPARK-36667][SS][TEST] Close resources properly in StateStoreSuite/RocksDBStateStoreSuite
### What changes were proposed in this pull request?

This PR proposes to ensure StateStoreProvider instances are properly closed for each test in StateStoreSuite/RocksDBStateStoreSuite.

### Why are the changes needed?

While this doesn't break the tests, it is a bad practice and may cause nasty problems in the future.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing UTs

Closes #33916 from HeartSaVioR/SPARK-36667.

Authored-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-09-06 17:40:03 -07:00
Kousuke Saruta 0ab0cb108d [SPARK-36675][SQL] Support ScriptTransformation for timestamp_ntz
### What changes were proposed in this pull request?

This PR aims to support `ScriptTransformation` for `timestamp_ntz`.
In the current master, it doesn't work.
```
spark.sql("SELECT transform(col1) USING 'cat' AS (col1 timestamp_ntz) FROM VALUES timestamp_ntz'2021-09-06 20:19:13' t").show(false)
21/09/06 22:03:55 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
org.apache.spark.SparkException: SparkScriptTransformation without serde does not support TimestampNTZType$ as output data type
	at org.apache.spark.sql.errors.QueryExecutionErrors$.outputDataTypeUnsupportedByNodeWithoutSerdeError(QueryExecutionErrors.scala:1740)
	at org.apache.spark.sql.execution.BaseScriptTransformationExec.$anonfun$outputFieldWriters$1(BaseScriptTransformationExec.scala:245)
	at scala.collection.immutable.List.map(List.scala:293)
	at org.apache.spark.sql.execution.BaseScriptTransformationExec.org$apache$spark$sql$execution$BaseScriptTransformationExec$$outputFieldWriters(BaseScriptTransformationExec.scala:194)
	at org.apache.spark.sql.execution.BaseScriptTransformationExec.org$apache$spark$sql$execution$BaseScriptTransformationExec$$outputFieldWriters$(BaseScriptTransformationExec.scala:194)
	at org.apache.spark.sql.execution.SparkScriptTransformationExec.org$apache$spark$sql$execution$BaseScriptTransformationExec$$outputFieldWriters$lzycompute(SparkScriptTransformationExec.scala:38)
	at org.apache.spark.sql.execution.SparkScriptTransformationExec.org$apache$spark$sql$execution$BaseScriptTransformationExec$$outputFieldWriters(SparkScriptTransformationExec.scala:38)
	at org.apache.spark.sql.execution.BaseScriptTransformationExec$$anon$1.$anonfun$processRowWithoutSerde$1(BaseScriptTransformationExec.scala:121)
	at org.apache.spark.sql.execution.BaseScriptTransformationExec$$anon$1.next(BaseScriptTransformationExec.scala:162)
	at org.apache.spark.sql.execution.BaseScriptTransformationExec$$anon$1.next(BaseScriptTransformationExec.scala:113)
```

### Why are the changes needed?

For better usability.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New test.

Closes #33920 from sarutak/script-transformation-timestamp-ntz.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-09-06 20:58:07 +02:00
Yuto Akutsu db95960f4b [SPARK-36660][SQL] Add cot as Scala and Python functions
### What changes were proposed in this pull request?

Add cotangent support by Dataframe operations (e.g. `df.select(cot($"col"))`).

### Why are the changes needed?

Cotangent has been supported by Spark SQL since 2.3.0, but it cannot be called via DataFrame operations.

### Does this PR introduce _any_ user-facing change?

Yes, users can now call the cotangent function by Dataframe operations.

### How was this patch tested?

unit tests.

Closes #33906 from yutoacts/SPARK-36660.

Lead-authored-by: Yuto Akutsu <yuto.akutsu@nttdata.com>
Co-authored-by: Yuto Akutsu <87687356+yutoacts@users.noreply.github.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-09-06 13:38:18 +09:00
yangjie01 bdb73bbc27 [SPARK-36613][SQL][SS] Use EnumSet as the implementation of Table.capabilities method return value
### What changes were proposed in this pull request?
The `Table.capabilities` method returns a `java.util.Set` of the `TableCapability` enumeration type, which is currently implemented using `java.util.HashSet`. Such a set can be replaced with `java.util.EnumSet`, because `EnumSet` implementations can be much more efficient compared to other sets, as sketched below.
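A small sketch of the replacement (using two capabilities as an example):
```scala
import java.util
import org.apache.spark.sql.connector.catalog.TableCapability

// EnumSet stores enum constants as a bit vector, so creation and contains() are
// cheaper than with a general-purpose HashSet.
val capabilities: util.Set[TableCapability] =
  util.EnumSet.of(TableCapability.BATCH_READ, TableCapability.BATCH_WRITE)
assert(capabilities.contains(TableCapability.BATCH_READ))
```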

### Why are the changes needed?
Use more appropriate data structures.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

- Pass GA or Jenkins Tests.
- Add a new benchmark to compare `create` and `contains` operation between `EnumSet` and `HashSet`

Closes #33867 from LuciferYang/SPARK-36613.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-09-05 08:23:05 -05:00
yangjie01 35848385ae [SPARK-36602][CORE][SQL] Clean up redundant asInstanceOf casts
### What changes were proposed in this pull request?
This PR removes redundant `asInstanceOf` casts in the Spark code.

### Why are the changes needed?
Code simplification

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

- Pass GA or Jenkins Tests.

Closes #33852 from LuciferYang/cleanup-asInstanceof.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-09-05 08:22:28 -05:00
Senthil Kumar 6bd491ecb8 [SPARK-36643][SQL] Add more information in ERROR log while SparkConf is modified when spark.sql.legacy.setCommandRejectsSparkCoreConfs is set
### What changes were proposed in this pull request?

This PR adds additional information to the ERROR log produced when a SparkConf entry is modified while spark.sql.legacy.setCommandRejectsSparkCoreConfs is set.

### Why are the changes needed?

Right now, spark.sql.legacy.setCommandRejectsSparkCoreConfs defaults to true in Spark 3.x in order to avoid changing Spark core confs, but the error message leaves it unclear whether Spark confs can be modified in Spark 3.x at all.

### Does this PR introduce any user-facing change?

Yes, a trivial change in the error messages is included.

### How was this patch tested?

New Test added - SPARK-36643: Show migration guide when attempting SparkConf

Closes #33894 from senthh/1st_Sept_2021.

Lead-authored-by: Senthil Kumar <senthh@gmail.com>
Co-authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-09-03 23:49:45 -07:00
Kousuke Saruta cf3bc65e69 [SPARK-36639][SQL] Fix an issue that sequence builtin function causes ArrayIndexOutOfBoundsException if the arguments are under the condition of start == stop && step < 0
### What changes were proposed in this pull request?

This PR fixes an issue that `sequence` builtin function causes `ArrayIndexOutOfBoundsException` if the arguments are under the condition of `start == stop && step < 0`.
This is an example.
```
SELECT sequence(timestamp'2021-08-31', timestamp'2021-08-31', -INTERVAL 1 month);
21/09/02 04:14:42 ERROR SparkSQLDriver: Failed in [SELECT sequence(timestamp'2021-08-31', timestamp'2021-08-31', -INTERVAL 1 month)]
java.lang.ArrayIndexOutOfBoundsException: 1
```
Actually, this example succeeded before SPARK-31980 (#28819) was merged.

### Why are the changes needed?

Bug fix.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New tests.

Closes #33895 from sarutak/fix-sequence-issue.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
2021-09-03 23:25:18 +09:00
Kent Yao 7f1ad7be18 [SPARK-36659][SQL] Promote spark.sql.execution.topKSortFallbackThreshold to a user-facing config
### What changes were proposed in this pull request?

Promote spark.sql.execution.topKSortFallbackThreshold to a user-facing config

### Why are the changes needed?

spark.sql.execution.topKSortFallbackThreshold is currently an internal config hidden from users, with Integer.MAX_VALUE - 15 as its default. In many real-world cases, a very large K causes performance issues.

It's better to leave this choice to users.
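A minimal spark-shell sketch of the effect, assuming a SparkSession named `spark`: for `ORDER BY ... LIMIT k`, a `k` under the threshold is planned as a top-K sort (`TakeOrderedAndProject`), while a larger `k` falls back to a global sort followed by a limit.
```
spark.conf.set("spark.sql.execution.topKSortFallbackThreshold", "100")
spark.range(0, 1000000).createOrReplaceTempView("t")

// LIMIT 10 is under the threshold: planned as TakeOrderedAndProject (top-K sort).
spark.sql("SELECT * FROM t ORDER BY id LIMIT 10").explain()

// LIMIT 1000 exceeds the threshold: planned as a global Sort followed by a Limit.
spark.sql("SELECT * FROM t ORDER BY id LIMIT 1000").explain()
```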

### Does this PR introduce _any_ user-facing change?

 spark.sql.execution.topKSortFallbackThreshold is now user-facing

### How was this patch tested?

passing GA

Closes #33904 from yaooqinn/SPARK-36659.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Kent Yao <yao@apache.org>
2021-09-03 19:11:37 +08:00
Kazuyuki Tanimura d3e3df17aa [SPARK-36644][SQL] Push down boolean column filter
### What changes were proposed in this pull request?
This PR proposes to improve `DataSourceStrategy` to be able to push down boolean column filters. Currently boolean column filters do not get pushed down and may cause unnecessary IO.

### Why are the changes needed?
The following query does not push down the filter in the current implementation
```
SELECT * FROM t WHERE boolean_field
```
although the following query pushes down the filter as expected.
```
SELECT * FROM t WHERE boolean_field = true
```
This is because the physical planner (`DataSourceStrategy`) currently only pushes down limited expression patterns like `EqualTo`.
It is fair for Spark SQL users to expect `boolean_field` to perform the same as `boolean_field = true`.
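A hedged spark-shell illustration (the dataset path and column name are hypothetical); after this change both filters should show up as pushed filters in the scan:
```
// Assumes a SparkSession `spark` and a Parquet dataset with a boolean_field column.
val df = spark.read.parquet("/tmp/t")

// Previously only the explicit comparison produced a pushed-down source filter.
df.filter("boolean_field").explain()           // bare boolean column reference
df.filter("boolean_field = true").explain()    // explicit comparison, already pushed down
```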

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added unit tests
```
build/sbt "core/testOnly *DataSourceStrategySuite   -- -z SPARK-36644"
```

Closes #33898 from kazuyukitanimura/SPARK-36644.

Authored-by: Kazuyuki Tanimura <ktanimura@apple.com>
Signed-off-by: DB Tsai <d_tsai@apple.com>
2021-09-03 07:39:14 +00:00
Cheng Su 9054a6ac00 [SPARK-36652][SQL] AQE dynamic join selection should not apply to non-equi join
### What changes were proposed in this pull request?

Currently `DynamicJoinSelection` has two features: (1) demote broadcast hash join, and (2) promote shuffled hash join. Both are achieved by adding a join hint to the query plan, and both only work for equi-joins. However, [the rule currently matches on the `Join` operator](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/DynamicJoinSelection.scala#L71), so it would add hints to non-equi joins by mistake (see the added test query in `JoinHintSuite.scala` for an example).

This PR fixes `DynamicJoinSelection` to only apply to equi-joins, and improves `checkHintNonEquiJoin` to check that `PREFER_SHUFFLE_HASH` is not added for non-equi joins.
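A hedged illustration of a non-equi join (table names hypothetical); because the ON condition is a range predicate rather than an equality, `DynamicJoinSelection` should not attach demote-broadcast or `PREFER_SHUFFLE_HASH` hints to it:
```
// In spark-shell, with AQE enabled (the default in recent releases).
spark.range(100).createOrReplaceTempView("t1")
spark.range(100).createOrReplaceTempView("t2")

// Non-equi join condition: must not receive equi-join-only hints from AQE.
spark.sql("SELECT * FROM t1 JOIN t2 ON t1.id < t2.id").collect()
```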

### Why are the changes needed?

Avoid adding join hints that only apply to equi-joins to non-equi joins by mistake.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added unit test in `JoinHintSuite.scala`.

Closes #33899 from c21/aqe-test.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-09-02 20:04:49 -07:00
Huaxin Gao 38b6fbd9b8 [SPARK-36351][SQL] Refactor filter push down in file source v2
### What changes were proposed in this pull request?

Currently in `V2ScanRelationPushDown`, we push all the filters (partition filters + data filters) to the file source and then pass all of them as post-scan filters to the v2 Scan. Later, in `PruneFileSourcePartitions`, we separate the partition filters from the data filters and set them, as `Expression`s, on the file source.

Changes in this PR:
When we push filters to file sources in `V2ScanRelationPushDown`, we already have the partition column information, so we want to separate partition filters from data filters there.

The benefit of doing this:
- we can handle all the filter-related work for the v2 file source in one place instead of two (`V2ScanRelationPushDown` and `PruneFileSourcePartitions`), so the code will be cleaner and easier to maintain.
- we actually have to separate partition filters and data filters in `V2ScanRelationPushDown`; otherwise there is no way to find out which filters are partition filters, and we can't push down aggregates for Parquet even if we only have partition filters.
- By separating the filters early in `V2ScanRelationPushDown`, we only need to check data filters to find out which ones should be converted to data source filters (e.g. Parquet predicates, ORC predicates) and pushed down to the file source; right now we check all the filters (both partition filters and data filters).
- Similarly, we only need to pass data filters as post-scan filters to the v2 Scan, because partition filters are used for partition pruning only; there is no need to pass them as post-scan filters.

In order to do this, we will have the following changes

-  add `pushFilters` in file source v2 (see the sketch after this list). In this method:
    - push both the `Expression` partition filters and the `Expression` data filters to the file source. We have to use `Expression` filters because they are needed for partition pruning.
    - data filters are used for filter push down. If the file source needs to push down data filters, it translates them from `Expression` to `sources.Filter` and then decides which filters to push down.
    - partition filters are used for partition pruning.
- file source v2 no longer needs to implement `SupportsPushDownFilters`, because once we separate the two types of filters, we have already set them on the file data sources. It's redundant to use `SupportsPushDownFilters` to set the filters again on the file data sources.
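A hedged sketch of the hook described above (the trait name and exact signature are simplified assumptions, not the interface added by this PR):
```
import org.apache.spark.sql.catalyst.expressions.Expression

// The file source receives already-separated filters as Catalyst Expressions:
// partition filters are kept for partition pruning, while data filters may be
// translated to sources.Filter predicates and pushed down to the reader.
trait FileScanBuilderSketch {
  def pushFilters(partitionFilters: Seq[Expression], dataFilters: Seq[Expression]): Unit
}
```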

### Why are the changes needed?

See the first section above.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing tests

Closes #33650 from huaxingao/partition_filter.

Authored-by: Huaxin Gao <huaxin_gao@apple.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-09-02 19:11:43 -07:00
William Hyun b72fa5ef1c [SPARK-36657][SQL] Update comment in 'gen-sql-config-docs.py'
### What changes were proposed in this pull request?
This PR aims to update comments in `gen-sql-config-docs.py`.

### Why are the changes needed?
To bring it up to date for the Spark 3.2.0 release.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
N/A.

Closes #33902 from williamhyun/fixtool.

Authored-by: William Hyun <william@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-09-02 18:50:59 -07:00
Angerszhuuuu 568ad6aa44 [SPARK-36637][SQL] Provide proper error message when use undefined window frame
### What changes were proposed in this pull request?
The two cases below, which use an undefined window specification, should produce a proper error message.

1. Using an undefined window specification with a window function
```
SELECT nth_value(employee_name, 2) OVER w second_highest_salary
FROM basic_pays;
```
The original error message is:
```
Window function nth_value(employee_name#x, 2, false) requires an OVER clause.
```
This is confusing: the query uses a window specification `w` that is not defined, yet the message does not mention it.
Now the error message is
```
Window specification w is not defined in the WINDOW clause.
```

2. Using an undefined window specification with an aggregate function
```
SELECT SUM(salary) OVER w sum_salary
FROM basic_pays;
```
The original error message is:
```
Error in query: unresolved operator 'Aggregate [unresolvedwindowexpression(sum(salary#2), WindowSpecReference(w)) AS sum_salary#34]
+- SubqueryAlias spark_catalog.default.basic_pays
+- HiveTableRelation [`default`.`employees`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [name#0, dept#1, salary#2, age#3], Partition Cols: []]
```
In this case, when converting to the global Aggregate, we should skip `UnresolvedWindowExpression`.
Now the error message is
```
Window specification w is not defined in the WINDOW clause.
```

### Why are the changes needed?
Provide a proper error message.

### Does this PR introduce _any_ user-facing change?
Yes, error messages are improved as described above.

### How was this patch tested?
Added UT

Closes #33892 from AngersZhuuuu/SPARK-36637.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-09-02 22:32:31 +08:00
Kousuke Saruta 94c306284a [SPARK-36400][TEST][FOLLOWUP] Add test for redacting sensitive information in UI by config
### What changes were proposed in this pull request?

This PR adds a test for SPARK-36400 (#33743).

### Why are the changes needed?

SPARK-36512 (#33741) was fixed so we can add this test now.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New test.

Closes #33885 from sarutak/add-reduction-test.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
2021-09-02 17:04:49 +09:00
Hyukjin Kwon 9c5bcac61e [SPARK-36626][PYTHON] Support TimestampNTZ in createDataFrame/toPandas and Python UDFs
### What changes were proposed in this pull request?

This PR proposes to implement `TimestampNTZType` support in PySpark's `SparkSession.createDataFrame`, `DataFrame.toPandas`, Python UDFs, and pandas UDFs with and without Arrow.

### Why are the changes needed?

To complete `TimestampNTZType` support.

### Does this PR introduce _any_ user-facing change?

Yes.

- Users now can use `TimestampNTZType` type in `SparkSession.createDataFrame`, `DataFrame.toPandas`, Python UDFs, and pandas UDFs with and without Arrow.

- If `spark.sql.timestampType` is configured to `TIMESTAMP_NTZ`, PySpark will infer the `datetime` without timezone as `TimestampNTZType`. If it has a timezone, it will be inferred as `TimestampType` in `SparkSession.createDataFrame`.

    - If `TimestampType` and `TimestampNTZType` conflict during merging inferred schema, `TimestampType` has a higher precedence.

- If the type is `TimestampNTZType`, it is treated internally as having an unknown timezone, computed with UTC (same as on the JVM side), and not localized externally.

### How was this patch tested?

Manually tested and unittests were added.

Closes #33876 from HyukjinKwon/SPARK-36626.

Lead-authored-by: Hyukjin Kwon <gurwls223@apache.org>
Co-authored-by: Dominik Gehl <dog@open.ch>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-09-02 14:00:27 +09:00
Kazuyuki Tanimura 799a0116a8 [SPARK-36607][SQL] Support BooleanType in UnwrapCastInBinaryComparison
### What changes were proposed in this pull request?
This PR proposes to add `BooleanType` support to the `UnwrapCastInBinaryComparison` optimizer, which currently supports `NumericType` only.

The main idea is to treat `BooleanType` as a 1-bit integer so that we can utilize all optimizations already defined in `UnwrapCastInBinaryComparison`.

This work is an extension of SPARK-24994 and SPARK-32858.

### Why are the changes needed?
The current implementation of Spark, without this PR, cannot properly optimize the filter in the following case:
```
SELECT * FROM t WHERE boolean_field = 2
```
The above query creates a filter of `cast(boolean_field, int) = 2`. The cast prevents the filter from being pushed down. In contrast, with this PR the filter folds to `false` and returns early, since there cannot be any matching rows anyway (an empty result).
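A minimal spark-shell sketch of the effect, assuming a SparkSession `spark` and a table with a boolean column (names hypothetical):
```
spark.sql("CREATE TABLE t (boolean_field BOOLEAN, x INT) USING parquet")

// Before this change the optimized plan keeps `cast(boolean_field as int) = 2`;
// with it, the predicate can be unwrapped and folded to `false` (empty result).
spark.sql("SELECT * FROM t WHERE boolean_field = 2").explain(true)
```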

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Passed existing tests
```
build/sbt "catalyst/test"
build/sbt "sql/test"
```
Added unit tests
```
build/sbt "catalyst/testOnly *UnwrapCastInBinaryComparisonSuite   -- -z SPARK-36607"
build/sbt "sql/testOnly *UnwrapCastInComparisonEndToEndSuite  -- -z SPARK-36607"
```

Closes #33865 from kazuyukitanimura/SPARK-36607.

Authored-by: Kazuyuki Tanimura <ktanimura@apple.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-09-01 14:27:30 +08:00
Bo Zhang e33cdfb317 [SPARK-36533][SS] Trigger.AvailableNow for running streaming queries like Trigger.Once in multiple batches
### What changes were proposed in this pull request?

This change creates a new type of Trigger, Trigger.AvailableNow, for streaming queries. It is like Trigger.Once, which processes all available data and then stops the query, but with better scalability, since data can be processed in multiple batches instead of one.

To achieve this, this change proposes a new interface `SupportsTriggerAvailableNow`, which is an extension of `SupportsAdmissionControl`. It has one method, `prepareForTriggerAvailableNow`, which will be called at the beginning of streaming queries with Trigger.AvailableNow, to let the source record the offset for the current latest data at the time (a.k.a. the target offset for the query). The source should then behave as if there is no new data coming in after the beginning of the query, i.e., the source will not return an offset higher than the target offset when `latestOffset` is called.

This change also updates `FileStreamSource` to be an implementation of `SupportsTriggerAvailableNow`.

For other sources that do not implement `SupportsTriggerAvailableNow`, this change also provides a new class `FakeLatestOffsetSupportsTriggerAvailableNow`, which wraps the sources and makes them support Trigger.AvailableNow by overriding their `latestOffset` method to always return the latest offset at the beginning of the query.

### Why are the changes needed?

Currently streaming queries with Trigger.Once will always load all of the available data in a single batch. Because of this, the amount of data a query can process is limited, or Spark driver will run out of memory.

### Does this PR introduce _any_ user-facing change?

Users will be able to use Trigger.AvailableNow (to process all available data then stop the streaming query) with this change.
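A hedged usage sketch, assuming a streaming DataFrame `input` (paths are hypothetical):
```
import org.apache.spark.sql.streaming.Trigger

// Process everything available at query start, possibly across several
// micro-batches, then stop, unlike Trigger.Once's single large batch.
val query = input.writeStream
  .format("parquet")
  .option("checkpointLocation", "/tmp/checkpoint")
  .option("path", "/tmp/output")
  .trigger(Trigger.AvailableNow())
  .start()

query.awaitTermination()
```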

### How was this patch tested?

Added unit tests.

Closes #33763 from bozhang2820/new-trigger.

Authored-by: Bo Zhang <bo.zhang@databricks.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
2021-09-01 15:02:21 +09:00
Hyukjin Kwon 4ed2dab5ee [SPARK-36608][SQL] Support TimestampNTZ in Arrow
### What changes were proposed in this pull request?

This PR proposes to add the support of `TimestampNTZType` in Arrow APIs.
With this change, Spark can write `TimestampNTZType` as an Arrow Timestamp with a `null` timezone.

### Why are the changes needed?

To complete the support of `TimestampNTZType` in Apache Spark.

### Does this PR introduce _any_ user-facing change?

Yes, the Arrow APIs (`ArrowColumnVector`) can now write `TimestampNTZType`

### How was this patch tested?

Unittests were added.

Closes #33875 from HyukjinKwon/SPARK-36608-arrow.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-09-01 10:23:42 +09:00
Jungtaek Lim 60a72c938a [SPARK-36619][SS] Fix bugs around prefix-scan for HDFS backed state store and RocksDB state store
### What changes were proposed in this pull request?

This PR proposes to fix bugs around prefix-scan for both HDFS backed state store and RocksDB state store.

> HDFS backed state store

We did "shallow-copy" on copying prefix map, which leads the values of prefix map (mutable Set) to be "same instances" across multiple versions. This PR fixes it via creating a new mutable Set and copying elements.

> RocksDB state store

Prefix-scan iterators are only closed on RocksDB.rollback(), which is only called in RocksDBStateStore.abort().

While the `RocksDBStateStore.abort()` method will be called for streaming session window (since it has two physical plans, one for read and one for write), other stateful operators, which only have a single read-write physical plan, call either commit or abort and don't close the iterators on commit. These unclosed iterators can be "reused" and produce incorrect outputs.

This PR ensures that resetting prefix-scan iterators is done on loading RocksDB, which was only done in rollback.

### Why are the changes needed?

Please refer the above section on explanation of bugs and treatments.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Modified UT which failed without this PR and passes with this PR.

Closes #33870 from HeartSaVioR/SPARK-36619.

Authored-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-09-01 00:51:45 +08:00
timothy65535 aae7310698 [SPARK-36611] Remove unused listener in HiveThriftServer2AppStatusStore
### What changes were proposed in this pull request?

Remove unused listener in HiveThriftServer2AppStatusStore.

### Why are the changes needed?

`HiveThriftServer2AppStatusStore` provides a view of a KVStore with methods that make it easy to query SQL-specific state.

`HiveThriftServer2Listener` serves no purpose in this class and should not appear here, so we can remove it.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

Closes #33866 from timothy65535/master.

Authored-by: timothy65535 <timothy65535@163.com>
Signed-off-by: Kent Yao <yao@apache.org>
2021-08-31 11:40:34 +08:00
gengjiaan fcc91cfec4 [SPARK-36574][SQL] pushDownPredicate=false should prevent push down filters to JDBC data source
### What changes were proposed in this pull request?
Spark SQL includes a data source that can read data from other databases using JDBC.
Spark also supports the case-insensitive option `pushDownPredicate`.
According to http://spark.apache.org/docs/latest/sql-data-sources-jdbc.html, if `pushDownPredicate` is set to false, no filter will be pushed down to the JDBC data source and thus all filters will be handled by Spark.
But filters are still being pushed down to the JDBC data source.
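A hedged usage sketch (connection details hypothetical); with `pushDownPredicate` set to false, the filter below should stay in Spark's plan instead of appearing in the generated JDBC query:
```
// Assumes a SparkSession `spark` and a reachable JDBC endpoint with a driver on the classpath.
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:postgresql://db-host:5432/testdb")
  .option("dbtable", "people")
  .option("pushDownPredicate", "false")
  .load()
  .filter("age > 30")

// With the fix, the scan's PushedFilters list should be empty and the Filter
// node should remain in Spark's physical plan.
df.explain()
```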

### Why are the changes needed?
Fix a bug where `pushDownPredicate=false` failed to prevent filters from being pushed down to the JDBC data source.

### Does this PR introduce _any_ user-facing change?
No.
The output of queries will not change.

### How was this patch tested?
Jenkins test.

Closes #33822 from beliefer/SPARK-36574.

Authored-by: gengjiaan <gengjiaan@360.cn>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-08-30 19:09:28 +08:00
Gengliang Wang 8a52ad9f82 [SPARK-36606][DOCS][TESTS] Enhance the docs and tests of try_add/try_divide
### What changes were proposed in this pull request?

The `try_add` function allows the following inputs:
- number, number
- date, number
- date, interval
- timestamp, interval
- interval, interval

And, the `try_divide` function allows the following inputs:

- number, number
- interval, number

However, in the current code, there are only examples and tests about the (number, number) inputs. We should enhance the docs to let users know that the functions can be used for datetime and interval operations too.
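A few hedged spark-shell examples of the non-numeric input combinations the enhanced docs should cover (assuming a SparkSession `spark`):
```
spark.sql("SELECT try_add(date'2021-08-29', 1)").show()                 // date, number
spark.sql("SELECT try_add(date'2021-08-29', interval 1 month)").show()  // date, interval
spark.sql("SELECT try_add(interval 1 month, interval 2 month)").show()  // interval, interval
spark.sql("SELECT try_divide(interval 6 month, 2)").show()              // interval, number
```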

### Why are the changes needed?

Improve documentation and tests.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

New UT
Also build docs for preview:
![image](https://user-images.githubusercontent.com/1097932/131212897-8aea14c8-a882-4e12-94e2-f56bde7c0367.png)

Closes #33861 from gengliangwang/enhanceTryDoc.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-08-29 10:30:04 +09:00
senthilkumarb fe7bf5f96f [SPARK-36327][SQL] Spark sql creates staging dir inside database directory rather than creating inside table directory
### What changes were proposed in this pull request?

This PR makes a minor change to the file SaveAsHiveFile.scala.

It contains the following change:

1. Drop `getParent` from the snippet below:
```
if (extURI.getScheme == "viewfs") {
  getExtTmpPathRelTo(path.getParent, hadoopConf, stagingDir)
```

### Why are the changes needed?

Hive creates .staging directories inside the "/db/table" location, but spark-sql creates .staging directories inside the "/db/" location when we use Hadoop federation (viewfs). For other filesystems like HDFS it works as expected (creating .staging inside the /db/table/ location).

In HIVE:
```
 beeline
> use dicedb;
> insert into table part_test partition (j=1) values (1);
...
INFO  : Loading data to table dicedb.part_test partition (j=1) from **viewfs://cloudera/user/daisuke/dicedb/part_test/j=1/.hive-staging_hive_2021-07-19_13-04-44_989_6775328876605030677-1/-ext-10000**
```

but spark's behaviour,

```
spark-sql> use dicedb;
spark-sql> insert into table part_test partition (j=2) values (2);
21/07/19 13:07:37 INFO FileUtils: Creating directory if it doesn't exist: **viewfs://cloudera/user/daisuke/dicedb/.hive-staging_hive_2021-07-19_13-07-37_317_5083528872437596950-1**
...
```

The reason we require this change: if we allow spark-sql to create the .staging directory inside the /db/ location, we end up with security issues, because we would need to grant permissions on the "viewfs:///db/" location to all users who submit Spark jobs.

After this change is applied, spark-sql creates .staging inside /db/table/, similar to Hive, as shown below:

```
spark-sql> use dicedb;
21/07/28 00:22:47 INFO SparkSQLCLIDriver: Time taken: 0.929 seconds
spark-sql> insert into table part_test partition (j=8) values (8);
21/07/28 00:23:25 INFO HiveMetaStoreClient: Closed a connection to metastore, current connections: 1
21/07/28 00:23:26 INFO FileUtils: Creating directory if it doesn't exist: **viewfs://cloudera/user/daisuke/dicedb/part_test/.hive-staging_hive_2021-07-28_00-23-26_109_4548714524589026450-1**
```

The reason why this issue occurs in spark-sql but not in Hive:

In Hive, a "/db/table/tmp" directory structure is passed as the path, so path.getParent returns "db/table/". But in Spark we just pass "/db/table", so it is not necessary to use "path.getParent" for Hadoop federation (viewfs).

### Does this PR introduce _any_ user-facing change?
 No

### How was this patch tested?

Tested manually by creating hive-sql.jar

Closes #33577 from senthh/viewfs-792392.

Authored-by: senthilkumarb <senthilkumarb@cloudera.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-08-27 12:58:28 -07:00