Commit graph

7553 commits

Author SHA1 Message Date
Chao Sun 27cd945c15 [SPARK-32381][CORE][SQL][FOLLOWUP] More cleanup on HadoopFSUtils
### What changes were proposed in this pull request?

This PR is a follow-up of #29471 and does the following improvements for `HadoopFSUtils`:
1. Removes the extra `filterFun` from the listing API and combines it with the `filter`.
2. Removes `SerializableBlockLocation` and `SerializableFileStatus` given that `BlockLocation` and `FileStatus` are already serializable.
3. Hides the `isRootLevel` flag from the top-level API.

### Why are the changes needed?

Main purpose is to simplify the logic within `HadoopFSUtils` as well as cleanup the API.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing unit tests (e.g., `FileIndexSuite`)

Closes #29959 from sunchao/hadoop-fs-utils-followup.

Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Holden Karau <hkarau@apple.com>
2020-11-18 12:39:00 -08:00
Bryan Cutler 8e2a0bdce7 [SPARK-24554][PYTHON][SQL] Add MapType support for PySpark with Arrow
### What changes were proposed in this pull request?

This change adds MapType support for PySpark with Arrow, if using pyarrow >= 2.0.0.

### Why are the changes needed?

MapType was previously unsupported with Arrow.

### Does this PR introduce _any_ user-facing change?

Users can now use MapType with `createDataFrame()` and `toPandas()` with Arrow optimization, and with Pandas UDFs.

### How was this patch tested?

Added new PySpark tests for createDataFrame(), toPandas() and Scalar Pandas UDFs.

Closes #30393 from BryanCutler/arrow-add-MapType-SPARK-24554.

Authored-by: Bryan Cutler <cutlerb@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-11-18 21:18:19 +09:00
Liang-Chi Hsieh 7f3d99a8a5 [MINOR][SQL][DOCS] Update schema_of_csv and schema_of_json doc
### What changes were proposed in this pull request?

This minor PR updates the docs of `schema_of_csv` and `schema_of_json`. They now accept a foldable string column instead of only a string literal.
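For illustration, a minimal Scala sketch (assuming a `spark` session is available, e.g. in `spark-shell`) of passing a foldable column rather than a plain literal string to `schema_of_json`:

```scala
import org.apache.spark.sql.functions.{lit, schema_of_json}

// The argument only needs to be foldable (constant-valued), not a bare string literal.
val schemaDF = spark.range(1)
  .select(schema_of_json(lit("""{"a": 1, "b": [1.5, 2.5]}""")).as("json_schema"))
schemaDF.show(false)   // prints the inferred schema as a DDL-formatted string
```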

### Why are the changes needed?

The function docs of `schema_of_csv` and `schema_of_json` were not updated accordingly in previous PRs.

### Does this PR introduce _any_ user-facing change?

Yes, update user-facing doc.

### How was this patch tested?

Unit test.

Closes #30396 from viirya/minor-json-csv.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-11-18 11:32:27 +09:00
Liang-Chi Hsieh 928348408e [SPARK-33427][SQL] Add subexpression elimination for interpreted expression evaluation
### What changes were proposed in this pull request?

This patch proposes to add subexpression elimination for interpreted expression evaluation. Interpreted expression evaluation is used when codegen is not able to work, for example with a complex schema.

### Why are the changes needed?

Currently we only do subexpression elimination for codegen. For various reasons, we may need to run interpreted expression evaluation: for example, codegen fails to compile and Spark falls back to interpreted mode, or expressions have complex input/output schemas. Complex schemas from expressions are commonly seen and can also be caused by the query optimizer, e.g. SPARK-32945.

We should also support subexpression elimination for interpreted evaluation. That would reduce the performance difference when Spark falls back from codegen to interpreted expression evaluation, and improve Spark usability.
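As a hypothetical illustration (assuming `spark-shell`, where a `spark` session and its implicits are available), a query where the same expensive `from_json` call feeds several output columns is the kind of workload that benefits, matching the `from_json as subExpr` benchmark below:

```scala
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{IntegerType, StringType, StructType}
import spark.implicits._

val schema = new StructType().add("a", IntegerType).add("b", StringType)
val df = Seq("""{"a": 1, "b": "x"}""").toDF("json")

// The same from_json(...) subexpression is used by both output columns; with
// subexpression elimination it is evaluated once per row, now also in interpreted mode.
val parsed = from_json(col("json"), schema)
df.select(parsed.getField("a").as("a"), parsed.getField("b").as("b")).show()
```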

#### Benchmark

Update `SubExprEliminationBenchmark`:

Before:

```
OpenJDK 64-Bit Server VM 1.8.0_265-b01 on Mac OS X 10.15.6
 Intel(R) Core(TM) i7-9750H CPU  2.60GHz
 from_json as subExpr:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 -------------------------------------------------------------------------------------------------------------------------
subexpressionElimination on, codegen off           24707          25688         903          0.0   247068775.9       1.0X
```

After:
```
OpenJDK 64-Bit Server VM 1.8.0_265-b01 on Mac OS X 10.15.6
 Intel(R) Core(TM) i7-9750H CPU  2.60GHz
 from_json as subExpr:                      Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 -------------------------------------------------------------------------------------------------------------------------
subexpressionElimination on, codegen off            2360           2435          87          0.0    23604320.7      11.2X
```

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit test. Benchmark manually.

Closes #30341 from viirya/SPARK-33427.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-11-17 14:29:37 +00:00
HyukjinKwon e2c7bfce40 [SPARK-33407][PYTHON] Simplify the exception message from Python UDFs (disabled by default)
### What changes were proposed in this pull request?

This PR proposes to simplify the exception messages from Python UDFs.

Currently, the exception message from Python UDFs is as below:

```python
from pyspark.sql.functions import udf; spark.range(10).select(udf(lambda x: x/0)("id")).collect()
```

```python
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../python/pyspark/sql/dataframe.py", line 427, in show
    print(self._jdf.showString(n, 20, vertical))
  File "/.../python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
  File "/.../python/pyspark/sql/utils.py", line 127, in deco
    raise_from(converted)
  File "<string>", line 3, in raise_from
pyspark.sql.utils.PythonException:
  An exception was thrown from Python worker in the executor:
Traceback (most recent call last):
  File "/.../python/lib/pyspark.zip/pyspark/worker.py", line 605, in main
    process()
  File "/.../python/lib/pyspark.zip/pyspark/worker.py", line 597, in process
    serializer.dump_stream(out_iter, outfile)
  File "/.../python/lib/pyspark.zip/pyspark/serializers.py", line 223, in dump_stream
    self.serializer.dump_stream(self._batched(iterator), stream)
  File "/.../python/lib/pyspark.zip/pyspark/serializers.py", line 141, in dump_stream
    for obj in iterator:
  File "/.../python/lib/pyspark.zip/pyspark/serializers.py", line 212, in _batched
    for item in iterator:
  File "/.../python/lib/pyspark.zip/pyspark/worker.py", line 450, in mapper
    result = tuple(f(*[a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs)
  File "/.../python/lib/pyspark.zip/pyspark/worker.py", line 450, in <genexpr>
    result = tuple(f(*[a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs)
  File "/.../python/lib/pyspark.zip/pyspark/worker.py", line 90, in <lambda>
    return lambda *a: f(*a)
  File "/.../python/lib/pyspark.zip/pyspark/util.py", line 107, in wrapper
    return f(*args, **kwargs)
  File "<stdin>", line 1, in <lambda>
ZeroDivisionError: division by zero
```

In almost all cases, users only care about `ZeroDivisionError: division by zero`. We don't really have to show the internal details in 99% of cases.

This PR adds a configuration `spark.sql.execution.pyspark.udf.simplifiedException.enabled` (disabled by default) that hides the internal tracebacks related to Python worker, (de)serialization, etc.

```python
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../python/pyspark/sql/dataframe.py", line 427, in show
    print(self._jdf.showString(n, 20, vertical))
  File "/.../python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
  File "/.../python/pyspark/sql/utils.py", line 127, in deco
    raise_from(converted)
  File "<string>", line 3, in raise_from
pyspark.sql.utils.PythonException:
  An exception was thrown from Python worker in the executor:
Traceback (most recent call last):
  File "<stdin>", line 1, in <lambda>
ZeroDivisionError: division by zero
```

The traceback will be shown from the point where any non-PySpark file is seen in the traceback.

### Why are the changes needed?

Without this configuration, such internal tracebacks are exposed to users directly, especially for shell or notebook users in PySpark. In 99% of cases people don't care about the internal Python worker, (de)serialization, and related tracebacks; they just make the exception more difficult to read. For example, the single statement `x/0` above shows a very long traceback, most of which is unnecessary.

This configuration enables the ability to show simplified tracebacks which users will likely be most interested in.

### Does this PR introduce _any_ user-facing change?

By default, no. It adds one configuration that simplifies the exception message. See the example above.

### How was this patch tested?

Manually tested:

```bash
$ pyspark --conf spark.sql.execution.pyspark.udf.simplifiedException.enabled=true
```
```python
from pyspark.sql.functions import udf; spark.sparkContext.setLogLevel("FATAL"); spark.range(10).select(udf(lambda x: x/0)("id")).collect()
```

and unittests were also added.

Closes #30309 from HyukjinKwon/SPARK-33407.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-11-17 14:15:31 +09:00
Prakhar Jain f5e3302840 [SPARK-33399][SQL] Normalize output partitioning and sortorder with respect to aliases to avoid unneeded exchange/sort nodes
### What changes were proposed in this pull request?
This pull request removes unneeded exchanges/sorts by normalizing the output partitioning and sort order information correctly with respect to aliases.

Example: consider this join of three tables:

     |SELECT t2id, t3.id as t3id
     |FROM (
     |    SELECT t1.id as t1id, t2.id as t2id
     |    FROM t1, t2
     |    WHERE t1.id = t2.id
     |) t12, t3
     |WHERE t1id = t3.id

The plan for this looks like:

      *(9) Project [t2id#1034L, id#1004L AS t3id#1035L]
      +- *(9) SortMergeJoin [t1id#1033L], [id#1004L], Inner
         :- *(6) Sort [t1id#1033L ASC NULLS FIRST], false, 0
         :  +- Exchange hashpartitioning(t1id#1033L, 5), true, [id=#1343]   <------------------------------
         :     +- *(5) Project [id#996L AS t1id#1033L, id#1000L AS t2id#1034L]
         :        +- *(5) SortMergeJoin [id#996L], [id#1000L], Inner
         :           :- *(2) Sort [id#996L ASC NULLS FIRST], false, 0
         :           :  +- Exchange hashpartitioning(id#996L, 5), true, [id=#1329]
         :           :     +- *(1) Range (0, 10, step=1, splits=2)
         :           +- *(4) Sort [id#1000L ASC NULLS FIRST], false, 0
         :              +- Exchange hashpartitioning(id#1000L, 5), true, [id=#1335]
         :                 +- *(3) Range (0, 20, step=1, splits=2)
         +- *(8) Sort [id#1004L ASC NULLS FIRST], false, 0
            +- Exchange hashpartitioning(id#1004L, 5), true, [id=#1349]
               +- *(7) Range (0, 30, step=1, splits=2)

In this plan, the marked exchange could have been avoided as the data is already partitioned on "t1.id". This happens because the AliasAwareOutputPartitioning class only handles aliases related to HashPartitioning. This change normalizes all output partitioning based on the aliasing happening in Project.

### Why are the changes needed?
To remove unneeded exchanges.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
New UT added.

On TPCDS 1000 scale, this change improves the performance of query 95 from 330 seconds to 170 seconds by removing the extra Exchange.

Closes #30300 from prakharjain09/SPARK-33399-outputpartitioning.

Authored-by: Prakhar Jain <prakharjain09@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-11-17 10:35:43 +09:00
xuewei.linxuewei b5eca18af0 [SPARK-33460][SQL] Accessing map values should fail if key is not found
### What changes were proposed in this pull request?

Instead of returning NULL, throw a runtime NoSuchElementException when an invalid key is accessed in map-like functions such as element_at and GetMapValue, when ANSI mode is on.
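A minimal sketch of the new behavior, assuming ANSI mode is enabled for the session:

```scala
spark.conf.set("spark.sql.ansi.enabled", "true")

spark.sql("SELECT element_at(map(1, 'a', 2, 'b'), 2)").show()   // key exists: returns 'b'
spark.sql("SELECT element_at(map(1, 'a', 2, 'b'), 3)").show()   // missing key: expected to throw NoSuchElementException
```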

### Why are the changes needed?

For ANSI mode.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Added UT and Existing UT.

Closes #30386 from leanken/leanken-SPARK-33460.

Authored-by: xuewei.linxuewei <xuewei.linxuewei@alibaba-inc.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-11-16 16:14:31 +00:00
Max Gekk 6883f29465 [SPARK-33453][SQL][TESTS] Unify v1 and v2 SHOW PARTITIONS tests
### What changes were proposed in this pull request?
1. Move `SHOW PARTITIONS` parsing tests to `ShowPartitionsParserSuite`
2. Place Hive tests for `SHOW PARTITIONS` from `HiveCommandSuite` into the base test suite `v1.ShowPartitionsSuiteBase`. This will allow running the tests w/ and w/o Hive.

The changes follow the approach of https://github.com/apache/spark/pull/30287.

### Why are the changes needed?
- The unification will allow running common `SHOW PARTITIONS` tests for DSv1, Hive DSv1, and DSv2
- We can detect missing features and differences between DSv1 and DSv2 implementations.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running:
- new test suites `build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *ShowPartitionsSuite"`
- and old one `build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly org.apache.spark.sql.hive.execution.HiveCommandSuite"`

Closes #30377 from MaxGekk/unify-dsv1_v2-show-partitions-tests.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-11-16 16:11:42 +00:00
luluorta dfa6fb46f4 [SPARK-33389][SQL] Make internal classes of SparkSession always using active SQLConf
### What changes were proposed in this pull request?

This PR makes internal classes of SparkSession always use the active SQLConf. We remove all `conf: SQLConf` ctor-parameters from these classes (`Analyzer`, `SparkPlanner`, `SessionCatalog`, `CatalogManager`, `SparkSqlParser`, etc.) and use `SQLConf.get` instead.

### Why are the changes needed?

Code refinement.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing test

Closes #30299 from luluorta/SPARK-33389.

Authored-by: luluorta <luluorta@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-11-16 15:27:18 +00:00
xuewei.linxuewei aa508fcc03 [SPARK-33140][SQL][FOLLOW-UP] Revert code that not use passed-in SparkSession to get SQLConf
### What changes were proposed in this pull request?

Revert code from [SPARK-33140] that did not use the passed-in SparkSession to get SQLConf. The scope of [SPARK-33140] was to change passed-in SQLConf instances and places that used a SparkSession to get SQLConf so that they uniformly use SQLConf.get. In the code reverted by this patch, the passed-in SparkSession was not there to get SQLConf but to use its catalog, so it is better to keep it consistent.

### Why are the changes needed?

Potential regression bug.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Existing UT.

Closes #30364 from leanken/leanken-SPARK-33140.

Authored-by: xuewei.linxuewei <xuewei.linxuewei@alibaba-inc.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-11-16 11:57:50 +00:00
Liang-Chi Hsieh 10b011f837 [SPARK-33456][SQL][TEST][FOLLOWUP] Fix SUBEXPRESSION_ELIMINATION_ENABLED config name
### What changes were proposed in this pull request?

To fix wrong config name in `subexp-elimination.sql`.

### Why are the changes needed?

`CONFIG_DIM` should use config name's key.

### Does this PR introduce _any_ user-facing change?

No, dev only.

### How was this patch tested?

Unit test.

Closes #30384 from viirya/SPARK-33456-followup.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-11-16 17:53:31 +09:00
Max Gekk 4e5d2e0695 [SPARK-33394][SQL][TESTS] Throw NoSuchNamespaceException for not existing namespace in InMemoryTableCatalog.listTables()
### What changes were proposed in this pull request?
Throw `NoSuchNamespaceException` in `listTables()` of the custom test catalog `InMemoryTableCatalog` if the passed namespace doesn't exist.

### Why are the changes needed?
1. To align behavior of V2 `InMemoryTableCatalog` to V1 session catalog.
2. To distinguish two situations:
    1. A namespace **does exist** but does not contain any tables. In that case, `listTables()` returns empty result.
    2. A namespace **does not exist**. `listTables()` throws `NoSuchNamespaceException` in this case.

### Does this PR introduce _any_ user-facing change?
Yes. For example, `SHOW TABLES` for a non-existing namespace returned an empty result before the changes; now it throws `NoSuchNamespaceException`.

### How was this patch tested?
By running V1/V2 ShowTablesSuites.

Closes #30358 from MaxGekk/show-tables-in-not-existing-namespace.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-11-16 07:08:21 +00:00
Liang-Chi Hsieh d4cf1483fd [SPARK-33456][SQL][TEST] Add end-to-end test for subexpression elimination
### What changes were proposed in this pull request?

This patch proposes to add end-to-end test for subexpression elimination.

### Why are the changes needed?

We have a subexpression elimination feature for expression evaluation, but we don't have end-to-end tests for it. We should have some to make sure we don't break it.

### Does this PR introduce _any_ user-facing change?

No, dev only.

### How was this patch tested?

Unit tests.

Closes #30381 from viirya/SPARK-33456.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-11-16 15:47:35 +09:00
Liang-Chi Hsieh eea846b895
[SPARK-33455][SQL][TEST] Add SubExprEliminationBenchmark for benchmarking subexpression elimination
### What changes were proposed in this pull request?

This patch adds a benchmark `SubExprEliminationBenchmark` for benchmarking subexpression elimination feature.

### Why are the changes needed?

We need a benchmark for the subexpression elimination feature, for changes such as #30341.

### Does this PR introduce _any_ user-facing change?

No, dev only.

### How was this patch tested?

Unit test.

Closes #30379 from viirya/SPARK-33455.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-11-14 19:02:36 -08:00
luluorta 156704ba0d
[SPARK-33432][SQL] SQL parser should use active SQLConf
### What changes were proposed in this pull request?

This PR makes the SQL parser use the active SQLConf instead of the one passed in through ctor-parameters.

### Why are the changes needed?

In ANSI mode, schema string parsing should fail if the schema uses ANSI reserved keyword as attribute name:

```scala
spark.conf.set("spark.sql.ansi.enabled", "true")
spark.sql("""select from_json('{"time":"26/10/2015"}', 'time Timestamp', map('timestampFormat',  'dd/MM/yyyy'));""").show
```

output:

> Cannot parse the data type:
> no viable alternative at input 'time'(line 1, pos 0)
>
> == SQL ==
> time Timestamp
> ^^^

But this query may accidentally succeed in certain cases because the DataType parser sticks to the configs of the first session created in the current thread:

```scala
DataType.fromDDL("time Timestamp")
val newSpark = spark.newSession()
newSpark.conf.set("spark.sql.ansi.enabled", "true")
newSpark.sql("""select from_json('{"time":"26/10/2015"}', 'time Timestamp', map('timestampFormat', 'dd/MM/yyyy'));""").show
```

output:

> +--------------------------------+
> |from_json({"time":"26/10/2015"})|
> +--------------------------------+
> |                   {2015-10-26 00:00...|
> +--------------------------------+

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Newly and updated UTs

Closes #30357 from luluorta/SPARK-33432.

Authored-by: luluorta <luluorta@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-11-14 13:37:12 -08:00
xuewei.linxuewei 234711a328 Revert "[SPARK-33139][SQL] protect setActionSession and clearActiveSession"
### What changes were proposed in this pull request?

In [SPARK-33139] we marked `setActionSession` and `clearActiveSession` as deprecated APIs. It turns out they are widely used, and after discussion, even without this PR they should work with the unified view feature; there is only a risk if users really abuse these two APIs. So reverting the PR is needed.

[SPARK-33139] has two commits, including a follow-up. Revert them both.

### Why are the changes needed?

Revert.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Existing UT.

Closes #30367 from leanken/leanken-revert-SPARK-33139.

Authored-by: xuewei.linxuewei <xuewei.linxuewei@alibaba-inc.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-11-13 13:35:45 +00:00
Kent Yao cdd8e51742 [SPARK-33419][SQL] Unexpected behavior when using SET commands before a query in SparkSession.sql
### What changes were proposed in this pull request?

SparkSession.sql converts a string value to a DataFrame, and the string value should be a single SQL statement, ending with or without one or more semicolons, e.g.

```sql
scala> spark.sql(" select 2").show
+---+
|  2|
+---+
|  2|
+---+
scala> spark.sql(" select 2;").show
+---+
|  2|
+---+
|  2|
+---+

scala> spark.sql(" select 2;;;;").show
+---+
|  2|
+---+
|  2|
+---+
```
If we put 2 or more statements in, it fails in the parser as expected, e.g.

```sql
scala> spark.sql(" select 2; select 1;").show
org.apache.spark.sql.catalyst.parser.ParseException:
extraneous input 'select' expecting {<EOF>, ';'}(line 1, pos 11)

== SQL ==
 select 2; select 1;
-----------^^^

  at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:263)
  at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:130)
  at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:51)
  at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:81)
  at org.apache.spark.sql.SparkSession.$anonfun$sql$2(SparkSession.scala:610)
  at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111)
  at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:610)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:769)
  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:607)
  ... 47 elided
```

As a very common user scenario, users may want to change some settings before they execute their queries. They may pass a string value like `set spark.sql.abc=2; select 1;` into this API, which creates a confusing gap between the actual effect and the user's expectations.

The user may want the query to be executed with spark.sql.abc=2, but Spark actually treats the whole part of `2; select 1;` as the value of the property 'spark.sql.abc',
e.g.

```
scala> spark.sql("set spark.sql.abc=2; select 1;").show
+-------------+------------+
|          key|       value|
+-------------+------------+
|spark.sql.abc|2; select 1;|
+-------------+------------+
```

What's more, the SET command digests everything after it, which makes its behavior unstable from version to version, e.g.

#### 3.1
```sql
scala> spark.sql("set;").show
org.apache.spark.sql.catalyst.parser.ParseException:
Expected format is 'SET', 'SET key', or 'SET key=value'. If you want to include special characters in key, please use quotes, e.g., SET `ke y`=value.(line 1, pos 0)

== SQL ==
set;
^^^

  at org.apache.spark.sql.execution.SparkSqlAstBuilder.$anonfun$visitSetConfiguration$1(SparkSqlParser.scala:83)
  at org.apache.spark.sql.catalyst.parser.ParserUtils$.withOrigin(ParserUtils.scala:113)
  at org.apache.spark.sql.execution.SparkSqlAstBuilder.visitSetConfiguration(SparkSqlParser.scala:72)
  at org.apache.spark.sql.execution.SparkSqlAstBuilder.visitSetConfiguration(SparkSqlParser.scala:58)
  at org.apache.spark.sql.catalyst.parser.SqlBaseParser$SetConfigurationContext.accept(SqlBaseParser.java:2161)
  at org.antlr.v4.runtime.tree.AbstractParseTreeVisitor.visit(AbstractParseTreeVisitor.java:18)
  at org.apache.spark.sql.catalyst.parser.AstBuilder.$anonfun$visitSingleStatement$1(AstBuilder.scala:77)
  at org.apache.spark.sql.catalyst.parser.ParserUtils$.withOrigin(ParserUtils.scala:113)
  at org.apache.spark.sql.catalyst.parser.AstBuilder.visitSingleStatement(AstBuilder.scala:77)
  at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.$anonfun$parsePlan$1(ParseDriver.scala:82)
  at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:113)
  at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:51)
  at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:81)
  at org.apache.spark.sql.SparkSession.$anonfun$sql$2(SparkSession.scala:610)
  at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111)
  at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:610)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:769)
  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:607)
  ... 47 elided

scala> spark.sql("set a;").show
org.apache.spark.sql.catalyst.parser.ParseException:
Expected format is 'SET', 'SET key', or 'SET key=value'. If you want to include special characters in key, please use quotes, e.g., SET `ke y`=value.(line 1, pos 0)

== SQL ==
set a;
^^^

  at org.apache.spark.sql.execution.SparkSqlAstBuilder.$anonfun$visitSetConfiguration$1(SparkSqlParser.scala:83)
  at org.apache.spark.sql.catalyst.parser.ParserUtils$.withOrigin(ParserUtils.scala:113)
  at org.apache.spark.sql.execution.SparkSqlAstBuilder.visitSetConfiguration(SparkSqlParser.scala:72)
  at org.apache.spark.sql.execution.SparkSqlAstBuilder.visitSetConfiguration(SparkSqlParser.scala:58)
  at org.apache.spark.sql.catalyst.parser.SqlBaseParser$SetConfigurationContext.accept(SqlBaseParser.java:2161)
  at org.antlr.v4.runtime.tree.AbstractParseTreeVisitor.visit(AbstractParseTreeVisitor.java:18)
  at org.apache.spark.sql.catalyst.parser.AstBuilder.$anonfun$visitSingleStatement$1(AstBuilder.scala:77)
  at org.apache.spark.sql.catalyst.parser.ParserUtils$.withOrigin(ParserUtils.scala:113)
  at org.apache.spark.sql.catalyst.parser.AstBuilder.visitSingleStatement(AstBuilder.scala:77)
  at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.$anonfun$parsePlan$1(ParseDriver.scala:82)
  at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:113)
  at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:51)
  at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:81)
  at org.apache.spark.sql.SparkSession.$anonfun$sql$2(SparkSession.scala:610)
  at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111)
  at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:610)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:769)
  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:607)
  ... 47 elided
```

#### 2.4

```sql
scala> spark.sql("set;").show
+---+-----------+
|key|      value|
+---+-----------+
|  ;|<undefined>|
+---+-----------+

scala> spark.sql("set a;").show
+---+-----------+
|key|      value|
+---+-----------+
| a;|<undefined>|
+---+-----------+
```

In this PR,
1. make `set spark.sql.abc=2; select 1;` in `SparkSession.sql` fail directly; users should call `.sql` for each statement separately.
2. make the semicolon the separator of statements; if users want to use it as part of a property value, they shall use quotes too.

### Why are the changes needed?

1. disambiguation for `SparkSession.sql`
2. make the semicolon work the same way w/ `SET` and other statements

### Does this PR introduce _any_ user-facing change?

Yes.
The semicolon works as a separator of statements now: it will be trimmed if it is at the end of the statement, and it will fail the statement if it is in the middle. You need to use quotes if you want it to be part of a property value.

### How was this patch tested?

new tests

Closes #30332 from yaooqinn/SPARK-33419.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-11-13 06:58:16 +00:00
Max Gekk 539c2deb89 [SPARK-33426][SQL][TESTS] Unify Hive SHOW TABLES tests
### What changes were proposed in this pull request?
1. Create the separate test suite `org.apache.spark.sql.hive.execution.command.ShowTablesSuite`.
2. Re-use V1 SHOW TABLES tests added by https://github.com/apache/spark/pull/30287 in the Hive test suites.
3. Add new test case for the pattern `'table_name_1*|table_name_2*'` in the common test suite.

### Why are the changes needed?
To test V1 + common  SHOW TABLES tests in Hive.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running v1/v2 and Hive v1 `ShowTablesSuite`:
```
$  build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *ShowTablesSuite"
```

Closes #30340 from MaxGekk/show-tables-hive-tests.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-11-13 05:15:13 +00:00
Liang-Chi Hsieh 2c64b731ae
[SPARK-33259][SS] Disable streaming query with possible correctness issue by default
### What changes were proposed in this pull request?

This patch proposes to disable streaming queries with a possible correctness issue in chained stateful operators. The behavior can be controlled by a SQL config, so if users understand the risk and still want to run such a query, they can disable the check.

### Why are the changes needed?

The possible correctness issue in chained stateful operators in streaming queries is not straightforward for users. From the users' perspective, it will be considered a Spark bug. In the worse case, users are not even aware of the correctness issue and consume wrong results.

A better approach is to disable such queries and let users choose to run them if they understand there is such a risk, instead of implicitly running the query and letting users find out the correctness issue by themselves and report it to the Spark community.

### Does this PR introduce _any_ user-facing change?

Yes. Streaming queries with a possible correctness issue will be blocked from running, unless users explicitly disable the SQL config.

### How was this patch tested?

Unit test.

Closes #30210 from viirya/SPARK-33259.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-11-12 15:31:57 -08:00
Chao Sun cf3b6551ce
[SPARK-33435][SQL] DSv2: REFRESH TABLE should invalidate caches referencing the table
### What changes were proposed in this pull request?

This changes `RefreshTableExec` in DSv2 to also invalidate caches with references to the target table to be refreshed. The change itself is similar to what's done in #30211. Note, though, that since we currently don't support caching a DSv2 table directly, this doesn't add recache logic as in the DSv1 impl. I marked it as a TODO for now.
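A hedged sketch with a hypothetical v2 table name `testcat.ns.tbl`: a cached query that references the table is now invalidated when the table is refreshed, though it is not automatically recached (per the TODO above):

```scala
val df = spark.table("testcat.ns.tbl")      // hypothetical v2 table
df.cache()
df.count()                                  // materializes the cache
spark.sql("REFRESH TABLE testcat.ns.tbl")   // now also invalidates the dependent cache
```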

### Why are the changes needed?

Currently the behavior in DSv1 and DSv2 is inconsistent w.r.t. refreshing a table: in DSv1 we invalidate both the metadata cache and all table caches that are related to the table, but in DSv2 we only do the former. This addresses the issue and makes the behavior consistent.

### Does this PR introduce _any_ user-facing change?

Yes, now refreshing a v2 table also invalidates all the related caches.

### How was this patch tested?

Added a new UT.

Closes #30359 from sunchao/SPARK-33435.

Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-11-12 15:22:56 -08:00
gengjiaan 2f07c56810 [SPARK-33278][SQL] Improve the performance for FIRST_VALUE
### What changes were proposed in this pull request?
https://github.com/apache/spark/pull/29800 provides a performance improvement for `NTH_VALUE`.
`FIRST_VALUE` could also use `UnboundedOffsetWindowFunctionFrame` and `UnboundedPrecedingOffsetWindowFunctionFrame`.
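As a hedged illustration with a hypothetical `employees` table, `first_value` over an unbounded window frame is the kind of query these frames target:

```scala
// Hypothetical table/columns; first_value over an unbounded frame returns the lowest
// salary per department here.
spark.sql("""
  SELECT name,
         first_value(salary) OVER (
           PARTITION BY dept ORDER BY salary
           ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
         ) AS lowest_salary_in_dept
  FROM employees
""").show()
```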

### Why are the changes needed?
Improve the performance for `FIRST_VALUE`.

### Does this PR introduce _any_ user-facing change?
 'No'.

### How was this patch tested?
Jenkins test.

Closes #30178 from beliefer/SPARK-33278.

Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: beliefer <beliefer@163.com>
Co-authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-11-12 14:59:22 +00:00
xuewei.linxuewei 6d31daeb6a [SPARK-33386][SQL] Accessing array elements in ElementAt/Elt/GetArrayItem should failed if index is out of bound
### What changes were proposed in this pull request?

Instead of returning NULL, throw a runtime ArrayIndexOutOfBoundsException when ANSI mode is enabled for the `element_at`, `elt`, and `GetArrayItem` functions.
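A minimal sketch, assuming ANSI mode is enabled for the session:

```scala
spark.conf.set("spark.sql.ansi.enabled", "true")

spark.sql("SELECT element_at(array(1, 2, 3), 5)").show()   // out of range: expected to throw ArrayIndexOutOfBoundsException
spark.sql("SELECT elt(4, 'a', 'b', 'c')").show()           // same for elt with an out-of-range index
```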

### Why are the changes needed?

For ANSI mode.

### Does this PR introduce any user-facing change?

When `spark.sql.ansi.enabled` = true, Spark will throw `ArrayIndexOutOfBoundsException` for an out-of-range index when accessing array elements.

### How was this patch tested?

Added UT and existing UT.

Closes #30297 from leanken/leanken-SPARK-33386.

Authored-by: xuewei.linxuewei <xuewei.linxuewei@alibaba-inc.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-11-12 08:50:32 +00:00
Yuanjian Li 9f983a68f1 [SPARK-30294][SS][FOLLOW-UP] Directly override RDD methods
### Why are the changes needed?
Follow the comment: https://github.com/apache/spark/pull/26935#discussion_r514697997

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing test and Mima test.

Closes #30344 from xuanyuanking/SPARK-30294-follow.

Authored-by: Yuanjian Li <yuanjian.li@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-11-12 12:22:25 +09:00
Max Gekk 7e867298fe
[SPARK-33404][SQL][FOLLOWUP] Update benchmark results for date_trunc
### What changes were proposed in this pull request?
Updated results of `DateTimeBenchmark` in the environment:

| Item | Description |
| ---- | ----|
| Region | us-west-2 (Oregon) |
| Instance | r3.xlarge (spot instance) |
| AMI | ami-06f2f779464715dc5 (ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1) |
| Java | OpenJDK8/11 installed by`sudo add-apt-repository ppa:openjdk-r/ppa` & `sudo apt install openjdk-11-jdk`|

### Why are the changes needed?
The fix https://github.com/apache/spark/pull/30303 slowed down `date_trunc`. This PR updates benchmark results to have actual info about performance of `date_trunc`.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By regenerating benchmark results:
```
$ SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.execution.benchmark.DateTimeBenchmark"
```

Closes #30338 from MaxGekk/fix-trunc_date-benchmark.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-11-11 08:50:43 -08:00
stczwd 1eb236b936 [SPARK-32512][SQL] add alter table add/drop partition command for datasourcev2
### What changes were proposed in this pull request?
This patch is trying to add `AlterTableAddPartitionExec` and `AlterTableDropPartitionExec` with the new table partition API, defined in #28617.

### Does this PR introduce _any_ user-facing change?
Yes. Users can use `alter table add partition` or `alter table drop partition` to create/drop partitions on a V2 table.
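A hypothetical sketch (`testcat` is a placeholder v2 catalog implementing the new partition API, `foo` a placeholder provider):

```scala
spark.sql("CREATE TABLE testcat.ns.tbl (id BIGINT, dt STRING) USING foo PARTITIONED BY (dt)")
spark.sql("ALTER TABLE testcat.ns.tbl ADD PARTITION (dt = '2020-11-11')")    // AlterTableAddPartitionExec
spark.sql("ALTER TABLE testcat.ns.tbl DROP PARTITION (dt = '2020-11-11')")   // AlterTableDropPartitionExec
```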

### How was this patch tested?
Run suites and fix old tests.

Closes #29339 from stczwd/SPARK-32512-new.

Lead-authored-by: stczwd <qcsd2011@163.com>
Co-authored-by: Jacky Lee <qcsd2011@163.com>
Co-authored-by: Jackey Lee <qcsd2011@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-11-11 09:30:42 +00:00
Takeshi Yamamuro 4b367976a8 [SPARK-33417][SQL][TEST] Correct the behaviour of query filters in TPCDSQueryBenchmark
### What changes were proposed in this pull request?

This PR intends to fix the behaviour of query filters in `TPCDSQueryBenchmark`. We can use the option `--query-filter` for selecting TPCDS queries to run, e.g., `--query-filter q6,q8,q13`. But the current master has a weird behaviour with this option. For example, if we pass `--query-filter q6` so as to run the TPCDS q6 only, `TPCDSQueryBenchmark` runs `q6` and `q6-v2.7` because the `filterQueries` method does not respect the name suffix. So, there is currently no way to run the TPCDS q6 only.

### Why are the changes needed?

Bugfix.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manually checked.

Closes #30324 from maropu/FilterBugInTPCDSQueryBenchmark.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-11-11 15:24:05 +09:00
Terry Kim 6d5d030957 [SPARK-33414][SQL] Migrate SHOW CREATE TABLE command to use UnresolvedTableOrView to resolve the identifier
### What changes were proposed in this pull request?

This PR proposes to migrate `SHOW CREATE TABLE` to use `UnresolvedTableOrView` to resolve the table identifier. This allows consistent resolution rules (temp view first, etc.) to be applied for both v1/v2 commands. More info about the consistent resolution rule proposal can be found in [JIRA](https://issues.apache.org/jira/browse/SPARK-29900) or [proposal doc](https://docs.google.com/document/d/1hvLjGA8y_W_hhilpngXVub1Ebv8RsMap986nENCFnrg/edit?usp=sharing).

Note that `SHOW CREATE TABLE` works only with a v1 table and a permanent view, and is not supported for v2 tables.

### Why are the changes needed?

The changes allow consistent resolution behavior when resolving the table identifier. For example, the following is the current behavior:
```scala
sql("CREATE TEMPORARY VIEW t AS SELECT 1")
sql("CREATE DATABASE db")
sql("CREATE TABLE t (key INT, value STRING) USING hive")
sql("USE db")
sql("SHOW CREATE TABLE t AS SERDE") // Succeeds
```
With this change, `SHOW CREATE TABLE ... AS SERDE` above fails with the following:
```
org.apache.spark.sql.AnalysisException: t is a temp view not table or permanent view.; line 1 pos 0
  at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
  at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveTempViews$$anonfun$apply$7.$anonfun$applyOrElse$43(Analyzer.scala:883)
  at scala.Option.map(Option.scala:230)
```
, which is expected since temporary view is resolved first and `SHOW CREATE TABLE ... AS SERDE` doesn't support a temporary view.

Note that there is no behavior change for `SHOW CREATE TABLE` without `AS SERDE` since it was already resolving to a temporary view first. See below for more detail.

### Does this PR introduce _any_ user-facing change?

After this PR, `SHOW CREATE TABLE t AS SERDE` is resolved to a temp view `t` instead of table `db.t` in the above scenario.

Note that there is no behavior change for `SHOW CREATE TABLE` without `AS SERDE`, but the exception message changes from `SHOW CREATE TABLE is not supported on a temporary view` to `t is a temp view not table or permanent view`.

### How was this patch tested?

Updated existing tests.

Closes #30321 from imback82/show_create_table.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-11-11 05:54:27 +00:00
Max Gekk 1e2eeda20e [SPARK-33382][SQL][TESTS] Unify datasource v1 and v2 SHOW TABLES tests
### What changes were proposed in this pull request?
In the PR, I propose to gather common `SHOW TABLES` tests into one trait `org.apache.spark.sql.execution.command.ShowTablesSuite`, and put datasource-specific tests into `v1.ShowTablesSuite` and `v2.ShowTablesSuite`. Also, tests for parsing `SHOW TABLES` are extracted to `ShowTablesParserSuite`.

### Why are the changes needed?
- The unification will allow running common `SHOW TABLES` tests for both DSv1 and DSv2
- We can detect missing features and differences between DSv1 and DSv2 implementations.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running new test suites:
- `org.apache.spark.sql.execution.command.v1.ShowTablesSuite`
- `org.apache.spark.sql.execution.command.v2.ShowTablesSuite`
- `ShowTablesParserSuite`

Closes #30287 from MaxGekk/unify-dsv1_v2-tests.

Lead-authored-by: Max Gekk <max.gekk@gmail.com>
Co-authored-by: Maxim Gekk <max.gekk@gmail.com>
Co-authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-11-11 05:26:46 +00:00
ulysses 5197c5d2e7 [SPARK-33390][SQL] Make Literal support char array
### What changes were proposed in this pull request?

Make Literal support char array.

### Why are the changes needed?

We always use `Literal()` to create a foldable value, and `char[]` is a common data type. We can make it easy to create a String Literal from a `char[]`.
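A small sketch of the internal catalyst API (the mapping of `char[]` to a string literal is my reading of this change):

```scala
import org.apache.spark.sql.catalyst.expressions.Literal

val chars: Array[Char] = Array('S', 'p', 'a', 'r', 'k')
val lit = Literal(chars)   // previously unsupported input type; now creates a string Literal for "Spark"
```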

### Does this PR introduce _any_ user-facing change?

Yes, users can call `Literal()` with a `char[]`.

### How was this patch tested?

Add test.

Closes #30295 from ulysses-you/SPARK-33390.

Authored-by: ulysses <youxiduo@weidian.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-11-11 11:39:11 +09:00
Chao Sun 4934da56bc [SPARK-33305][SQL] DSv2: DROP TABLE command should also invalidate cache
### What changes were proposed in this pull request?

This changes `DropTableExec` to also invalidate caches referencing the table to be dropped, in a cascading manner.

### Why are the changes needed?

In DSv1, the `DROP TABLE` command also invalidates caches as described in [SPARK-19765](https://issues.apache.org/jira/browse/SPARK-19765). However, in DSv2 the same command only drops the table but doesn't handle the caches. This could lead to correctness issues.

### Does this PR introduce _any_ user-facing change?

Yes. Now DSv2 `DROP TABLE` command also invalidates cache.

### How was this patch tested?

Added a new UT

Closes #30211 from sunchao/SPARK-33305.

Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-11-10 14:37:42 +00:00
Yuanjian Li ad02ceda29 [SPARK-33244][SQL] Unify the code paths for spark.table and spark.read.table
### What changes were proposed in this pull request?

- Call `spark.read.table` in `spark.table`.
- Add comments for `spark.table` to emphasize it also support streaming temp view reading.

### Why are the changes needed?
The code paths of `spark.table` and `spark.read.table` should be the same. This behavior was broken in SPARK-32592 since we needed to respect options in the `spark.read.table` API.
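A sketch with a hypothetical table name, showing the two entry points that now share one code path (`spark.table` calls `spark.read.table` internally):

```scala
val df1 = spark.table("db.tbl")        // hypothetical table
val df2 = spark.read.table("db.tbl")   // same code path after this change
```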

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing UT.

Closes #30148 from xuanyuanking/SPARK-33244.

Authored-by: Yuanjian Li <yuanjian.li@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-11-10 05:46:45 +00:00
Terry Kim 90f6f39e42 [SPARK-33366][SQL] Migrate LOAD DATA command to use UnresolvedTable to resolve the identifier
### What changes were proposed in this pull request?

This PR proposes to migrate `LOAD DATA` to use `UnresolvedTable` to resolve the table identifier. This allows consistent resolution rules (temp view first, etc.) to be applied for both v1/v2 commands. More info about the consistent resolution rule proposal can be found in [JIRA](https://issues.apache.org/jira/browse/SPARK-29900) or [proposal doc](https://docs.google.com/document/d/1hvLjGA8y_W_hhilpngXVub1Ebv8RsMap986nENCFnrg/edit?usp=sharing).

Note that `LOAD DATA` is not supported for v2 tables.

### Why are the changes needed?

The changes allow consistent resolution behavior when resolving the table identifier. For example, the following is the current behavior:
```scala
sql("CREATE TEMPORARY VIEW t AS SELECT 1")
sql("CREATE DATABASE db")
sql("CREATE TABLE t (key INT, value STRING) USING hive")
sql("USE db")
sql("LOAD DATA LOCAL INPATH 'examples/src/main/resources/kv1.txt' INTO TABLE t") // Succeeds
```
With this change, `LOAD DATA` above fails with the following:
```
org.apache.spark.sql.AnalysisException: t is a temp view not table.; line 1 pos 0
    at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveTempViews$$anonfun$apply$7.$anonfun$applyOrElse$39(Analyzer.scala:865)
    at scala.Option.foreach(Option.scala:407)
```
, which is expected since temporary view is resolved first and `LOAD DATA` doesn't support a temporary view.

### Does this PR introduce _any_ user-facing change?

After this PR, `LOAD DATA ... t` is resolved to a temp view `t` instead of table `db.t` in the above scenario.

### How was this patch tested?

Updated existing tests.

Closes #30270 from imback82/load_data_cmd.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-11-10 05:28:06 +00:00
Gengliang Wang a1f84d8714 [SPARK-33369][SQL] DSV2: Skip schema inference in write if table provider supports external metadata
### What changes were proposed in this pull request?

When TableProvider.supportsExternalMetadata() is true, Spark will use the input DataFrame's schema in `DataFrameWriter.save()`/`DataStreamWriter.start()` and skip schema/partitioning inference.
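A minimal sketch of a source opting in, assuming the Spark 3.x DSv2 interfaces (`MySource` and its schema are hypothetical):

```scala
import java.util

import org.apache.spark.sql.connector.catalog.{Table, TableProvider}
import org.apache.spark.sql.connector.expressions.Transform
import org.apache.spark.sql.types.StructType
import org.apache.spark.sql.util.CaseInsensitiveStringMap

class MySource extends TableProvider {
  // Tell Spark it may pass the write DataFrame's schema instead of inferring one.
  override def supportsExternalMetadata(): Boolean = true

  // Still used for reads when no schema is supplied by the user.
  override def inferSchema(options: CaseInsensitiveStringMap): StructType =
    new StructType().add("id", "bigint")

  override def getTable(
      schema: StructType,
      partitioning: Array[Transform],
      properties: util.Map[String, String]): Table = ???   // return your Table implementation here
}
```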

### Why are the changes needed?

For all the v2 data sources which are not FileDataSourceV2, Spark always infers the table schema/partitioning on `DataFrameWriter.save()`/`DataStreamWriter.start()`.
The inference of table schema/partitioning can be expensive. However, there was no trait or flag for indicating that a V2 source can use the input DataFrame's schema on `DataFrameWriter.save()`/`DataStreamWriter.start()`. We resolve the problem by adding a new expected behavior for the method `TableProvider.supportsExternalMetadata()`.

### Does this PR introduce _any_ user-facing change?

Yes, a new behavior for the data source v2 API `TableProvider.supportsExternalMetadata()` when it returns true.

### How was this patch tested?

Unit test

Closes #30273 from gengliangwang/supportsExternalMetadata.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-11-10 04:43:32 +00:00
Gabor Somogyi 4ac8133866 [SPARK-33223][SS][UI] Structured Streaming Web UI state information
### What changes were proposed in this pull request?
The Structured Streaming UI does not contain state information. In this PR I've added it.

### Why are the changes needed?
Missing state information.

### Does this PR introduce _any_ user-facing change?
Additional UI elements appear.

### How was this patch tested?
Existing unit tests + manual test.
<img width="1044" alt="Screenshot 2020-10-30 at 15 14 21" src="https://user-images.githubusercontent.com/18561820/97715405-a1797000-1ac2-11eb-886a-e3e6efa3af3e.png">

Closes #30151 from gaborgsomogyi/SPARK-33223.

Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
2020-11-10 11:22:35 +09:00
Peter Toth 84dc374611 [SPARK-33303][SQL] Deduplicate deterministic PythonUDF calls
### What changes were proposed in this pull request?
This PR modifies the `ExtractPythonUDFs` rule to deduplicate deterministic PythonUDF calls.

Before this PR the dataframe: `df.withColumn("c", batchedPythonUDF(col("a"))).withColumn("d", col("c"))` has the plan:
```
*(1) Project [value#1 AS a#4, pythonUDF1#15 AS c#7, pythonUDF1#15 AS d#10]
+- BatchEvalPython [dummyUDF(value#1), dummyUDF(value#1)], [pythonUDF0#14, pythonUDF1#15]
   +- LocalTableScan [value#1]
```
After this PR the deterministic PythonUDF calls are deduplicated:
```
*(1) Project [value#1 AS a#4, pythonUDF0#14 AS c#7, pythonUDF0#14 AS d#10]
+- BatchEvalPython [dummyUDF(value#1)], [pythonUDF0#14]
   +- LocalTableScan [value#1]
```

### Why are the changes needed?
To fix a performance issue.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
New and existing UTs.

Closes #30203 from peter-toth/SPARK-33303-deduplicate-deterministic-udf-calls.

Authored-by: Peter Toth <peter.toth@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-11-09 19:27:36 +09:00
Linhong Liu 4e1c89400d [SPARK-33140][SQL][FOLLOW-UP] Use sparkSession in AQE context when applying rules
### What changes were proposed in this pull request?
After #30097, all rules are using `SparkSession.active` to get `SQLConf`
and `SparkSession`. But in AQE, when applying the rules for the initial plan,
we should use the spark session in AQE context.

### Why are the changes needed?
Fix potential problem caused by using the wrong spark session

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing ut

Closes #30294 from linhongliu-db/SPARK-33140-followup.

Authored-by: Linhong Liu <linhong.liu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-11-09 09:44:58 +00:00
Yuming Wang 7a5647a93a [SPARK-33385][SQL] Support bucket pruning for IsNaN
### What changes were proposed in this pull request?

This PR adds support for bucket pruning on the `IsNaN` predicate.
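A hypothetical sketch in the same style as the manual test of SPARK-33372 below: a table bucketed on a DOUBLE column, filtered with `isnan()` on the bucket column:

```scala
spark.sql("SELECT CAST(id AS DOUBLE) AS a, id AS b FROM range(10000)")
  .write.bucketBy(100, "a").saveAsTable("t_double")

spark.sql("SELECT * FROM t_double WHERE isnan(a)").show()   // the isnan(a) predicate can now prune buckets
```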

### Why are the changes needed?

Improve query performance.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test.

Closes #30291 from wangyum/SPARK-33385.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-11-09 09:20:31 +00:00
Yuming Wang 69799c514f [SPARK-33372][SQL] Fix InSet bucket pruning
### What changes were proposed in this pull request?

This PR fixes `InSet` bucket pruning, because its values are not `Literal`s:
cbd3fdea62/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/expressions.scala (L253-L255)

### Why are the changes needed?

Fix bug.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test and manual test:

```scala
spark.sql("select id as a, id as b from range(10000)").write.bucketBy(100, "a").saveAsTable("t")
spark.sql("select * from t where a in (1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11)").show
```

Before this PR | After this PR
-- | --
![image](https://user-images.githubusercontent.com/5399861/98380788-fb120980-2083-11eb-8fae-4e21ad873e9b.png) | ![image](https://user-images.githubusercontent.com/5399861/98381095-5ba14680-2084-11eb-82ca-2d780c85305c.png)

Closes #30279 from wangyum/SPARK-33372.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-11-09 08:32:51 +00:00
Wenchen Fan 98730b7ee2 [SPARK-33087][SQL] DataFrameWriterV2 should delegate table resolution to the analyzer
### What changes were proposed in this pull request?

This PR makes `DataFrameWriterV2` to create query plans with `UnresolvedRelation` and leave the table resolution work to the analyzer.

### Why are the changes needed?

Table resolution work should be done by the analyzer. After this PR, the behavior is more consistent between different APIs (DataFrameWriter, DataFrameWriterV2 and SQL). See the next section for behavior changes.

### Does this PR introduce _any_ user-facing change?

Yes.
1. writes to a temp view of v2 relation: previously it fails with table not found exception, now it works if the v2 relation is writable. This is consistent with `DataFrameWriter` and SQL INSERT.
2. writes to other temp views: previously it fails with table not found exception, now it fails with a more explicit error message, saying that writing to a temp view of non-v2-relation is not allowed.
3. writes to a view: previously it fails with table not writable error, now it fails with a more explicit error message, saying that writing to a view is not allowed.
4. writes to a v1 table: previously it fails with table not writable error, now it fails with a more explicit error message, saying that writing to a v1 table is not allowed. (We can allow it later, by falling back to v1 command)

### How was this patch tested?

new tests

Closes #29970 from cloud-fan/refactor.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-11-09 08:08:00 +00:00
Huaxin Gao bfb257f078 [SPARK-32405][SQL] Apply table options while creating tables in JDBC Table Catalog
### What changes were proposed in this pull request?
Currently in JDBCTableCatalog, we ignore the table options when creating a table.
```
    // TODO (SPARK-32405): Apply table options while creating tables in JDBC Table Catalog
    if (!properties.isEmpty) {
      logWarning("Cannot create JDBC table with properties, these properties will be " +
        "ignored: " + properties.asScala.map { case (k, v) => s"$k=$v" }.mkString("[", ", ", "]"))
    }
```

### Why are the changes needed?
We need to apply the table options when we create a table.

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
add new test

Closes #30154 from huaxingao/table_options.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-11-09 07:02:14 +00:00
Liang-Chi Hsieh c269b53f07 [SPARK-33384][SS] Delete temporary file when cancelling writing to final path even underlying stream throwing error
### What changes were proposed in this pull request?

In `RenameBasedFSDataOutputStream.cancel`, we do two things in a single try/catch block: close the underlying stream and delete the temporary file. Closing the `OutputStream` could possibly throw an `IOException`, so we could miss deleting the temporary file.

This patch proposes to delete the temporary file even when the underlying stream throws an error.

### Why are the changes needed?

To avoid leaving temporary files behind when canceling a write in `RenameBasedFSDataOutputStream`.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit test.

Closes #30290 from viirya/SPARK-33384.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-11-08 18:44:26 -08:00
yangjie01 02fd52cfbc [SPARK-33352][CORE][SQL][SS][MLLIB][AVRO][K8S] Fix procedure-like declaration compilation warnings in Scala 2.13
### What changes were proposed in this pull request?
There are two similar compilation warnings about procedure-like declaration in Scala 2.13:

```
[WARNING] [Warn] /spark/core/src/main/scala/org/apache/spark/HeartbeatReceiver.scala:70: procedure syntax is deprecated for constructors: add `=`, as in method definition
```
and

```
[WARNING] [Warn] /spark/core/src/main/scala/org/apache/spark/storage/BlockManagerDecommissioner.scala:211: procedure syntax is deprecated: instead, add `: Unit =` to explicitly declare `run`'s return type
```

This PR is the first part of resolving SPARK-33352 (see the sketch after this list):

- For constructor definitions, add `=` to convert to function syntax

- For method definitions without a return type, add `: Unit =` to convert to function syntax
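A minimal sketch of the two fixes on an illustrative class (not taken from the Spark codebase):

```scala
class Poller(intervalMs: Long) {
  // Before: def this() { this(1000L) }      -- procedure syntax for a constructor
  def this() = { this(1000L) }               // after: add '='

  // Before: def run() { println("tick") }   -- procedure syntax for a method
  def run(): Unit = { println("tick") }      // after: add ': Unit ='
}
```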

### Why are the changes needed?
Eliminate compilation warnings in Scala 2.13; this change should be compatible with Scala 2.12.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass the Jenkins or GitHub Action

Closes #30255 from LuciferYang/SPARK-29392-FOLLOWUP.1.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-11-08 12:51:48 -06:00
Terry Kim 68c032c246 [SPARK-33364][SQL] Introduce the "purge" option in TableCatalog.dropTable for v2 catalog
### What changes were proposed in this pull request?

This PR proposes to introduce the `purge` option in `TableCatalog.dropTable` so that v2 catalogs can use the option if needed.

Related discussion: https://github.com/apache/spark/pull/30079#discussion_r510594110

### Why are the changes needed?

Spark DDL supports passing the purge option to `DROP TABLE` command. However, the option is not used (ignored) for v2 catalogs.

### Does this PR introduce _any_ user-facing change?

This PR introduces a new API in `TableCatalog`.

### How was this patch tested?

Added a test.

Closes #30267 from imback82/purge_table.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-11-05 22:00:45 -08:00
Prashant Sharma 733a468726 [SPARK-33130][SQL] Support ALTER TABLE in JDBC v2 Table Catalog: add, update type and nullability of columns (MsSqlServer dialect)
### What changes were proposed in this pull request?

Override the default SQL strings for:
ALTER TABLE RENAME COLUMN
ALTER TABLE UPDATE COLUMN NULLABILITY
in the MsSqlServer JDBC dialect according to the official documentation.
Write MsSqlServer integration tests for JDBC.

### Why are the changes needed?

To add the support for alter table when interacting with MSSql Server.

### Does this PR introduce _any_ user-facing change?

### How was this patch tested?

added tests

Closes #30038 from ScrapCodes/mssql-dialect.

Authored-by: Prashant Sharma <prashsh1@in.ibm.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-11-06 05:46:38 +00:00
Wenchen Fan d16311051d [SPARK-32934][SQL][FOLLOW-UP] Refine class naming and code comments
### What changes were proposed in this pull request?

1. Rename `OffsetWindowSpec` to `OffsetWindowFunction`, as it's the base class for all offset based window functions.
2. Refine and add more comments.
3. Remove `isRelative` as it's useless.

### Why are the changes needed?

code refinement

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

existing tests

Closes #30261 from cloud-fan/window.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-11-06 05:20:25 +00:00
Wenchen Fan cd4e3d3b0c [SPARK-33360][SQL] Simplify DS v2 write resolution
### What changes were proposed in this pull request?

Removing duplicated code in `ResolveOutputRelation` by adding `V2WriteCommand.withNewQuery`.

### Why are the changes needed?

code cleanup

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

existing tests

Closes #30264 from cloud-fan/ds-minor.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-11-05 15:44:04 -08:00
Jungtaek Lim (HeartSaVioR) 21413b7dd4 [SPARK-30294][SS] Explicitly defines read-only StateStore and optimize for HDFSBackedStateStore
### What changes were proposed in this pull request?

There's a concept of 'read-only' and 'read+write' state stores in Spark which is defined "implicitly". Spark doesn't prevent writes to a 'read-only' state store; Spark just assumes a read-only stateful operator will not modify the state store. Given it's not defined explicitly, the state store instance has to be implemented as 'read+write' even when it's being used as 'read-only', which sometimes brings confusion.

For example, abort() in HDFSBackedStateStore - d38f816748/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/HDFSBackedStateStoreProvider.scala (L143-L155)

The comment sounds as if the behavior differs between 'read-only' and 'read+write', but that's not true, as both state stores have their state initialized as UPDATING (no difference). So a 'read-only' state store also creates the temporary file, initializes output streams to write to the temporary file, closes the output streams, and finally deletes the temporary file. These unnecessary operations are done per batch/partition.

This patch explicitly defines a 'read-only' StateStore and enables the state store provider to create a 'read-only' StateStore instance if requested. Relevant code paths are modified, and a 'read-only' StateStore implementation for HDFSBackedStateStore is introduced. The new implementation gets rid of the unnecessary operations explained above.

From a backward-compatibility point of view, the only thing changed on the public API side is `StateStoreProvider`. The trait `StateStoreProvider` has to be changed to allow requesting a 'read-only' StateStore; this patch adds a default implementation which leverages the 'read+write' StateStore but wraps it with a 'write-protected' StateStore instance, so that custom providers don't need to change their code to reflect the change. But if providers can optimize for the read-only workload, they are free to make that change.

Please note that this patch makes ReadOnlyStateStore extend StateStore and be referred to as StateStore, as StateStore is used in so many places and it's not easy to support both traits if we differentiate them. So unfortunately the write methods are still exposed for read-only state; they just throw UnsupportedOperationException.
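
A minimal, self-contained sketch (hypothetical trait, not Spark's actual `StateStore` API) of the write-protected wrapper described above:

```scala
// Stand-in for the real StateStore interface.
trait KVStore {
  def get(key: String): Option[String]
  def put(key: String, value: String): Unit
  def abort(): Unit
}

// Wraps a read+write store so the write paths fail fast for read-only operators.
class WriteProtectedStore(underlying: KVStore) extends KVStore {
  override def get(key: String): Option[String] = underlying.get(key)
  override def put(key: String, value: String): Unit =
    throw new UnsupportedOperationException("Cannot put into a read-only state store")
  // abort() becomes a no-op: no temporary file or output stream is ever created.
  override def abort(): Unit = ()
}
```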

### Why are the changes needed?

The new API opens the chance to optimize a read-only state store instance compared with a read+write state store instance. HDFSBackedStateStoreProvider is modified to provide a read-only version of the state store which doesn't deal with the temporary file nor the state machine.

### Does this PR introduce any user-facing change?

Clearly "no" for most end users, and also "no" for custom state store providers as it doesn't touch trait `StateStore` as well as provides default implementation for added method in trait `StateStoreProvider`.

### How was this patch tested?

Modified UT. Existing UTs ensure the change doesn't break anything.

Closes #26935 from HeartSaVioR/SPARK-30294.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
2020-11-05 18:21:17 +09:00
HyukjinKwon d530ed0ea8 Revert "[SPARK-33277][PYSPARK][SQL] Use ContextAwareIterator to stop consuming after the task ends"
This reverts commit b8a440f098.
2020-11-05 16:15:17 +09:00
Dongjoon Hyun 42c0b175ce [SPARK-33338][SQL] GROUP BY using literal map should not fail
### What changes were proposed in this pull request?

This PR aims to make `semanticEquals` work correctly on `GetMapValue` expressions having literal maps with `ArrayBasedMapData` and `GenericArrayData`.

### Why are the changes needed?

This is a regression from Apache Spark 1.6.x.
```scala
scala> sc.version
res1: String = 1.6.3

scala> sqlContext.sql("SELECT map('k1', 'v1')[k] FROM t GROUP BY map('k1', 'v1')[k]").show
+---+
|_c0|
+---+
| v1|
+---+
```

Apache Spark 2.x ~ 3.0.1 raises a `RuntimeException` for the following queries.
```sql
CREATE TABLE t USING ORC AS SELECT map('k1', 'v1') m, 'k1' k
SELECT map('k1', 'v1')[k] FROM t GROUP BY 1
SELECT map('k1', 'v1')[k] FROM t GROUP BY map('k1', 'v1')[k]
SELECT map('k1', 'v1')[k] a FROM t GROUP BY a
```

**BEFORE**
```scala
Caused by: java.lang.RuntimeException: Couldn't find k#3 in [keys: [k1], values: [v1][k#3]#6]
	at scala.sys.package$.error(package.scala:27)
	at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1$$anonfun$applyOrElse$1.apply(BoundAttribute.scala:85)
	at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1$$anonfun$applyOrElse$1.apply(BoundAttribute.scala:79)
	at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52)
```

**AFTER**
```sql
spark-sql> SELECT map('k1', 'v1')[k] FROM t GROUP BY 1;
v1
Time taken: 1.278 seconds, Fetched 1 row(s)
spark-sql> SELECT map('k1', 'v1')[k] FROM t GROUP BY map('k1', 'v1')[k];
v1
Time taken: 0.313 seconds, Fetched 1 row(s)
spark-sql> SELECT map('k1', 'v1')[k] a FROM t GROUP BY a;
v1
Time taken: 0.265 seconds, Fetched 1 row(s)
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs with the newly added test case.

Closes #30246 from dongjoon-hyun/SPARK-33338.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-11-04 08:35:10 -08:00
Terry Kim 0ad35ba5f8 [SPARK-33321][SQL] Migrate ANALYZE TABLE commands to use UnresolvedTableOrView to resolve the identifier
### What changes were proposed in this pull request?

This PR proposes to migrate `ANALYZE TABLE` and `ANALYZE TABLE ... FOR COLUMNS` to use `UnresolvedTableOrView` to resolve the table/view identifier. This allows consistent resolution rules (temp view first, etc.) to be applied for both v1/v2 commands. More info about the consistent resolution rule proposal can be found in [JIRA](https://issues.apache.org/jira/browse/SPARK-29900) or [proposal doc](https://docs.google.com/document/d/1hvLjGA8y_W_hhilpngXVub1Ebv8RsMap986nENCFnrg/edit?usp=sharing).

Note that `ANALYZE TABLE` is not supported for v2 tables.

### Why are the changes needed?

The changes allow consistent resolution behavior when resolving the table/view identifier. For example, the following is the current behavior:
```scala
sql("create temporary view t as select 1")
sql("create database db")
sql("create table db.t using csv as select 1")
sql("use db")
sql("ANALYZE TABLE t compute statistics") // Succeeds
```
With this change, ANALYZE TABLE above fails with the following:
```
    org.apache.spark.sql.AnalysisException: t is a temp view not table or permanent view.; line 1 pos 0
	at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
	at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveTempViews$$anonfun$apply$7.$anonfun$applyOrElse$40(Analyzer.scala:872)
	at scala.Option.map(Option.scala:230)
	at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveTempViews$$anonfun$apply$7.applyOrElse(Analyzer.scala:870)
	at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveTempViews$$anonfun$apply$7.applyOrElse(Analyzer.scala:856)
```
, which is expected since the temporary view is resolved first and ANALYZE TABLE doesn't support a temporary view.

### Does this PR introduce _any_ user-facing change?

After this PR, `ANALYZE TABLE t` is resolved to a temp view `t` instead of table `db.t`.

### How was this patch tested?

Updated existing tests.

Closes #30229 from imback82/parse_v1table.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-11-04 06:50:37 +00:00
Wenchen Fan 034070a23a Revert "[SPARK-33248][SQL] Add a configuration to control the legacy behavior of whether need to pad null value when value size less then schema size"
This reverts commit 0c943cd2fb.
2020-11-04 12:30:38 +08:00
Chao Sun d900c6ff49 [SPARK-33293][SQL][FOLLOW-UP] Rename TableWriteExec to TableWriteExecHelper
### What changes were proposed in this pull request?

Rename `TableWriteExec` in `WriteToDataSourceV2Exec.scala` to `TableWriteExecHelper`.

### Why are the changes needed?

See [discussion](https://github.com/apache/spark/pull/30193#discussion_r516412653). The former is too general.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

N/A

Closes #30235 from sunchao/SPARK-33293-2.

Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-11-03 14:53:01 -08:00
Max Gekk bdabf60fb4 [SPARK-33299][SQL][DOCS] Don't mention schemas in JSON format in docs for from_json
### What changes were proposed in this pull request?
Remove the JSON formatted schema from comments for `from_json()` in Scala/Python APIs.

Closes #30201

### Why are the changes needed?
Schemas in JSON format are internal (not documented). They shouldn't be recommended for usage.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By linters.

Closes #30226 from MaxGekk/from_json-common-schema-parsing-2.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-11-02 10:10:24 -08:00
Cheng Su e52b858ef7 [SPARK-33027][SQL] Add DisableUnnecessaryBucketedScan rule to AQE
### What changes were proposed in this pull request?

As a followup to the comment in https://github.com/apache/spark/pull/29804#issuecomment-700650620, here we add the physical plan rule DisableUnnecessaryBucketedScan into AQE's AdaptiveSparkPlanExec.queryStagePreparationRules, to make auto bucketed scan work with AQE.

The change is mostly in:
* `AdaptiveSparkPlanExec.scala`: add physical plan rule `DisableUnnecessaryBucketedScan`
* `DisableUnnecessaryBucketedScan.scala`: propagate the logical plan link for the file source scan exec operator; otherwise we lose the logical plan link information when AQE is enabled and will get an exception [here](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala#L176) (for example, for the query `SELECT * FROM bucketed_table` with AQE enabled).
* `DisableUnnecessaryBucketedScanSuite.scala`: add new test suite for AQE enabled - `DisableUnnecessaryBucketedScanWithoutHiveSupportSuiteAE`, and changed some of tests to use `AdaptiveSparkPlanHelper.find/collect`, to make the plan verification work when AQE enabled.

### Why are the changes needed?

It's reasonable to add support for disabling unnecessary bucketed scans when AQE is enabled; this helps optimize queries automatically in that case.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added unit test in `DisableUnnecessaryBucketedScanSuite`.

Closes #30200 from c21/auto-bucket-aqe.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-11-02 06:44:07 +00:00
Prashant Sharma 6226ccc092 [SPARK-33095] Follow up, support alter table column rename
### What changes were proposed in this pull request?

Support rename column for mysql dialect.

### Why are the changes needed?

At the moment, it does not work for MySQL version 5.x, so we should throw a proper exception for that case.
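
A hedged sketch of the version guard (the helper name and wiring are assumptions; `RENAME COLUMN` itself is only available starting with MySQL 8.0):

```scala
// Build the rename statement, or fail with a clear message on MySQL 5.x.
def renameColumnSql(table: String, from: String, to: String, serverMajorVersion: Int): String = {
  if (serverMajorVersion < 8) {
    throw new UnsupportedOperationException(
      "Rename column is only supported for MySQL version 8.0 and above.")
  }
  s"ALTER TABLE $table RENAME COLUMN $from TO $to"
}
```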

### Does this PR introduce _any_ user-facing change?

Yes, `column rename` with mysql dialect should work correctly.

### How was this patch tested?

Added tests for rename column.
Ran the tests to pass with both versions of mysql.

* `export MYSQL_DOCKER_IMAGE_NAME=mysql:5.7.31`

* `export MYSQL_DOCKER_IMAGE_NAME=mysql:8.0`

Closes #30142 from ScrapCodes/mysql-dialect-rename.

Authored-by: Prashant Sharma <prashsh1@in.ibm.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-11-02 05:03:41 +00:00
Takuya UESHIN b8a440f098 [SPARK-33277][PYSPARK][SQL] Use ContextAwareIterator to stop consuming after the task ends
### What changes were proposed in this pull request?

As the Python evaluation consumes the parent iterator in a separate thread, it could consume more data from the parent even after the task ends and the parent is closed. Thus, we should use `ContextAwareIterator` to stop consuming after the task ends.
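
A minimal sketch (hypothetical, not the actual class added by this PR) of an iterator wrapper that stops yielding elements once the task is marked complete:

```scala
// The completion check is injected so the sketch stays self-contained;
// in Spark it would come from the task context.
class ContextAwareIterator[T](underlying: Iterator[T], isTaskCompleted: () => Boolean)
    extends Iterator[T] {
  override def hasNext: Boolean = !isTaskCompleted() && underlying.hasNext
  override def next(): T = underlying.next()
}

// Usage: once the flag flips, hasNext returns false and consumption stops.
var done = false
val it = new ContextAwareIterator(Iterator(1, 2, 3), () => done)
println(it.next()) // 1
done = true
println(it.hasNext) // false
```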

### Why are the changes needed?

Python/Pandas UDF right after off-heap vectorized reader could cause executor crash.

E.g.,:

```py
spark.range(0, 100000, 1, 1).write.parquet(path)

spark.conf.set("spark.sql.columnVector.offheap.enabled", True)

def f(x):
    return 0

fUdf = udf(f, LongType())

spark.read.parquet(path).select(fUdf('id')).head()
```

This is because the Python evaluation consumes the parent iterator in a separate thread, and it consumes more data from the parent even after the task ends and the parent is closed. If an off-heap column vector exists in the parent iterator, it could cause a segmentation fault which crashes the executor.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added tests, and manually.

Closes #30177 from ueshin/issues/SPARK-33277/python_pandas_udf.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-11-01 20:28:12 +09:00
wangguangxin.cn 69c27f49ac [SPARK-33306][SQL] Timezone is needed when cast date to string
### What changes were proposed in this pull request?
When `spark.sql.legacy.typeCoercion.datetimeToString.enabled` is enabled, Spark will cast a date to a string when comparing a date with a string. In Spark 3, a timezone is needed when casting a date to a string, as in 72ad9dcd5d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala (L309).

However, the timezone may not be set because `CastBase.needsTimeZone` returns false for this kind of cast.

A simple way to reproduce this is
```
spark-shell --conf spark.sql.legacy.typeCoercion.datetimeToString.enabled=true

```
when we execute the following sql,
```
select a.d1 from
(select to_date(concat('2000-01-0', id)) as d1 from range(1, 2)) a
join
(select concat('2000-01-0', id) as d2 from range(1, 2)) b
on a.d1 = b.d2
```
it will throw
```
java.util.NoSuchElementException: None.get
  at scala.None$.get(Option.scala:529)
  at scala.None$.get(Option.scala:527)
  at org.apache.spark.sql.catalyst.expressions.TimeZoneAwareExpression.zoneId(datetimeExpressions.scala:56)
  at org.apache.spark.sql.catalyst.expressions.TimeZoneAwareExpression.zoneId$(datetimeExpressions.scala:56)
  at org.apache.spark.sql.catalyst.expressions.CastBase.zoneId$lzycompute(Cast.scala:253)
  at org.apache.spark.sql.catalyst.expressions.CastBase.zoneId(Cast.scala:253)
  at org.apache.spark.sql.catalyst.expressions.CastBase.dateFormatter$lzycompute(Cast.scala:287)
  at org.apache.spark.sql.catalyst.expressions.CastBase.dateFormatter(Cast.scala:287)
```

### Why are the changes needed?
As described above, this is a bug.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Add more UT

Closes #30213 from WangGuangxin/SPARK-33306.

Authored-by: wangguangxin.cn <wangguangxin.cn@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-10-31 15:14:46 -07:00
Chao Sun c51e5fc14b [SPARK-33293][SQL] Refactor WriteToDataSourceV2Exec and reduce code duplication
### What changes were proposed in this pull request?

Refactor `WriteToDataSourceV2Exec` by removing code duplication around the write-to-table logic:
- renamed `AtomicTableWriteExec` to `TableWriteExec` so that the table write logic in this trait can be modified and shared with `CreateTableAsSelectExec`, `ReplaceTableAsSelectExec`, `AtomicCreateTableAsSelectExec ` and `AtomicReplaceTableAsSelectExec`.
- similar to the above, renamed `writeToStagedTable` to `writeToTable` in `TableWriteExec`.
- extended `writeToTable` so that it can handle both staged table as well as non-staged table.

### Why are the changes needed?

Simplify the logic and remove duplication, to make this piece of code easier to maintain.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Pass CIs with the existing test coverage.

Closes #30193 from sunchao/SPARK-33293.

Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-10-31 10:01:31 -07:00
Chao Sun 32b78d3795 [SPARK-33290][SQL] REFRESH TABLE should invalidate cache even though the table itself may not be cached
### What changes were proposed in this pull request?

In `CatalogImpl.refreshTable`, this moves the `uncacheQuery` call out of the condition `if (cache.nonEmpty)` so that it will be called whether the table itself is cached or not.

### Why are the changes needed?

In the case like the following:
```sql
CREATE TABLE t ...;
CREATE VIEW t1 AS SELECT * FROM t;
REFRESH TABLE t;
```

If the table `t` is refreshed, the view `t1`, which depends on `t`, will not be invalidated. This could lead to incorrect results and is similar to [SPARK-19765](https://issues.apache.org/jira/browse/SPARK-19765).

On the other hand, if we have:

```sql
CREATE TABLE t ...;
CACHE TABLE t;
CREATE VIEW t1 AS SELECT * FROM t;
REFRESH TABLE t;
```

Then the view `t1` will be refreshed. The behavior is somewhat inconsistent.

### Does this PR introduce _any_ user-facing change?

Yes. With this change, any cache that depends on the refreshed table will be invalidated. Previously this only happened if the table itself was cached.

### How was this patch tested?

Added a new UT for the case.

Closes #30187 from sunchao/SPARK-33290.

Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-10-31 09:49:18 -07:00
ulysses d59f6a7095 [SPARK-33294][SQL] Add query resolved check before analyze InsertIntoDir
### What changes were proposed in this pull request?

Add `query.resolved` before analyze `InsertIntoDir`.

### Why are the changes needed?

For a better error message.
```
INSERT OVERWRITE DIRECTORY '/tmp/file' USING PARQUET
SELECT * FROM (
 SELECT c3 FROM (
  SELECT c1, c2 from values(1,2) t(c1, c2)
  )
)
```
Before this PR, we got an error message like this:
```
org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to toAttribute on unresolved object, tree: *
  at org.apache.spark.sql.catalyst.analysis.Star.toAttribute(unresolved.scala:244)
  at org.apache.spark.sql.catalyst.plans.logical.Project$$anonfun$output$1.apply(basicLogicalOperators.scala:52)
  at org.apache.spark.sql.catalyst.plans.logical.Project$$anonfun$output$1.apply(basicLogicalOperators.scala:52)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
  at scala.collection.immutable.List.foreach(List.scala:392)
```

### Does this PR introduce _any_ user-facing change?

Yes, the error message changed.

### How was this patch tested?

New test.

Closes #30197 from ulysses-you/SPARK-33294.

Authored-by: ulysses <youxiduo@weidian.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-10-30 08:18:10 +00:00
angerszhu 0c943cd2fb [SPARK-33248][SQL] Add a configuration to control the legacy behavior of whether need to pad null value when value size less then schema size
### What changes were proposed in this pull request?
Add a configuration to control the legacy behavior of whether to pad null values when the value size is less than the schema size.
Since we can't decide whether it's a bug, and some users need the behavior to stay the same as Hive's, make it configurable.

### Why are the changes needed?
Provides a compatible choice between the historical behavior and Hive's behavior.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing UTs

Closes #30156 from AngersZhuuuu/SPARK-33284.

Lead-authored-by: angerszhu <angers.zhu@gmail.com>
Co-authored-by: AngersZhuuuu <angers.zhu@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-10-30 14:11:25 +09:00
Max Gekk 343e0bb3ad [SPARK-33286][SQL] Improve the error message about schema parsing by from_json/from_csv
### What changes were proposed in this pull request?
In the PR, I propose to improve the error message from `from_json`/`from_csv` by combining errors from all schema parsers:
- DataType.fromJson (except CSV)
- CatalystSqlParser.parseDataType
- CatalystSqlParser.parseTableSchema

Before the changes, `from_json` does not show error messages from the first parser in the chain, which could mislead users.

### Why are the changes needed?
Currently, `from_json` outputs the error message from the fallback schema parser which can confuse end-users. For example:

```scala
    val invalidJsonSchema = """{"fields": [{"a":123}], "type": "struct"}"""
    df.select(from_json($"json", invalidJsonSchema, Map.empty[String, String])).show()
```
The JSON schema has an issue in `{"a":123}` but the error message doesn't point it out:
```
mismatched input '{' expecting {'ADD', 'AFTER', ...}(line 1, pos 0)

== SQL ==
{"fields": [{"a":123}], "type": "struct"}
^^^

org.apache.spark.sql.catalyst.parser.ParseException:
mismatched input '{' expecting {'ADD', 'AFTER',  ... }(line 1, pos 0)

== SQL ==
{"fields": [{"a":123}], "type": "struct"}
^^^
```

### Does this PR introduce _any_ user-facing change?
Yes, after the changes for the example above:
```
Cannot parse the schema in JSON format: Failed to convert the JSON string '{"a":123}' to a field.
Failed fallback parsing: Cannot parse the data type:
mismatched input '{' expecting {'ADD', 'AFTER', ...}(line 1, pos 0)

== SQL ==
{"fields": [{"a":123}], "type": "struct"}
^^^

Failed fallback parsing:
mismatched input '{' expecting {'ADD', 'AFTER', ...}(line 1, pos 0)

== SQL ==
{"fields": [{"a":123}], "type": "struct"}
^^^
```

### How was this patch tested?
- By existing tests suites like `JsonFunctionsSuite` and `JsonExpressionsSuite`.
- Add new test to `JsonFunctionsSuite`.
- Re-gen results for `json-functions.sql`.

Closes #30183 from MaxGekk/fromDDL-error-msg.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-10-30 11:18:47 +09:00
luluorta cbd3fdea62 [SPARK-33008][SQL] Division by zero on divide-like operations returns incorrect result
### What changes were proposed in this pull request?
In ANSI mode, when a division by zero occurs performing a divide-like operation (Divide, IntegralDivide, Remainder or Pmod), we are returning an incorrect value. Instead, we should throw an exception, as stated in the SQL standard.

### Why are the changes needed?
Otherwise, the result is corrupted.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
added UT + existing UTs (improved)

Closes #29882 from luluorta/SPARK-33008.

Authored-by: luluorta <luluorta@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-10-29 16:44:17 +00:00
Liang-Chi Hsieh 056b62264b [SPARK-33263][SS] Configurable StateStore compression codec
### What changes were proposed in this pull request?

This patch proposes to make StateStore compression codec configurable.

### Why are the changes needed?

Currently the compression codec of StateStore is not configurable and hard-coded to be lz4. It is better if we can follow Spark's other modules and make the compression codec of StateStore configurable. For example, we could choose the zstd codec, and zstd is configurable with different compression levels.

### Does this PR introduce _any_ user-facing change?

Yes, after this change users can config different codec for StateStore.

### How was this patch tested?

Unit test.

Closes #30162 from viirya/SPARK-33263.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-10-29 07:44:44 -07:00
Max Gekk b409025641 [SPARK-33281][SQL] Return SQL schema instead of Catalog string from the SchemaOfCsv expression
### What changes were proposed in this pull request?
Return schema in SQL format instead of Catalog string from the SchemaOfCsv expression.

### Why are the changes needed?
To unify output of the `schema_of_json()` and `schema_of_csv()`.

### Does this PR introduce _any_ user-facing change?
Yes, but `schema_of_csv()` is usually used in combination with `from_csv()`, so the format of the schema shouldn't matter much.

Before:
```
> SELECT schema_of_csv('1,abc');
  struct<_c0:int,_c1:string>
```

After:
```
> SELECT schema_of_csv('1,abc');
  STRUCT<`_c0`: INT, `_c1`: STRING>
```

### How was this patch tested?
By existing test suites `CsvFunctionsSuite` and `CsvExpressionsSuite`.

Closes #30180 from MaxGekk/schema_of_csv-sql-schema.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-10-29 21:02:10 +09:00
Max Gekk 9d5e48ea95 [SPARK-33270][SQL] Return SQL schema instead of Catalog string from the SchemaOfJson expression
### What changes were proposed in this pull request?
Return schema in SQL format instead of Catalog string from the `SchemaOfJson` expression.

### Why are the changes needed?
In some cases, `from_json()` cannot parse schemas returned by `schema_of_json`, for instance, when JSON fields have spaces (gaps). Such fields will be quoted after the changes, and can be parsed by `from_json()`.

Here is the example:
```scala
val in = Seq("""{"a b": 1}""").toDS()
in.select(from_json('value, schema_of_json("""{"a b": 100}""")) as "parsed")
```
raises the exception:
```
== SQL ==
struct<a b:bigint>
------^^^

	at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:263)
	at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:130)
	at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parseTableSchema(ParseDriver.scala:76)
	at org.apache.spark.sql.types.DataType$.fromDDL(DataType.scala:131)
	at org.apache.spark.sql.catalyst.expressions.ExprUtils$.evalTypeExpr(ExprUtils.scala:33)
	at org.apache.spark.sql.catalyst.expressions.JsonToStructs.<init>(jsonExpressions.scala:537)
	at org.apache.spark.sql.functions$.from_json(functions.scala:4141)
```

### Does this PR introduce _any_ user-facing change?
Yes. For example, `schema_of_json` for the input `{"col":0}`.

Before: `struct<col:bigint>`
After: `STRUCT<`col`: BIGINT>`

### How was this patch tested?
By existing test suites `JsonFunctionsSuite` and `JsonExpressionsSuite`.

Closes #30172 from MaxGekk/schema_of_json-sql-schema.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-10-29 10:30:41 +09:00
Jungtaek Lim (HeartSaVioR) a744fea3be [SPARK-33267][SQL] Fix NPE issue on 'In' filter when one of values contains null
### What changes were proposed in this pull request?

This PR proposes to fix the NPE issue on the `In` filter when one of the values contains null. In a real case, you can trigger this issue when you try to push down a filter with `in (..., null)` against a V2 source table. `DataSourceStrategy` caches the mapping (filter instance -> expression) in a HashMap, which leverages the hash code of the key, hence it could trigger the NPE issue.
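
A self-contained sketch of the null-safety point (simplified; not the actual `org.apache.spark.sql.sources.In` implementation): the hash computation must not call `hashCode` on null elements.

```scala
// A naive version calling v.hashCode() unconditionally would throw an NPE for null values.
class InFilter(attribute: String, values: Seq[Any]) {
  override def hashCode(): Int =
    values.foldLeft(attribute.hashCode) { (h, v) =>
      31 * h + (if (v == null) 0 else v.hashCode)
    }
}

val f = new InFilter("col", Seq("a", null, "b"))
println(f.hashCode()) // works even with a null element
```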

### Why are the changes needed?

This is an obvious bug, as the `In` filter doesn't take null values into account when calculating its hash code.

### Does this PR introduce _any_ user-facing change?

Yes. Previously, a query having `null` in an "in" condition against a data source V2 table supporting filter pushdown failed with an NPE, whereas after the PR the query will not fail.

### How was this patch tested?

UT added. The new UT fails without the PR and passes with the PR.

Closes #30170 from HeartSaVioR/SPARK-33267.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-10-28 10:00:29 -07:00
zky.zhoukeyong b26ae98407 [SPARK-33208][SQL] Update the document of SparkSession#sql
Change-Id: I82db1f9e8f667573aa3a03e05152cbed0ea7686b

### What changes were proposed in this pull request?
Update the documentation of SparkSession#sql to mention that this API eagerly runs DDL/DML commands, but not SELECT queries.

### Why are the changes needed?
To clarify the behavior of SparkSession#sql.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Not needed.

Closes #30168 from waitinfuture/master.

Authored-by: zky.zhoukeyong <zky.zhoukeyong@alibaba-inc.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-10-28 13:17:28 +00:00
gengjiaan 3c3ad5f7c0 [SPARK-32934][SQL] Improve the performance for NTH_VALUE and reactor the OffsetWindowFunction
### What changes were proposed in this pull request?
Spark SQL supports some window functions like `NTH_VALUE`.
If we specify a window frame like `UNBOUNDED PRECEDING AND CURRENT ROW` or `UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING`, we can eliminate some calculations.
For example: if we execute the SQL show below:
```
SELECT NTH_VALUE(col,
         2) OVER(ORDER BY rank UNBOUNDED PRECEDING
        AND CURRENT ROW)
FROM tab;
```
For row numbers greater than 1, the output is a fixed value; otherwise, it is null. So we just calculate the value once and check whether the row number is less than 2.
`UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING` is simpler.

### Why are the changes needed?
Improve the performance for `NTH_VALUE`, `FIRST_VALUE` and `LAST_VALUE`.

### Does this PR introduce _any_ user-facing change?
 'No'.

### How was this patch tested?
Jenkins test.

Closes #29800 from beliefer/optimize-nth_value.

Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: beliefer <beliefer@163.com>
Co-authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-10-28 06:40:23 +00:00
allisonwang-db 9fb45361fd [SPARK-33183][SQL] Fix Optimizer rule EliminateSorts and add a physical rule to remove redundant sorts
### What changes were proposed in this pull request?
This PR aims to fix a correctness bug in the optimizer rule `EliminateSorts`. It also adds a new physical rule to remove redundant sorts that cannot be eliminated in the Optimizer rule after the bugfix.

### Why are the changes needed?
A global sort should not be eliminated even if its child is ordered since we don't know if its child ordering is global or local. For example, in the following scenario, the first sort shouldn't be removed because it has a stronger guarantee than the second sort even if the sort orders are the same for both sorts.

```
Sort(orders, global = True, ...)
  Sort(orders, global = False, ...)
```

Since there is no straightforward way to identify whether a node's output ordering is local or global, we should not remove a global sort even if its child is already ordered.

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
Unit tests

Closes #30093 from allisonwang-db/fix-sort.

Authored-by: allisonwang-db <66282705+allisonwang-db@users.noreply.github.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-10-28 05:51:47 +00:00
Terry Kim 528160f001 [SPARK-33174][SQL] Migrate DROP TABLE to use UnresolvedTableOrView to resolve the identifier
### What changes were proposed in this pull request?

This PR proposes to migrate `DROP TABLE` to use `UnresolvedTableOrView` to resolve the table/view identifier. This allows consistent resolution rules (temp view first, etc.) to be applied for both v1/v2 commands. More info about the consistent resolution rule proposal can be found in [JIRA](https://issues.apache.org/jira/browse/SPARK-29900) or [proposal doc](https://docs.google.com/document/d/1hvLjGA8y_W_hhilpngXVub1Ebv8RsMap986nENCFnrg/edit?usp=sharing).

### Why are the changes needed?

The current behavior is not consistent between v1 and v2 commands when resolving a temp view.
In v2, the `t` in the following example is resolved to a table:
```scala
sql("CREATE TABLE testcat.ns.t (id bigint) USING foo")
sql("CREATE TEMPORARY VIEW t AS SELECT 2")
sql("USE testcat.ns")
sql("DROP TABLE t") // 't' is resolved to testcat.ns.t
```
whereas in v1, the `t` is resolved to a temp view:
```scala
sql("CREATE DATABASE test")
sql("CREATE TABLE spark_catalog.test.t (id bigint) USING csv")
sql("CREATE TEMPORARY VIEW t AS SELECT 2")
sql("USE spark_catalog.test")
sql("DROP TABLE t") // 't' is resolved to a temp view
```

### Does this PR introduce _any_ user-facing change?

After this PR, for v2, `DROP TABLE t` is resolved to a temp view `t` instead of `testcat.ns.t`, consistent with v1 behavior.

### How was this patch tested?

Added a new test

Closes #30079 from imback82/drop_table_consistent.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-10-28 05:44:55 +00:00
Jungtaek Lim (HeartSaVioR) fcf8aa59b5 [SPARK-33240][SQL] Fail fast when fails to instantiate configured v2 session catalog
### What changes were proposed in this pull request?

This patch proposes to change the behavior on failing fast when Spark fails to instantiate configured v2 session catalog.

### Why are the changes needed?

The Spark behavior is against the intention of the end users - if end users configure a session catalog which Spark fails to initialize, Spark swallows the error, only logging the error message, and silently uses the default catalog implementation.

This follows the voices on [discussion thread](https://lists.apache.org/thread.html/rdfa22a5ebdc4ac66e2c5c8ff0cd9d750e8a1690cd6fb456d119c2400%40%3Cdev.spark.apache.org%3E) in dev mailing list.

### Does this PR introduce _any_ user-facing change?

Yes. After the PR Spark will fail immediately if Spark fails to instantiate configured session catalog.

### How was this patch tested?

New UT added.

Closes #30147 from HeartSaVioR/SPARK-33240.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-10-28 03:31:11 +00:00
Ankur Dave 3f2a2b5fe6 [SPARK-33260][SQL] Fix incorrect results from SortExec when sortOrder is Stream
### What changes were proposed in this pull request?

The following query produces incorrect results. The query has two essential features: (1) it contains a string aggregate, resulting in a `SortExec` node, and (2) it contains a duplicate grouping key, causing `RemoveRepetitionFromGroupExpressions` to produce a sort order stored as a `Stream`.

```sql
SELECT bigint_col_1, bigint_col_9, MAX(CAST(bigint_col_1 AS string))
FROM table_4
GROUP BY bigint_col_1, bigint_col_9, bigint_col_9
```

When the sort order is stored as a `Stream`, the line `ordering.map(_.child.genCode(ctx))` in `GenerateOrdering#createOrderKeys()` produces unpredictable side effects to `ctx`. This is because `genCode(ctx)` modifies `ctx`. When ordering is a `Stream`, the modifications will not happen immediately as intended, but will instead occur lazily when the returned `Stream` is used later.
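
A tiny, self-contained demonstration of the failure mode (illustrative names; assuming Scala 2.12 `Stream` semantics): mapping over a `Stream` evaluates only the head eagerly, so the side effects for the remaining elements happen later, when the result is traversed.

```scala
var registered = List.empty[String]            // stands in for the mutations applied to `ctx`
def genCode(name: String): String = { registered :+= name; s"code($name)" }

val ordering = Stream("a", "b", "c")
val codes = ordering.map(genCode)              // only the head is mapped eagerly
println(registered)                            // List(a) -- "b" and "c" are not registered yet
codes.toList                                   // forces the rest of the Stream
println(registered)                            // List(a, b, c)
```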

Similar bugs have occurred at least three times in the past: https://issues.apache.org/jira/browse/SPARK-24500, https://issues.apache.org/jira/browse/SPARK-25767, https://issues.apache.org/jira/browse/SPARK-26680.

The fix is to check if `ordering` is a `Stream` and force the modifications to happen immediately if so.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added a unit test for `SortExec` where `sortOrder` is a `Stream`. The test previously failed and now passes.

Closes #30160 from ankurdave/SPARK-33260.

Authored-by: Ankur Dave <ankurdave@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-10-27 13:20:22 -07:00
Huaxin Gao f284218dae [SPARK-33137][SQL] Support ALTER TABLE in JDBC v2 Table Catalog: update type and nullability of columns (Postgres dialect)
### What changes were proposed in this pull request?
Override the default SQL strings in Postgres Dialect for:

- ALTER TABLE UPDATE COLUMN TYPE
- ALTER TABLE UPDATE COLUMN NULLABILITY

Add new docker integration test suite `jdbc/v2/PostgreSQLIntegrationSuite.scala`

### Why are the changes needed?
Supports Postgres-specific ALTER TABLE syntax.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Add new test `PostgreSQLIntegrationSuite`

Closes #30089 from huaxingao/postgres_docker.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-10-27 15:04:53 +00:00
xuewei.linxuewei 537a49fc09 [SPARK-33140][SQL] remove SQLConf and SparkSession in all sub-class of Rule[QueryPlan]
### What changes were proposed in this pull request?

Since issue [SPARK-33139](https://issues.apache.org/jira/browse/SPARK-33139) has been done, SQLConf.get and SparkSession.active are more reliable. We are trying to refine the existing code that passes SQLConf and SparkSession into sub-classes of Rule[QueryPlan].

In this PR.

* remove SQLConf from the constructor parameters of all sub-classes of Rule[QueryPlan].
* use SQLConf.get to replace the original SQLConf instance.
* remove SparkSession from the constructor parameters of all sub-classes of Rule[QueryPlan].
* use SparkSession.active to replace the original SparkSession instance.

### Why are the changes needed?

Code refine.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?

Existing test

Closes #30097 from leanken/leanken-SPARK-33140.

Authored-by: xuewei.linxuewei <xuewei.linxuewei@alibaba-inc.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-10-27 12:40:57 +00:00
angerszhu e43cd8ccef [SPARK-32388][SQL] TRANSFORM with schema-less mode should keep the same with hive
### What changes were proposed in this pull request?
In the current Spark script transformation with Hive serde mode, in the schema-less case, the result differs from Hive's.
This PR keeps the result the same as Hive's script transform serde.

#### Hive Script Transform with serde in schema-less mode
```
hive> create table t (c0 int, c1 int, c2 int);
hive> INSERT INTO t VALUES (1, 1, 1);
hive> INSERT INTO t VALUES (2, 2, 2);
hive> CREATE VIEW v AS SELECT TRANSFORM(c0, c1, c2) USING 'cat' FROM t;

hive> DESCRIBE v;
key                 	string
value               	string

hive> SELECT * FROM v;
1	1	1
2	2	2

hive> SELECT key FROM v;
1
2

hive> SELECT value FROM v;
1	1
2	2
```

#### Spark script transform with hive serde in schema less.
```
hive> create table t (c0 int, c1 int, c2 int);
hive> INSERT INTO t VALUES (1, 1, 1);
hive> INSERT INTO t VALUES (2, 2, 2);
hive> CREATE VIEW v AS SELECT TRANSFORM(c0, c1, c2) USING 'cat' FROM t;

hive> SELECT * FROM v;
1   1
2   2
```

**No serde mode in hive (ROW FORMATTED DELIMITED)**
![image](https://user-images.githubusercontent.com/46485123/90088770-55841e00-dd52-11ea-92dd-7fe52d93f0b3.png)

### Why are the changes needed?
Keep same behavior with hive script transform

### Does this PR introduce _any_ user-facing change?
Before this pr with hive serde script transform
```
select transform(*)
USING 'cat'
from (
select 1, 2, 3, 4
) tmp

key     value
1         2
```
After
```
select transform(*)
USING 'cat'
from (
select 1, 2, 3, 4
) tmp

key     value
1         2   3  4
```
### How was this patch tested?
UT

Closes #29421 from AngersZhuuuu/SPARK-32388.

Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-10-27 09:25:53 +09:00
Steve Loughran 02fa19f102 [SPARK-33230][SQL] Hadoop committers to get unique job ID in "spark.sql.sources.writeJobUUID"
### What changes were proposed in this pull request?

This reinstates the old option `spark.sql.sources.write.jobUUID` to set a unique jobId in the jobconf so that Hadoop MR committers have a unique ID which is (a) consistent across tasks and workers and (b) not brittle compared to generated-timestamp job IDs. The latter matches what JobID requires, but as they are generated per-thread, they may not always be unique within a cluster.

### Why are the changes needed?

If a committer (e.g. the s3a staging committer) uses the job-attempt ID as a unique ID, then any two jobs started within the same second have the same ID and can clash.

### Does this PR introduce _any_ user-facing change?

Good Q. It is "developer-facing" in the context of anyone writing a committer. But it reinstates a property which was in Spark 1.x and "went away"

### How was this patch tested?

Testing: no test here. You'd have to create a new committer which extracted the value in both job and task(s) and verified consistency. That is possible (with a task output whose records contained the UUID), but it would be pretty convoluted and a high maintenance cost.

Because it's trying to address a race condition, it's hard to regenerate the problem downstream and so verify a fix in a test run...I'll just look at the logs to see what temporary dir is being used in the cluster FS and verify it's a UUID

Closes #30141 from steveloughran/SPARK-33230-jobId.

Authored-by: Steve Loughran <stevel@cloudera.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-10-26 12:31:05 -07:00
Cheng Su 1042d49bf9 [SPARK-33075][SQL] Enable auto bucketed scan by default (disable only for cached query)
### What changes were proposed in this pull request?

This PR enables auto bucketed table scan by default, with the exception of disabling it only for cached queries (similar to AQE). The reason for disabling auto scan for cached queries is that the cached query's output partitioning can be leveraged later to avoid shuffle and sort when doing joins and aggregates.

### Why are the changes needed?

Enabling auto bucketed table scan by default is useful as it can optimize queries automatically under the hood, without user interaction.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added unit test for cached query in `DisableUnnecessaryBucketedScanSuite.scala`. Also change a bunch of unit tests which should disable auto bucketed scan to make them work.

Closes #30138 from c21/enable-auto-bucket.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-10-26 20:23:24 +09:00
Cheng Su d87a0bb2ca [SPARK-32862][SS] Left semi stream-stream join
### What changes were proposed in this pull request?

This is to support left semi join in stream-stream join. The implementation of left semi join is (mostly in `StreamingSymmetricHashJoinExec` and `SymmetricHashJoinStateManager`):
* For left side input row, check if there's a match on right side state store.
  * if there's a match, output the left side row, but do not put the row in left side state store (no need to put in state store).
  * if there's no match, output nothing, but put the row in left side state store (with "matched" field to set to false in state store).
* For right side input row, check if there's a match on left side state store.
  * For all matched left rows in state store, output the rows with "matched" field as false. Set all left rows with "matched" field to be true. Only output the left side rows matched for the first time to guarantee left semi join semantics.
* State store eviction: evict rows from left/right side state store below watermark, same as inner join.

Note that a followup optimization could be to evict matched left-side rows from the state store earlier, even when the rows are still above the watermark. However, this needs more changes in `SymmetricHashJoinStateManager`, so we will leave it as a followup.
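
A simplified batch-style sketch of the state bookkeeping described above (hypothetical types; not the streaming operator itself): each left row is emitted at most once, either immediately on a match or later when the first matching right row arrives.

```scala
case class LeftState(row: String, var matched: Boolean)

val leftState = scala.collection.mutable.Map.empty[Int, LeftState]
val output = scala.collection.mutable.Buffer.empty[String]

// Left side input: emit on match, otherwise remember the row in the left state store.
def onLeft(key: Int, row: String, rightKeys: Set[Int]): Unit =
  if (rightKeys.contains(key)) output += row
  else leftState(key) = LeftState(row, matched = false)

// Right side input: emit stored left rows that have not matched before, then mark them.
def onRight(key: Int): Unit =
  leftState.get(key).filterNot(_.matched).foreach { st =>
    output += st.row
    st.matched = true
  }
```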

### Why are the changes needed?

The current stream-stream join supports inner, left outer and right outer joins (https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingSymmetricHashJoinExec.scala#L166 ). Internally we do see a lot of users using left semi stream-stream joins (outside of Spark Structured Streaming), e.g. I want to get the ad impressions (join left side) which have a click (join right side), but I don't care how many clicks there are per ad (left semi semantics).

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added unit tests in `UnsupportedOperationChecker.scala` and `StreamingJoinSuite.scala`.

Closes #30076 from c21/stream-join.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
2020-10-26 13:33:06 +09:00
HyukjinKwon 369cc614f3 Revert "[SPARK-32388][SQL] TRANSFORM with schema-less mode should keep the same with hive"
This reverts commit 56ab60fb7a.
2020-10-26 11:38:48 +09:00
angerszhu 56ab60fb7a [SPARK-32388][SQL] TRANSFORM with schema-less mode should keep the same with hive
### What changes were proposed in this pull request?
In the current Spark script transformation with Hive serde mode, in the schema-less case, the result differs from Hive's.
This PR keeps the result the same as Hive's script transform serde.

#### Hive Script Transform with serde in schema-less mode
```
hive> create table t (c0 int, c1 int, c2 int);
hive> INSERT INTO t VALUES (1, 1, 1);
hive> INSERT INTO t VALUES (2, 2, 2);
hive> CREATE VIEW v AS SELECT TRANSFORM(c0, c1, c2) USING 'cat' FROM t;

hive> DESCRIBE v;
key                 	string
value               	string

hive> SELECT * FROM v;
1	1	1
2	2	2

hive> SELECT key FROM v;
1
2

hive> SELECT value FROM v;
1	1
2	2
```

#### Spark script transform with hive serde in schema less.
```
hive> create table t (c0 int, c1 int, c2 int);
hive> INSERT INTO t VALUES (1, 1, 1);
hive> INSERT INTO t VALUES (2, 2, 2);
hive> CREATE VIEW v AS SELECT TRANSFORM(c0, c1, c2) USING 'cat' FROM t;

hive> SELECT * FROM v;
1   1
2   2
```

**No serde mode in hive (ROW FORMATTED DELIMITED)**
![image](https://user-images.githubusercontent.com/46485123/90088770-55841e00-dd52-11ea-92dd-7fe52d93f0b3.png)

### Why are the changes needed?
Keep same behavior with hive script transform

### Does this PR introduce _any_ user-facing change?
Before this pr with hive serde script transform
```
select transform(*)
USING 'cat'
from (
select 1, 2, 3, 4
) tmp

key     value
1         2
```
After
```
select transform(*)
USING 'cat'
from (
select 1, 2, 3, 4
) tmp

key     value
1         2   3  4
```
### How was this patch tested?
UT

Closes #29421 from AngersZhuuuu/SPARK-32388.

Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-10-26 11:20:29 +09:00
Takeshi Yamamuro 87b498462b [SPARK-33228][SQL] Don't uncache data when replacing a view having the same logical plan
### What changes were proposed in this pull request?

SPARK-30494 updated the `CreateViewCommand` code to implicitly drop the cache when replacing an existing view. But this change drops the cache even when replacing a view having the same logical plan. A sequence of queries to reproduce this is as follows:
```
// Spark v2.4.6+
scala> val df = spark.range(1).selectExpr("id a", "id b")
scala> df.cache()
scala> df.explain()
== Physical Plan ==
*(1) ColumnarToRow
+- InMemoryTableScan [a#2L, b#3L]
      +- InMemoryRelation [a#2L, b#3L], StorageLevel(disk, memory, deserialized, 1 replicas)
            +- *(1) Project [id#0L AS a#2L, id#0L AS b#3L]
               +- *(1) Range (0, 1, step=1, splits=4)

scala> df.createOrReplaceTempView("t")
scala> sql("select * from t").explain()
== Physical Plan ==
*(1) ColumnarToRow
+- InMemoryTableScan [a#2L, b#3L]
      +- InMemoryRelation [a#2L, b#3L], StorageLevel(disk, memory, deserialized, 1 replicas)
            +- *(1) Project [id#0L AS a#2L, id#0L AS b#3L]
               +- *(1) Range (0, 1, step=1, splits=4)

// If one re-runs the same query `df.createOrReplaceTempView("t")`, the cache's swept away
scala> df.createOrReplaceTempView("t")
scala> sql("select * from t").explain()
== Physical Plan ==
*(1) Project [id#0L AS a#2L, id#0L AS b#3L]
+- *(1) Range (0, 1, step=1, splits=4)

// Until v2.4.6
scala> val df = spark.range(1).selectExpr("id a", "id b")
scala> df.cache()
scala> df.createOrReplaceTempView("t")
scala> sql("select * from t").explain()
20/10/23 22:33:42 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
== Physical Plan ==
*(1) InMemoryTableScan [a#2L, b#3L]
   +- InMemoryRelation [a#2L, b#3L], StorageLevel(disk, memory, deserialized, 1 replicas)
         +- *(1) Project [id#0L AS a#2L, id#0L AS b#3L]
            +- *(1) Range (0, 1, step=1, splits=4)

scala> df.createOrReplaceTempView("t")
scala> sql("select * from t").explain()
== Physical Plan ==
*(1) InMemoryTableScan [a#2L, b#3L]
   +- InMemoryRelation [a#2L, b#3L], StorageLevel(disk, memory, deserialized, 1 replicas)
         +- *(1) Project [id#0L AS a#2L, id#0L AS b#3L]
            +- *(1) Range (0, 1, step=1, splits=4)
```

### Why are the changes needed?

bugfix.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added tests.

Closes #30140 from maropu/FixBugInReplaceView.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-10-25 16:15:55 -07:00
Jungtaek Lim (HeartSaVioR) 0c66a88d1d [SPARK-29438][SS][FOLLOWUP] Add regression tests for Streaming Aggregation and flatMapGroupsWithState
### What changes were proposed in this pull request?

This patch adds new UTs to prevent SPARK-29438 from recurring for streaming aggregation as well as flatMapGroupsWithState, as we agreed in the review comment quoted here:

https://github.com/apache/spark/pull/26162#issuecomment-576929692

> LGTM for this PR. But on a additional note, this is a very subtle and easy-to-make bug with TaskContext.getPartitionId. I wonder if this bug is present in any other stateful operation. I wonder if this bug is present in any other stateful operation. Can you please verify how partitionId is used in the other stateful operations?

For now they're not broken, but it's even better to have UTs to prevent this case in the future.

### Why are the changes needed?

New UTs will prevent streaming aggregation and flatMapGroupsWithState from being broken in the future when placed on the right side of a UNION while the number of partitions changes on the left side of the UNION. Please refer to SPARK-29438 for more details.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Added UTs.

Closes #27333 from HeartSaVioR/SPARK-29438-add-regression-test.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2020-10-24 15:36:41 -07:00
Kent Yao e21bb710e5 [SPARK-32991][SQL] Use conf in shared state as the original configuraion for RESET
### What changes were proposed in this pull request?

####  case

The case here covers the behavior of static and dynamic SQL configs in `sharedState` and `sessionState`, and the specially handled config `spark.sql.warehouse.dir`.
the case can be found here - https://github.com/yaooqinn/sugar/blob/master/src/main/scala/com/netease/mammut/spark/training/sql/WarehouseSCBeforeSS.scala

```scala

import java.lang.reflect.Field

import org.apache.spark.sql.SparkSession
import org.apache.spark.{SparkConf, SparkContext}

object WarehouseSCBeforeSS extends App {

  val wh = "spark.sql.warehouse.dir"
  val td = "spark.sql.globalTempDatabase"
  val custom = "spark.sql.custom"

  val conf = new SparkConf()
    .setMaster("local")
    .setAppName("SPARK-32991")
    .set(wh, "./data1")
    .set(td, "bob")

  val sc = new SparkContext(conf)

  val spark = SparkSession.builder()
    .config(wh, "./data2")
    .config(td, "alice")
    .config(custom, "kyao")
    .getOrCreate()

  val confField: Field = spark.sharedState.getClass.getDeclaredField("conf")
  confField.setAccessible(true)
  private val shared: SparkConf = confField.get(spark.sharedState).asInstanceOf[SparkConf]
  println()
  println(s"=====> SharedState: $wh=${shared.get(wh)}")
  println(s"=====> SharedState: $td=${shared.get(td)}")
  println(s"=====> SharedState: $custom=${shared.get(custom, "")}")

  println(s"=====> SessionState: $wh=${spark.conf.get(wh)}")
  println(s"=====> SessionState: $td=${spark.conf.get(td)}")
  println(s"=====> SessionState: $custom=${spark.conf.get(custom, "")}")

  val spark2 = SparkSession.builder().config(td, "fred").getOrCreate()

  println(s"=====> SessionState 2: $wh=${spark2.conf.get(wh)}")
  println(s"=====> SessionState 2: $td=${spark2.conf.get(td)}")
  println(s"=====> SessionState 2: $custom=${spark2.conf.get(custom, "")}")

  SparkSession.setActiveSession(spark)
  spark.sql("RESET")

  println(s"=====> SessionState RESET: $wh=${spark.conf.get(wh)}")
  println(s"=====> SessionState RESET: $td=${spark.conf.get(td)}")
  println(s"=====> SessionState RESET: $custom=${spark.conf.get(custom, "")}")

  val spark3 = SparkSession.builder().getOrCreate()

  println(s"=====> SessionState 3: $wh=${spark2.conf.get(wh)}")
  println(s"=====> SessionState 3: $td=${spark2.conf.get(td)}")
  println(s"=====> SessionState 3: $custom=${spark2.conf.get(custom, "")}")
}
```

#### outputs and analysis
```
// 1. Make the cloned spark conf in shared state respect the warehouse dir from the 1st SparkSession
//=====> SharedState: spark.sql.warehouse.dir=./data1
// 2. 
//=====> SharedState: spark.sql.globalTempDatabase=alice
//=====> SharedState: spark.sql.custom=kyao
//=====> SessionState: spark.sql.warehouse.dir=./data2
//=====> SessionState: spark.sql.globalTempDatabase=alice
//=====> SessionState: spark.sql.custom=kyao
//=====> SessionState 2: spark.sql.warehouse.dir=./data2
//=====> SessionState 2: spark.sql.globalTempDatabase=alice
//=====> SessionState 2: spark.sql.custom=kyao
// 2'.🔼 OK until here
// 3. Make the below 3 ones respect the cloned spark conf in shared state with issue 1 fixed
//=====> SessionState RESET: spark.sql.warehouse.dir=./data1
//=====> SessionState RESET: spark.sql.globalTempDatabase=bob
//=====> SessionState RESET: spark.sql.custom=
// 4. Then the SparkSessions created after RESET will be corrected.
//=====> SessionState 3: spark.sql.warehouse.dir=./data1
//=====> SessionState 3: spark.sql.globalTempDatabase=bob
//=====> SessionState 3: spark.sql.custom=
```

In this PR, we gather all valid configs into the cloned conf of `sharedState` while it is being constructed (actually only `spark.sql.warehouse.dir` was missing). Then we use this conf as the defaults for the `RESET` command.

`SparkSession.clearActiveSession/clearDefaultSession` will make the shared state invisible and unsharable. They will be internal only soon (confirmed with Wenchen), so cases with them called will not be a problem.

### Why are the changes needed?

Bugfix for the programming API calling RESET when users create a SparkContext first and configure the SparkSession later.

### Does this PR introduce _any_ user-facing change?

Yes. Before this change, when you use the programming API and call RESET, all configs are reset to SparkContext.conf; now they are reset to SparkSession.sharedState.conf.

### How was this patch tested?

new tests

Closes #30045 from yaooqinn/SPARK-32991.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-10-23 05:52:38 +00:00
Max Gekk a03d77d326 [SPARK-33160][SQL][FOLLOWUP] Replace the parquet metadata key org.apache.spark.int96NoRebase by org.apache.spark.legacyINT96
### What changes were proposed in this pull request?
1. Replace the metadata key `org.apache.spark.int96NoRebase` by `org.apache.spark.legacyINT96`.
2. Change the condition when new key should be saved to parquet metadata: it should be saved when the SQL config `spark.sql.legacy.parquet.int96RebaseModeInWrite` is set to `LEGACY`.
3. Change the handling of the metadata key on read:
    - If the key is not present in the parquet metadata, take the rebase mode from the SQL config: `spark.sql.legacy.parquet.int96RebaseModeInRead`
    - If parquet files were saved by Spark < 3.1.0, use the `LEGACY` rebasing mode for the INT96 type.
    - For files written by Spark >= 3.1.0, if `org.apache.spark.legacyINT96` is present in the metadata, perform rebasing; otherwise don't.
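
A simplified sketch of this read-side decision (hypothetical names and a simplified version check, just to make the branching concrete):

```scala
sealed trait RebaseMode
case object Legacy extends RebaseMode      // rebase INT96 timestamps on read
case object Corrected extends RebaseMode   // read timestamps as-is

// writerSparkVersion and legacyInt96KeyPresent would come from the parquet footer metadata.
def int96RebaseMode(writerSparkVersion: Option[String],
                    legacyInt96KeyPresent: Boolean,
                    modeFromConf: RebaseMode): RebaseMode =
  writerSparkVersion match {
    case None => modeFromConf                                            // unknown writer: follow the SQL config
    case Some(v) if v.startsWith("2.") || v.startsWith("3.0") => Legacy  // written by Spark < 3.1.0
    case _ => if (legacyInt96KeyPresent) Legacy else Corrected
  }
```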

### Why are the changes needed?
- To not increase parquet size by default when `spark.sql.legacy.parquet.int96RebaseModeInWrite` is `EXCEPTION` after https://github.com/apache/spark/pull/30121.
- To have the implementation similar to `org.apache.spark.legacyDateTime`
- To minimise impact on other subsystems that are based on file sizes like gathering statistics.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Modified test in `ParquetIOSuite`

Closes #30132 from MaxGekk/int96-flip-metadata-rebase-key.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-10-22 15:57:03 +00:00
yangjie01 b38f3a5557 [SPARK-32978][SQL] Make sure the number of dynamic part metric is correct
### What changes were proposed in this pull request?

The purpose of this pr is to resolve SPARK-32978.

The main reason for the bad case described in SPARK-32978 is that `BasicWriteTaskStatsTracker` directly reports the number of newly added partitions for each task, which makes it impossible to remove duplicates on the driver side.

The main change of this PR is to report partitionValues to the driver and remove duplicates on the driver side, to make sure the number reported by the dynamic part metric is correct.
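
A rough sketch of the de-duplication idea, with illustrative (non-Spark) names:

```scala
// Each task reports the partition values it actually wrote; the driver
// unions them into a Set so every dynamic partition is counted exactly once.
case class TaskWriteStats(partitionValues: Seq[String])

def numDynamicParts(allTaskStats: Seq[TaskWriteStats]): Int =
  allTaskStats.flatMap(_.partitionValues).toSet.size

// Summing per-task counts instead would double-count a partition that is
// written by more than one task:
//   allTaskStats.map(_.partitionValues.size).sum
```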

### Why are the changes needed?
The number reported by the dynamic part metric on the UI should be correct.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added a new test case based on the scenario described in SPARK-32978.

Closes #30026 from LuciferYang/SPARK-32978.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-10-22 14:01:07 +00:00
Prashant Sharma 8cae7f88b0 [SPARK-33095][SQL] Support ALTER TABLE in JDBC v2 Table Catalog: add, update type and nullability of columns (MySQL dialect)
### What changes were proposed in this pull request?

Override the default SQL strings for:
ALTER TABLE UPDATE COLUMN TYPE
ALTER TABLE UPDATE COLUMN NULLABILITY
in the MySQL JDBC dialect, according to the official documentation (see the sketch below).
Write MySQL integration tests for JDBC.
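
For the ALTER TABLE strings above, a sketch of what the MySQL-flavored SQL could look like. The method names here are hypothetical; the actual override lives in Spark's MySQLDialect, and MySQL folds both type and nullability into `MODIFY COLUMN`:

```scala
// Hedged sketch of MySQL-style ALTER TABLE strings; the real dialect hooks
// in Spark may have different names and signatures.
object MySqlAlterSketch {
  // e.g. ALTER TABLE tbl MODIFY COLUMN c1 BIGINT
  def updateColumnType(table: String, column: String, newType: String): String =
    s"ALTER TABLE $table MODIFY COLUMN $column $newType"

  // MySQL has no standalone SET/DROP NOT NULL; nullability is restated with the type.
  // e.g. ALTER TABLE tbl MODIFY COLUMN c1 INT NOT NULL
  def updateColumnNullability(table: String, column: String, colType: String, nullable: Boolean): String = {
    val nullClause = if (nullable) "NULL" else "NOT NULL"
    s"ALTER TABLE $table MODIFY COLUMN $column $colType $nullClause"
  }
}
```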

### Why are the changes needed?
Improves code coverage and adds MySQL dialect support for JDBC.

### Does this PR introduce _any_ user-facing change?

Yes, Support ALTER TABLE in JDBC v2 Table Catalog: add, update type and nullability of columns (MySQL dialect)

### How was this patch tested?

Added tests.

Closes #30025 from ScrapCodes/mysql-dialect.

Authored-by: Prashant Sharma <prashsh1@in.ibm.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-10-22 13:51:42 +00:00
gengjiaan eb33bcb4b2 [SPARK-30796][SQL] Add parameter position for REGEXP_REPLACE
### What changes were proposed in this pull request?
`REGEXP_REPLACE` could replace all substrings of string that match regexp with replacement string.
But `REGEXP_REPLACE` lacks some flexibility, e.g. converting camel case strings into strings of lower case words separated by underscores:
AddressLine1 -> address_line_1
If we support the position parameter, we can do it like this (e.g. in Oracle):

```
WITH strings as (
  SELECT 'AddressLine1' s FROM dual union all
  SELECT 'ZipCode' s FROM dual union all
  SELECT 'Country' s FROM dual
)
  SELECT s "STRING",
         lower(regexp_replace(s, '([A-Z0-9])', '_\1', 2)) "MODIFIED_STRING"
  FROM strings;
```
The output:
```
  STRING               MODIFIED_STRING
-------------------- --------------------
AddressLine1         address_line_1
ZipCode              zip_code
Country              country
```

Several mainstream databases support this syntax.

**Oracle**
https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/REGEXP_REPLACE.html#GUID-EA80A33C-441A-4692-A959-273B5A224490

**Vertica**
https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/Functions/RegularExpressions/REGEXP_REPLACE.htm?zoom_highlight=regexp_replace

**Redshift**
https://docs.aws.amazon.com/redshift/latest/dg/REGEXP_REPLACE.html

### Why are the changes needed?
The parameter position for `REGEXP_REPLACE` is very useful.

### Does this PR introduce _any_ user-facing change?
'Yes'.

### How was this patch tested?
Jenkins test.

Closes #29891 from beliefer/add-position-for-regex_replace.

Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: beliefer <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-10-22 07:59:49 +00:00
Max Gekk ba13b94f6b [SPARK-33210][SQL] Set the rebasing mode for parquet INT96 type to EXCEPTION by default
### What changes were proposed in this pull request?
1. Set the default value for the SQL configs `spark.sql.legacy.parquet.int96RebaseModeInWrite` and `spark.sql.legacy.parquet.int96RebaseModeInRead` to `EXCEPTION`.
2. Update the SQL migration guide.

### Why are the changes needed?
The current default value `LEGACY` may lead to shifted timestamps on read or write. We should leave the decision about rebasing to the users.
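
For example, users who want to keep the previous behavior can set the rebase mode back explicitly (a sketch using the config names above, assuming an active SparkSession `spark`):

```scala
// Opt back into the old rebasing behavior for INT96 timestamps.
spark.conf.set("spark.sql.legacy.parquet.int96RebaseModeInWrite", "LEGACY")
spark.conf.set("spark.sql.legacy.parquet.int96RebaseModeInRead", "LEGACY")
```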

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
By existing test suites like `ParquetIOSuite`.

Closes #30121 from MaxGekk/int96-exception-by-default.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-10-22 03:04:29 +00:00
Max Gekk bbf2d6f6df [SPARK-33160][SQL][FOLLOWUP] Update benchmarks of INT96 type rebasing
### What changes were proposed in this pull request?
1. Turn off/on the SQL config `spark.sql.legacy.parquet.int96RebaseModeInWrite` which was added by https://github.com/apache/spark/pull/30056 in `DateTimeRebaseBenchmark`. The parquet readers should infer correct rebasing mode automatically from metadata.
2. Regenerate benchmark results of `DateTimeRebaseBenchmark` in the environment:

| Item | Description |
| ---- | ----|
| Region | us-west-2 (Oregon) |
| Instance | r3.xlarge (spot instance) |
| AMI | ami-06f2f779464715dc5 (ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1) |
| Java | OpenJDK8/11 installed by`sudo add-apt-repository ppa:openjdk-r/ppa` & `sudo apt install openjdk-11-jdk`|

### Why are the changes needed?
To have up-to-date info about INT96 performance which is the default type for Catalyst's timestamp type.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By updating benchmark results:
```
$ SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.execution.benchmark.DateTimeRebaseBenchmark"
```

Closes #30118 from MaxGekk/int96-rebase-benchmark.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-10-22 10:03:41 +09:00
Gabor Somogyi fbb6843620 [SPARK-32229][SQL] Fix PostgresConnectionProvider and MSSQLConnectionProvider by accessing wrapped driver
### What changes were proposed in this pull request?
Postgres and MSSQL connection providers are not able to get a custom `appEntry` because under some circumstances the driver is wrapped with `DriverWrapper`. Such a case is not handled in the mentioned providers. In this PR I've added handling for this edge case by passing the unwrapped `Driver` from `JdbcUtils`.
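
A hedged sketch of the unwrapping step (it assumes `DriverWrapper` exposes its wrapped driver as a field; the exact field name and visibility in Spark may differ):

```scala
// If the registered JDBC driver is Spark's DriverWrapper, hand the wrapped
// driver to the connection provider so it can match on the real
// Postgres/MSSQL driver class.
import java.sql.Driver
import org.apache.spark.sql.execution.datasources.jdbc.DriverWrapper

def unwrapDriver(driver: Driver): Driver = driver match {
  case wrapper: DriverWrapper => wrapper.wrapped // assumed accessor for the underlying driver
  case other                  => other
}
```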

### Why are the changes needed?
`DriverWrapper` is not considered.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing + additional unit tests.

Closes #30024 from gaborgsomogyi/SPARK-32229.

Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-10-20 15:14:38 +09:00
Max Gekk a44e008de3 [SPARK-33160][SQL] Allow saving/loading INT96 in parquet w/o rebasing
### What changes were proposed in this pull request?
1. Add the SQL config `spark.sql.legacy.parquet.int96RebaseModeInWrite` to control timestamps rebasing in saving them as INT96. It supports the same set of values as `spark.sql.legacy.parquet.datetimeRebaseModeInWrite` but the default value is `LEGACY` to preserve backward compatibility with Spark <= 3.0.
2. Write the metadata key `org.apache.spark.int96NoRebase` to parquet files if the files are saved while `spark.sql.legacy.parquet.int96RebaseModeInWrite` is not set to `LEGACY`.
3. Add the SQL config `spark.sql.legacy.parquet.int96RebaseModeInRead` to control loading INT96 timestamps when the parquet metadata doesn't have enough info (the `org.apache.spark.int96NoRebase` tag) about the parquet writer - i.e. whether INT96 was written by a Proleptic Gregorian system or some Julian one.
4. Modified Vectorized and Parquet-mr Readers to support loading/saving INT96 timestamps w/o rebasing depending on SQL config and the metadata tag:
    - **No rebasing** in testing when the SQL config `spark.test.forceNoRebase` is set to `true`
    - **No rebasing** if the parquet metadata contains the tag `org.apache.spark.int96NoRebase`. This is the case when parquet files are saved by Spark >= 3.1 with `spark.sql.legacy.parquet.datetimeRebaseModeInWrite` set to `CORRECTED`, or saved by other systems with the tag `org.apache.spark.int96NoRebase`.
    - **With rebasing** if parquet files were saved by Spark (any version) without the metadata tag `org.apache.spark.int96NoRebase`.
    - Rebasing depends on the SQL config `spark.sql.legacy.parquet.datetimeRebaseModeInRead` if there are no metadata tags `org.apache.spark.version` and `org.apache.spark.int96NoRebase`.

New SQL configs are added instead of re-using existing `spark.sql.legacy.parquet.datetimeRebaseModeInWrite` and `spark.sql.legacy.parquet.datetimeRebaseModeInRead` because of:
- To allow users to have different modes for INT96 and for TIMESTAMP_MICROS (MILLIS). For example, users might want to save INT96 as LEGACY but TIMESTAMP_MICROS as CORRECTED.
- To have different modes for INT96 and DATE in load (or in save).
- To be backward compatible with Spark 2.4. For now, `spark.sql.legacy.parquet.datetimeRebaseModeInWrite/Read` are set to `EXCEPTION` by default.

### Why are the changes needed?
1. The Parquet spec says that INT96 must be stored as Julian days (see https://github.com/apache/parquet-format/pull/49). This doesn't mean that a reader (or a writer) is based on the Julian calendar. So, rebasing from Proleptic Gregorian to the Julian calendar may not be needed.
2. Rebasing from/to the Julian calendar can lose information because dates in one calendar don't exist in the other. For example, 1582-10-04..1582-10-15 exist in the Proleptic Gregorian calendar but not in the hybrid calendar (Julian + Gregorian), and vice versa, the Julian date 1000-02-29 doesn't exist in the Proleptic Gregorian calendar. We should allow users to save timestamps without losing such dates (rebasing shifts such dates to the next valid date).
3. It would also make Spark compatible with other systems such as Impala and newer versions of Hive that write proleptic Gregorian based INT96 timestamps.

### Does this PR introduce _any_ user-facing change?
It can, when `spark.sql.legacy.parquet.int96RebaseModeInWrite` is set to a value other than the default `LEGACY`.

### How was this patch tested?
- Added a test to check the metadata key `org.apache.spark.int96NoRebase`
- By `ParquetIOSuite`

Closes #30056 from MaxGekk/parquet-rebase-int96.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-10-20 14:58:59 +09:00
Nan Zhu 35133901f7 [SPARK-32351][SQL] Show partially pushed down partition filters in explain()
### What changes were proposed in this pull request?

Currently, actual non-dynamic partition pruning is executed in the optimizer phase (PruneFileSourcePartitions) if an input relation has a catalog file index. The current code assumes the same partition filters are generated again in FileSourceStrategy and passed into FileSourceScanExec. FileSourceScanExec uses the partition filters when listing files, but these non-dynamic partition filters do nothing because unnecessary partitions are already pruned in advance, so the filters are mainly used for explain output in this case. If a WHERE clause has DNF-ed predicates, FileSourceStrategy cannot extract the same filters as PruneFileSourcePartitions does, and PartitionFilters is then not shown in the explain output.

This patch proposes to extract partition filters in FileSourceStrategy and HiveStrategy with `extractPredicatesWithinOutputSet` added in https://github.com/apache/spark/pull/29101/files#diff-6be42cfa3c62a7536b1eb1d6447c073c again; explain() will then show the partially pushed-down partition filters.

### Why are the changes needed?

without the patch, the explained plan is inconsistent with what is actually executed

<b>Without the change</b>, the explained plans of `"SELECT * FROM t WHERE p = '1' OR (p = '2' AND i = 1)"` for datasource and hive tables look like the following, respectively (missing the pushed-down partition filters):

```
== Physical Plan ==
*(1) Filter ((p#21 = 1) OR ((p#21 = 2) AND (i#20 = 1)))
+- *(1) ColumnarToRow
   +- FileScan parquet default.t[i#20,p#21] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/Users/nanzhu/code/spark/sql/hive/target/tmp/hive_execution_test_group/war..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<i:int>
```

```
   == Physical Plan ==
   *(1) Filter ((p#33 = 1) OR ((p#33 = 2) AND (i#32 = 1)))
   +- Scan hive default.t [i#32, p#33], HiveTableRelation [`default`.`t`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [i#32], Partition Cols: [p#33], Pruned Partitions: [(p=1), (p=2)]]
```

<b>With the change</b>, the plans look like the following (the actually executed partition filters are shown):

```
== Physical Plan ==
*(1) Filter ((p#21 = 1) OR ((p#21 = 2) AND (i#20 = 1)))
+- *(1) ColumnarToRow
   +- FileScan parquet default.t[i#20,p#21] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/Users/nanzhu/code/spark/sql/hive/target/tmp/hive_execution_test_group/war..., PartitionFilters: [((p#21 = 1) OR (p#21 = 2))], PushedFilters: [], ReadSchema: struct<i:int>
```

```
== Physical Plan ==
*(1) Filter ((p#37 = 1) OR ((p#37 = 2) AND (i#36 = 1)))
+- Scan hive default.t [i#36, p#37], HiveTableRelation [`default`.`t`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [i#36], Partition Cols: [p#37], Pruned Partitions: [(p=1), (p=2)]], [((p#37 = 1) OR (p#37 = 2))]
```

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?
Unit test.

Closes #29831 from CodingCat/SPARK-32351.

Lead-authored-by: Nan Zhu <nanzhu@uber.com>
Co-authored-by: Nan Zhu <CodingCat@users.noreply.github.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-10-20 11:13:16 +09:00
Max Gekk 26b13c70c3 [SPARK-33169][SQL][TESTS] Check propagation of datasource options to underlying file system for built-in file-based datasources
### What changes were proposed in this pull request?
1. Add the common trait `CommonFileDataSourceSuite` with tests that can be executed for all built-in file-based datasources.
2. Add a test to `CommonFileDataSourceSuite` to check that datasource options are propagated to underlying file systems as Hadoop configs.
3. Mix `CommonFileDataSourceSuite` into `AvroSuite`, `OrcSourceSuite`, `TextSuite`, `JsonSuite`, `CSVSuite` and `ParquetFileFormatSuite`.
4. Remove duplicated tests from `AvroSuite` and from `OrcSourceSuite`.

### Why are the changes needed?
To improve test coverage and test all built-in file-based datasources.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running the affected test suites.

Closes #30067 from MaxGekk/ds-options-common-test.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-10-19 17:47:49 +09:00
Liang-Chi Hsieh 3010e9044e [SPARK-33170][SQL] Add SQL config to control fast-fail behavior in FileFormatWriter
### What changes were proposed in this pull request?

This patch proposes to add a config to control the fast-fail behavior in FileFormatWriter, and sets it to false by default.

### Why are the changes needed?

In SPARK-29649, we catch `FileAlreadyExistsException` in `FileFormatWriter` and fail fast for the task set to prevent task retry.

Per the latest discussion, it is important to be able to keep the original behavior, which is to retry tasks even when `FileAlreadyExistsException` is thrown, because `FileAlreadyExistsException` could be recoverable in some cases.

We add a config to control this behavior, and by default it is set to false (no fast-fail).

### Does this PR introduce _any_ user-facing change?

Yes. By default, tasks in FileFormatWriter will retry even if `FileAlreadyExistsException` is thrown. This is the behavior before Spark 3.0. Users can enable the config to get the fast-fail behavior.

### How was this patch tested?

Unit test.

Closes #30073 from viirya/SPARK-33170.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-10-17 21:02:25 -07:00
Liang-Chi Hsieh e574fcd230 [SPARK-32376][SQL] Make unionByName null-filling behavior work with struct columns
### What changes were proposed in this pull request?

SPARK-29358 added support for `unionByName` to work when the two datasets didn't necessarily have the same schema, but it does not work with nested columns like structs. This patch adds the support to work with struct columns.

The behavior before this PR:

```scala
scala> val df1 = spark.range(1).selectExpr("id c0", "named_struct('c', id + 1, 'b', id + 2, 'a', id + 3) c1")
scala> val df2 = spark.range(1).selectExpr("id c0", "named_struct('c', id + 1, 'b', id + 2) c1")
scala> df1.unionByName(df2, true).printSchema
org.apache.spark.sql.AnalysisException: Union can only be performed on tables with the compatible column types. struct<c:bigint,b:bigint> <> struct<c:bigint,b:bigint,a:bigint> at the second column of the second table;;
'Union false, false
:- Project [id#0L AS c0#2L, named_struct(c, (id#0L + cast(1 as bigint)), b, (id#0L + cast(2 as bigint)), a, (id#0L + cast(3 as bigint))) AS c1#3]
:  +- Range (0, 1, step=1, splits=Some(12))
+- Project [c0#8L, c1#9]
   +- Project [id#6L AS c0#8L, named_struct(c, (id#6L + cast(1 as bigint)), b, (id#6L + cast(2 as bigint))) AS c1#9]
      +- Range (0, 1, step=1, splits=Some(12))
```

The behavior after this PR:

```scala
scala> df1.unionByName(df2, true).printSchema
root
 |-- c0: long (nullable = false)
 |-- c1: struct (nullable = false)
 |    |-- a: long (nullable = true)
 |    |-- b: long (nullable = false)
 |    |-- c: long (nullable = false)
scala> df1.unionByName(df2, true).show()
+---+-------------+
| c0|           c1|
+---+-------------+
|  0|    {3, 2, 1}|
|  0|{ null, 2, 1}|
+---+-------------+
```

### Why are the changes needed?

The `allowMissingColumns` of `unionByName` is a feature allowing merging different schema from two datasets when unioning them together. Nested column support makes the feature more general and flexible for usage.

### Does this PR introduce _any_ user-facing change?

Yes, after this change users can union two datasets with different schema with different structs.

### How was this patch tested?

Unit tests.

Closes #29587 from viirya/SPARK-32376.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2020-10-16 14:48:14 -07:00
Max Gekk acb79f52db [MINOR][SQL] Re-use binaryToSQLTimestamp() in ParquetRowConverter
### What changes were proposed in this pull request?
The function `binaryToSQLTimestamp()` is used by Parquet Vectorized reader. Parquet MR reader has similar code for de-serialization of INT96 timestamps. In this PR, I propose to de-duplicate code and re-use `binaryToSQLTimestamp()`.

### Why are the changes needed?
This should improve maintenance, and should allow to avoid errors while changing Vectorized and regular parquet readers.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By existing test suites, for instance `ParquetIOSuite`.

Closes #30069 from MaxGekk/int96-common-serde.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-10-16 14:27:27 -07:00
Dongjoon Hyun ab0bad9544 [SPARK-33171][INFRA] Mark ParquetV*FilterSuite/ParquetV*SchemaPruningSuite as ExtendedSQLTest
### What changes were proposed in this pull request?

This PR aims to mark ParquetV1FilterSuite and ParquetV2FilterSuite as `ExtendedSQLTest`.
- ParquetV1FilterSuite/ParquetV2FilterSuite
- ParquetV1SchemaPruningSuite/ParquetV2SchemaPruningSuite

### Why are the changes needed?

Currently, `sql - other tests` is the longest job. This PR will move the above tests to `sql - slow tests` job.

**BEFORE**
- https://github.com/apache/spark/runs/1264150802 (1 hour 37 minutes)

**AFTER**
- https://github.com/apache/spark/pull/30068/checks?check_run_id=1265879896 (1 hour 21 minutes)

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the Github Action with the reduced time.

Closes #30068 from dongjoon-hyun/MOVE3.

Lead-authored-by: Dongjoon Hyun <dongjoon@apache.org>
Co-authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-10-16 12:52:45 -07:00
neko e029e891ab [SPARK-33145][WEBUI] Fix when Succeeded Jobs has many child url elements,they will extend over the edge of the page
### What changes were proposed in this pull request?
On the SQL execution web page, when `Succeeded Jobs` (or `Failed Jobs`) has many child URL elements, they extend over the edge of the page.

### Why are the changes needed?
To make the page more friendly.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

The manual test result is shown below:

![fixed](https://user-images.githubusercontent.com/52202080/95977319-50734600-0e4b-11eb-93c0-b8deb565bcd8.png)

Closes #30035 from akiyamaneko/sql_execution_job_overflow.

Authored-by: neko <echohlne@gmail.com>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
2020-10-16 23:13:22 +08:00
ulysses 3ae1520185 [SPARK-33131][SQL] Fix grouping sets with having clause can not resolve qualified col name
### What changes were proposed in this pull request?

Correct the resolution of having clause.

### Why are the changes needed?

Grouping sets construct a new aggregate that loses the qualified name of the grouping expression. Here is an example:
```
-- Works resolved by `ResolveReferences`
select c1 from values (1) as t1(c1) group by grouping sets(t1.c1) having c1 = 1

-- Works because of the extra expression c1
select c1 as c2 from values (1) as t1(c1) group by grouping sets(t1.c1) having t1.c1 = 1

-- Failed
select c1 from values (1) as t1(c1) group by grouping sets(t1.c1) having t1.c1 = 1
```

It works for `Aggregate` without grouping sets through `ResolveReferences`, but not with grouping sets, since the exprId has been changed.

### Does this PR introduce _any_ user-facing change?

Yes, bug fix.

### How was this patch tested?

add test.

Closes #30029 from ulysses-you/SPARK-33131.

Authored-by: ulysses <youxiduo@weidian.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-10-16 11:26:27 +00:00
xuewei.linxuewei 306872eefa [SPARK-33139][SQL] protect setActiveSession and clearActiveSession
### What changes were proposed in this pull request?

This PR is a sub-task of [SPARK-33138](https://issues.apache.org/jira/browse/SPARK-33138). In order to make SQLConf.get reliable and stable, we need to make sure users can't pollute the SQLConf and SparkSession context by calling setActiveSession and clearActiveSession.

Change of the PR:

* add a legacy config spark.sql.legacy.allowModifyActiveSession to fall back to the old behavior if users really need to call these two APIs (a usage sketch follows this list)
* by default, calling these two APIs will throw an exception
* add two extra internal and private APIs, setActiveSessionInternal and clearActiveSessionInternal, for current internal usage
* change all internal references to the new internal APIs, except for SQLContext.setActive and SQLContext.clearActive
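
A minimal usage sketch of the legacy flag, assuming it is set like any other SQL config at session build time:

```scala
import org.apache.spark.sql.SparkSession

// Opt back into the old behavior; without the flag these calls would throw.
val spark = SparkSession.builder()
  .master("local[*]")
  .config("spark.sql.legacy.allowModifyActiveSession", "true")
  .getOrCreate()

SparkSession.setActiveSession(spark)  // allowed again under the legacy flag
SparkSession.clearActiveSession()
```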

### Why are the changes needed?

Make SQLConf.get reliable and stable.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?

* Add UT in SparkSessionBuilderSuite to test the legacy config
* Existing test

Closes #30042 from leanken/leanken-SPARK-33139.

Authored-by: xuewei.linxuewei <xuewei.linxuewei@alibaba-inc.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-10-16 06:05:17 +00:00
Takeshi Yamamuro a5c17de241 [SPARK-33165][SQL][TEST] Remove dependencies(scalatest,scalactic) from Benchmark
### What changes were proposed in this pull request?

This PR proposes to remove `assert` from `Benchmark` for making it easier to run benchmark codes via `spark-submit`.

### Why are the changes needed?

Since the current `Benchmark` (`master` and `branch-3.0`) has `assert`, we need to pass the proper jars of `scalatest` and `scalactic`;
 - scalatest-core_2.12-3.2.0.jar
 - scalatest-compatible-3.2.0.jar
 - scalactic_2.12-3.0.jar
```
./bin/spark-submit --jars scalatest-core_2.12-3.2.0.jar,scalatest-compatible-3.2.0.jar,scalactic_2.12-3.0.jar,./sql/catalyst/target/spark-catalyst_2.12-3.1.0-SNAPSHOT-tests.jar,./core/target/spark-core_2.12-3.1.0-SNAPSHOT-tests.jar --class org.apache.spark.sql.execution.benchmark.TPCDSQueryBenchmark ./sql/core/target/spark-sql_2.12-3.1.0-SNAPSHOT-tests.jar --data-location /tmp/tpcds-sf1
```

This update lets developers submit benchmark code without these dependencies:
```
./bin/spark-submit --jars ./sql/catalyst/target/spark-catalyst_2.12-3.1.0-SNAPSHOT-tests.jar,./core/target/spark-core_2.12-3.1.0-SNAPSHOT-tests.jar --class org.apache.spark.sql.execution.benchmark.TPCDSQueryBenchmark ./sql/core/target/spark-sql_2.12-3.1.0-SNAPSHOT-tests.jar --data-location /tmp/tpcds-sf1
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manually checked.

Closes #30064 from maropu/RemoveDepInBenchmark.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-10-16 11:39:09 +09:00
Huaxin Gao bf594a9788 [SPARK-32402][SQL][FOLLOW-UP] Add case sensitivity tests for column resolution in ALTER TABLE
### What changes were proposed in this pull request?
Add case sensitivity tests for column resolution in ALTER TABLE

### Why are the changes needed?
To make sure `spark.sql.caseSensitive` works for `ResolveAlterTableChanges`

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
new test

Closes #30063 from huaxingao/caseSensitivity.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-10-16 11:04:35 +09:00
Max Gekk 38c05af1d5 [SPARK-33163][SQL][TESTS] Check the metadata key 'org.apache.spark.legacyDateTime' in Avro/Parquet files
### What changes were proposed in this pull request?
Added a couple tests to `AvroSuite` and to `ParquetIOSuite` to check that the metadata key 'org.apache.spark.legacyDateTime' is written correctly depending on the SQL configs:
- spark.sql.legacy.avro.datetimeRebaseModeInWrite
- spark.sql.legacy.parquet.datetimeRebaseModeInWrite

This is a follow up https://github.com/apache/spark/pull/28137.

### Why are the changes needed?
1. To improve test coverage
2. To make sure that the metadata key is actually saved to Avro/Parquet files

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running the added tests:
```
$ build/sbt "testOnly org.apache.spark.sql.execution.datasources.parquet.ParquetIOSuite"
$ build/sbt "avro/test:testOnly org.apache.spark.sql.avro.AvroV1Suite"
$ build/sbt "avro/test:testOnly org.apache.spark.sql.avro.AvroV2Suite"
```

Closes #30061 from MaxGekk/parquet-test-metakey.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-10-16 10:28:15 +09:00
Denis Pyshev ba69d68d91 [SPARK-33080][BUILD] Replace fatal warnings snippet
### What changes were proposed in this pull request?

The current solution in the build file for failing the build on compilation warnings (excluding deprecation warnings) is not portable past SBT 1.3.13 (the build import fails with a compilation error on SBT 1.4). It can be replaced with something more robust and maintainable, especially since Scala 2.13.2 ships similar built-in functionality.

Additionally, warnings were fixed so the build passes, with as few changes as possible:
warnings in the 2.12 compilation were fixed in code,
warnings in the 2.13 compilation are covered by configuration, to be addressed separately.

### Why are the changes needed?

Unblocks upgrade to SBT after 1.3.13.
Enhances build file maintainability.
Allows fine tune of warnings configuration in scope of Scala 2.13 compilation.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

`build/sbt`'s `compile` and `Test/compile` for both Scala 2.12 and 2.13 profiles.

Closes #29995 from gemelen/feature/warnings-reporter.

Authored-by: Denis Pyshev <git@gemelen.net>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-10-15 14:49:43 -05:00
Huaxin Gao 31f7097ce0 [SPARK-32402][SQL][FOLLOW-UP] Use quoted column name for JDBCTableCatalog.alterTable
### What changes were proposed in this pull request?
Currently, ALTER TABLE uses unquoted column names, e.g. ```ALTER TABLE "test"."alt_table" DROP COLUMN c1```.
This should change to a quoted column name: ```ALTER TABLE "test"."alt_table" DROP COLUMN "c1"```

### Why are the changes needed?
We should always use quoted identifiers in JDBC SQLs, e.g. ```CREATE TABLE "test"."abc" ("col" INTEGER )  ``` or ```INSERT INTO "test"."abc" ("col") VALUES (?)```. Using unquoted column name in alterTable causes problems, for example:
```
sql("CREATE TABLE h2.test.alt_table (c1 INTEGER, c2 INTEGER) USING _")
sql("ALTER TABLE h2.test.alt_table DROP COLUMN c1")

org.apache.spark.sql.AnalysisException: Failed table altering: test.alt_table;
......

Caused by: org.h2.jdbc.JdbcSQLException: Column "C1" not found; SQL statement:
ALTER TABLE "test"."alt_table" DROP COLUMN c1 [42122-195]

```

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing tests

Closes #30041 from huaxingao/alter_table_followup.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-10-15 15:33:23 +00:00
manuzhang 77a8efbc05 [SPARK-32932][SQL] Do not use local shuffle reader at final stage on write command
### What changes were proposed in this pull request?
Do not use the local shuffle reader at the final stage if the root node is a write command.

### Why are the changes needed?
Users usually repartition by the partition column for dynamic partition overwrite. AQE could break this by replacing the physical shuffle with a local shuffle reader, which could lead to a large number of output files, even exceeding the file system limit.
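
The pattern being protected looks roughly like this (standard Spark APIs; `df` is an existing DataFrame and `dt` an assumed partition column):

```scala
// Dynamic partition overwrite where the user repartitions by the partition
// column so each partition is written by a single task. Removing the shuffle
// at the final stage would multiply the number of output files.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

import spark.implicits._

df.repartition($"dt")
  .write
  .mode("overwrite")
  .partitionBy("dt")
  .parquet("/path/to/table")
```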

### Does this PR introduce _any_ user-facing change?
Yes.

### How was this patch tested?
Add test.

Closes #29797 from manuzhang/spark-32932.

Authored-by: manuzhang <owenzhang1990@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-10-15 05:53:32 +00:00
Wenchen Fan f3ad32f4b6 [SPARK-33026][SQL][FOLLOWUP] metrics name should be numOutputRows
### What changes were proposed in this pull request?

Follow the convention and rename the metrics `numRows` to `numOutputRows`

### Why are the changes needed?

`FilterExec`, `HashAggregateExec`, etc. all use `numOutputRows`

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

existing tests

Closes #30039 from cloud-fan/minor.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-10-14 16:17:28 +00:00
Jungtaek Lim (HeartSaVioR) 8e5cb1d276 [SPARK-33136][SQL] Fix mistakenly swapped parameter in V2WriteCommand.outputResolved
### What changes were proposed in this pull request?

This PR proposes to fix a bug where `DataType.equalsIgnoreCompatibleNullability` is called with mistakenly swapped parameters in `V2WriteCommand.outputResolved`. The parameters of `DataType.equalsIgnoreCompatibleNullability` are `from` and `to`, in that order, which means the right order of the matched variables is `inAttr` then `outAttr`.
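
A sketch of why the order matters (the method itself is internal to Spark's sql package, so the calls are shown as comments; the types just illustrate the asymmetry):

```scala
import org.apache.spark.sql.types._

// Query output: array with non-nullable elements; table column: nullable elements.
val queryType = ArrayType(LongType, containsNull = false)
val tableType = ArrayType(LongType, containsNull = true)

// from = what is written, to = what the table declares:
//   DataType.equalsIgnoreCompatibleNullability(queryType, tableType)  // true: the write should resolve
//   DataType.equalsIgnoreCompatibleNullability(tableType, queryType)  // false: the swapped (buggy) call
```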

### Why are the changes needed?

Spark throws an AnalysisException due to an unresolved operator in v2 writes; the operator is unresolved because the parameters passed to `DataType.equalsIgnoreCompatibleNullability` in `outputResolved` were swapped.

### Does this PR introduce _any_ user-facing change?

Yes, end users no longer hit the unresolved-operator error in v2 writes when writing a dataframe containing non-nullable complex types into a table whose matching complex types are nullable.

### How was this patch tested?

New UT added.

Closes #30033 from HeartSaVioR/SPARK-33136.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-10-14 08:30:03 -07:00
Richard Penney d8c4a47ea1 [SPARK-33061][SQL] Expose inverse hyperbolic trig functions through sql.functions API
This patch is a small extension to change-request SPARK-28133, which added inverse hyperbolic functions to the SQL interpreter, but did not include those methods within the Scala `sql.functions._` API. This patch makes `acosh`, `asinh` and `atanh` functions available through the Scala API.

Unit-tests have been added to `sql/core/src/test/scala/org/apache/spark/sql/MathFunctionsSuite.scala`. Manual testing has been done via `spark-shell`, using the following recipe:
```
val df = spark.range(0, 11)
              .toDF("x")
              .withColumn("x", ($"x" - 5) / 2.0)
val hyps = df.withColumn("tanh", tanh($"x"))
             .withColumn("sinh", sinh($"x"))
             .withColumn("cosh", cosh($"x"))
val invhyps = hyps.withColumn("atanh", atanh($"tanh"))
                  .withColumn("asinh", asinh($"sinh"))
                  .withColumn("acosh", acosh($"cosh"))
invhyps.show
```
which produces the following output:
```
+----+--------------------+-------------------+------------------+-------------------+-------------------+------------------+
|   x|                tanh|               sinh|              cosh|              atanh|              asinh|             acosh|
+----+--------------------+-------------------+------------------+-------------------+-------------------+------------------+
|-2.5| -0.9866142981514303|-6.0502044810397875| 6.132289479663686| -2.500000000000001|-2.4999999999999956|               2.5|
|-2.0| -0.9640275800758169| -3.626860407847019|3.7621956910836314|-2.0000000000000004|-1.9999999999999991|               2.0|
|-1.5| -0.9051482536448664|-2.1292794550948173| 2.352409615243247|-1.4999999999999998|-1.4999999999999998|               1.5|
|-1.0| -0.7615941559557649|-1.1752011936438014| 1.543080634815244|               -1.0|               -1.0|               1.0|
|-0.5|-0.46211715726000974|-0.5210953054937474|1.1276259652063807|               -0.5|-0.5000000000000002|0.4999999999999998|
| 0.0|                 0.0|                0.0|               1.0|                0.0|                0.0|               0.0|
| 0.5| 0.46211715726000974| 0.5210953054937474|1.1276259652063807|                0.5|                0.5|0.4999999999999998|
| 1.0|  0.7615941559557649| 1.1752011936438014| 1.543080634815244|                1.0|                1.0|               1.0|
| 1.5|  0.9051482536448664| 2.1292794550948173| 2.352409615243247| 1.4999999999999998|                1.5|               1.5|
| 2.0|  0.9640275800758169|  3.626860407847019|3.7621956910836314| 2.0000000000000004|                2.0|               2.0|
| 2.5|  0.9866142981514303| 6.0502044810397875| 6.132289479663686|  2.500000000000001|                2.5|               2.5|
+----+--------------------+-------------------+------------------+-------------------+-------------------+------------------+
```

Closes #29938 from rwpenney/fix/inverse-hyperbolics.

Authored-by: Richard Penney <rwp@rwpenney.uk>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-10-14 08:48:55 -05:00
Max Gekk 05a62dcada [SPARK-33134][SQL] Return partial results only for root JSON objects
### What changes were proposed in this pull request?
In the PR, I propose to restrict the partial result feature to root JSON objects only. The JSON datasource as well as `from_json()` will return `null` for malformed nested JSON objects.

### Why are the changes needed?
1. To not raise an exception to users in the PERMISSIVE mode
2. To fix a regression and to have the same behavior as Spark 2.4.x
3. The current implementation of partial results is supposed to work only for root (top-level) JSON objects, and it is not tested against bad nested complex JSON fields.

### Does this PR introduce _any_ user-facing change?
Yes. Before the changes, the code below:
```scala
    val pokerhand_raw = Seq("""[{"cards": [19], "playerId": 123456}]""").toDF("events")
    val event = new StructType().add("playerId", LongType).add("cards", ArrayType(new StructType().add("id", LongType).add("rank", StringType)))
    val pokerhand_events = pokerhand_raw.select(from_json($"events", ArrayType(event)).as("event"))
    pokerhand_events.show
```
throws the exception even in the default **PERMISSIVE** mode:
```java
java.lang.ClassCastException: java.lang.Long cannot be cast to org.apache.spark.sql.catalyst.util.ArrayData
  at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getArray(rows.scala:48)
  at org.apache.spark.sql.catalyst.expressions.BaseGenericInternalRow.getArray$(rows.scala:48)
  at org.apache.spark.sql.catalyst.expressions.GenericInternalRow.getArray(rows.scala:195)
```

After the changes:
```
+-----+
|event|
+-----+
| null|
+-----+
```

### How was this patch tested?
Added a test to `JsonFunctionsSuite`.

Closes #30031 from MaxGekk/json-skip-row-wrong-schema.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-10-14 12:13:54 +09:00
Prashant Sharma 304ca1ec93 [SPARK-33129][BUILD][DOCS] Updating the build/sbt references to test-only with testOnly for SBT 1.3.x
### What changes were proposed in this pull request?

Replaced `test-only` with `testOnly` in docs across the project.

### Why are the changes needed?

Since the SBT version was updated, the older way of running tests, i.e. `test-only`, is no longer valid.

### Does this PR introduce _any_ user-facing change?

docs update.

### How was this patch tested?

Manually.

Closes #30028 from ScrapCodes/fix-build/sbt-sample.

Authored-by: Prashant Sharma <prashsh1@in.ibm.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-10-13 09:21:06 -07:00
xuewei.linxuewei dc697a8b59 [SPARK-13860][SQL] Change statistical aggregate function to return null instead of Double.NaN when divideByZero
### What changes were proposed in this pull request?

As [SPARK-13860](https://issues.apache.org/jira/browse/SPARK-13860) stated, TPCDS Query 39 returns wrong results using SparkSQL. The root cause is that when stddev_samp is applied to a single-element set, the TPCDS answer is null, while SparkSQL returns Double.NaN, which causes the wrong result.

Add an extra legacy config to fall back to the NaN behavior, and return null by default to align with the TPCDS standard.
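
A quick illustration (sketch) of the new default and the legacy flag named in the migration note below, assuming an active SparkSession `spark`:

```scala
// stddev_samp over a single-element group triggers the divide-by-zero case.
spark.sql("SELECT stddev_samp(c) FROM VALUES (1) AS t(c)").show()
// New default: NULL (matches the TPCDS answer)

spark.conf.set("spark.sql.legacy.statisticalAggregate", "true")
spark.sql("SELECT stddev_samp(c) FROM VALUES (1) AS t(c)").show()
// Legacy behavior: NaN
```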

### Why are the changes needed?

SQL correctness issue.

### Does this PR introduce any user-facing change?
Yes. See sql-migration-guide

In Spark 3.1, statistical aggregation functions, including `std`, `stddev`, `stddev_samp`, `variance`, `var_samp`, `skewness`, `kurtosis`, `covar_samp` and `corr`, will return `NULL` instead of `Double.NaN` when `DivideByZero` occurs during expression evaluation, for example, when `stddev_samp` is applied to a single-element set. In Spark version 3.0 and earlier, they return `Double.NaN` in such cases. To restore the behavior before Spark 3.1, you can set `spark.sql.legacy.statisticalAggregate` to `true`.

### How was this patch tested?
Updated DataFrameAggregateSuite/DataFrameWindowFunctionsSuite to test both default and legacy behavior.
Adjusted DataFrameWindowFunctionsSuite/SQLQueryTestSuite and some R cases to the default return-null behavior.

Closes #29983 from leanken/leanken-SPARK-13860.

Authored-by: xuewei.linxuewei <xuewei.linxuewei@alibaba-inc.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-10-13 13:21:45 +00:00
Huaxin Gao af3e2f7d58 [SPARK-33081][SQL] Support ALTER TABLE in JDBC v2 Table Catalog: update type and nullability of columns (DB2 dialect)
### What changes were proposed in this pull request?
- Override the default SQL strings in the DB2 Dialect for:

  * ALTER TABLE UPDATE COLUMN TYPE
  * ALTER TABLE UPDATE COLUMN NULLABILITY

- Add new docker integration test suite jdbc/v2/DB2IntegrationSuite.scala

### Why are the changes needed?
In SPARK-24907, we implemented JDBC v2 Table Catalog but it doesn't support some ALTER TABLE at the moment. This PR supports DB2 specific ALTER TABLE.

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
By running new integration test suite:

$ ./build/sbt -Pdocker-integration-tests "test-only *.DB2IntegrationSuite"

Closes #29972 from huaxingao/db2_docker.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-10-13 12:57:54 +00:00
Chao Sun feee8da14b [SPARK-32858][SQL] UnwrapCastInBinaryComparison: support other numeric types
### What changes were proposed in this pull request?

In SPARK-24994 we implemented unwrapping cast for **integral types**. This extends it to support **numeric types** such as float/double/decimal, so that filters involving these types can be better pushed down to data sources.

Unlike the cases of integral types, conversions between numeric types can result to rounding up or downs. Consider the following case:

```sql
cast(e as double) < 1.9
```

Assume the type of `e` is short. Since 1.9 is not representable in that type, the cast will either truncate or round. Now suppose the literal is truncated; we cannot convert the expression to:

```sql
e < cast(1.9 as short)
```

as in the previous implementation, since if `e` is 1, the original expression evaluates to true, but the converted expression will evaluate to false.

To resolve the above, this PR first finds out whether casting from the wider type to the narrower type results in truncation or rounding, by comparing a _roundtrip value_ derived from **converting the literal first to the narrower type, and then back to the wider type**, against the original literal value. For instance, in the above, we first obtain a roundtrip value via the conversion (double) 1.9 -> (short) 1 -> (double) 1.0, and then compare it against 1.9.
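
A minimal sketch of that roundtrip check for `cast(e as double) < 1.9` with a short column:

```scala
// Roundtrip the literal through the narrower type and compare with the original.
val lit: Double = 1.9
val narrowed: Short = lit.toShort          // 1 (truncated)
val roundtrip: Double = narrowed.toDouble  // 1.0

if (roundtrip < lit) {
  // Truncation lost part of the literal, so rewrite `cast(e as double) < 1.9`
  // as `e <= cast(1.9 as short)`, which is still true for e = 1.
}
```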

<img width="1153" alt="Screen Shot 2020-09-28 at 3 30 27 PM" src="https://user-images.githubusercontent.com/506679/94492719-bd29e780-019f-11eb-9111-71d6e3d157f7.png">

Now in the case of truncate, we'd convert the original expression to:
```sql
e <= cast(1.9 as short)
```
instead, so that the conversion also is valid when `e` is 1.

For more details, please check [this blog post](https://prestosql.io/blog/2019/05/21/optimizing-the-casts-away.html) by Presto which offers a very good explanation on how it works.

### Why are the changes needed?

For queries such as:
```sql
SELECT * FROM tbl WHERE short_col < 100.5
```
The predicate `short_col < 100.5` can't be pushed down to data sources because it involves casts. This eliminates the cast so these queries can run more efficiently.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit tests

Closes #29792 from sunchao/SPARK-32858.

Lead-authored-by: Chao Sun <sunchao@apple.com>
Co-authored-by: Chao Sun <sunchao@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-10-13 12:44:20 +00:00
Yuming Wang e34f2d8df2 [SPARK-33119][SQL] ScalarSubquery should returns the first two rows to avoid Driver OOM
### What changes were proposed in this pull request?

`ScalarSubquery` should return only the first two rows.

### Why are the changes needed?

To avoid Driver OOM.
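
The idea, sketched with an illustrative helper rather than the actual Spark code: fetching at most two rows is enough to detect the "more than one row returned" error without materializing the whole result on the driver.

```scala
// Take at most two rows: one is the scalar value, a second one proves the
// subquery is invalid; anything beyond that never needs to reach the driver.
def scalarSubqueryValue(rows: Iterator[Any]): Any = {
  val firstTwo = rows.take(2).toSeq
  if (firstTwo.length > 1) {
    throw new IllegalStateException(
      "more than one row returned by a subquery used as an expression")
  }
  firstTwo.headOption.orNull
}
```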

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing test: d6f3138352/sql/core/src/test/scala/org/apache/spark/sql/SubquerySuite.scala (L147-L154)

Closes #30016 from wangyum/SPARK-33119.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-10-13 17:41:55 +09:00
Pablo 819f12ee2f [SPARK-33118][SQL] CREATE TEMPORARY TABLE fails with location
### What changes were proposed in this pull request?

There is a problem when using CREATE TEMPORARY TABLE with LOCATION:

```scala
spark.range(3).write.parquet("/tmp/testspark1")

sql("CREATE TEMPORARY TABLE t USING parquet OPTIONS (path '/tmp/testspark1')")
sql("CREATE TEMPORARY TABLE t USING parquet LOCATION '/tmp/testspark1'")
```
```scala
org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.;
  at org.apache.spark.sql.execution.datasources.DataSource.$anonfun$getOrInferFileFormatSchema$12(DataSource.scala:200)
  at scala.Option.getOrElse(Option.scala:189)
  at org.apache.spark.sql.execution.datasources.DataSource.getOrInferFileFormatSchema(DataSource.scala:200)
  at org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:408)
  at org.apache.spark.sql.execution.datasources.CreateTempViewUsing.run(ddl.scala:94)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
  at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
  at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:229)
  at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3618)
  at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:100)
  at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160)
  at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:87)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3616)
  at org.apache.spark.sql.Dataset.<init>(Dataset.scala:229)
  at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:100)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
  at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:97)
  at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:607)
  at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:602)
```
This bug was introduced by SPARK-30507.
SparkSqlParser --> visitCreateTable --> visitCreateTableClauses --> cleanTableOptions extracts the path from the options, but in this case CreateTempViewUsing needs the path in the options map.

### Why are the changes needed?

To fix the problem

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit testing and manual testing

Closes #30014 from planga82/bugfix/SPARK-33118_create_temp_table_location.

Authored-by: Pablo <pablo.langa@stratio.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-10-12 14:18:34 -07:00
xuewei.linxuewei b27a287ff2 [SPARK-33016][SQL] Potential SQLMetrics missed which might cause WEB UI display issue while AQE is on
### What changes were proposed in this pull request?

In the following scenario, when AQE is on, SQLMetrics could be incorrect.

1. Stage A and B are created, and UI updated thru event onAdaptiveExecutionUpdate.
2. Stage A and B are running. Subquery in stage A keep updating metrics thru event onAdaptiveSQLMetricUpdate.
3. Stage B completes, while stage A's subquery is still running, updating metrics.
4. Completion of stage B triggers new stage creation and UI update thru event onAdaptiveExecutionUpdate again (just like step 1).

So we decided to make a trade-off: keep more (possibly duplicate) SQLMetrics and not delete them when AQE updates with a new plan.

### Why are the changes needed?

Make SQLMetrics behavior 100% correct.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Updated SQLAppStatusListenerSuite.

Closes #29965 from leanken/leanken-SPARK-33016.

Authored-by: xuewei.linxuewei <xuewei.linxuewei@alibaba-inc.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-10-12 14:48:40 +00:00
Liang-Chi Hsieh 78c0967bbe [SPARK-33092][SQL] Support subexpression elimination in ProjectExec
### What changes were proposed in this pull request?

This patch proposes to add subexpression elimination support into `ProjectExec`. It can be controlled by `spark.sql.subexpressionElimination.enabled` config.

Before this change:

```scala
val df = spark.read.option("header", true).csv("/tmp/test.csv")
 df.withColumn("my_map", expr("str_to_map(foo, '&', '=')")).select(col("my_map")("foo"), col("my_map")("bar"), col("my_map")("baz")).debugCodegen
```

L27-40: first `str_to_map`.
L68-81: second `str_to_map`.
L109-122: third `str_to_map`.

```
/* 024 */   private void project_doConsume_0(InternalRow inputadapter_row_0, UTF8String project_expr_0_0, boolean project_exprIsNull_0_0) throws java.io.IOException {
/* 025 */     boolean project_isNull_0 = true;
/* 026 */     UTF8String project_value_0 = null;
/* 027 */     boolean project_isNull_1 = true;
/* 028 */     MapData project_value_1 = null;
/* 029 */
/* 030 */     if (!project_exprIsNull_0_0) {
/* 031 */       project_isNull_1 = false; // resultCode could change nullability.
/* 032 */
/* 033 */       UTF8String[] project_kvs_0 = project_expr_0_0.split(((UTF8String) references[1] /* literal */), -1);
/* 034 */       for(UTF8String kvEntry: project_kvs_0) {
/* 035 */         UTF8String[] kv = kvEntry.split(((UTF8String) references[2] /* literal */), 2);
/* 036 */         ((org.apache.spark.sql.catalyst.util.ArrayBasedMapBuilder) references[0] /* mapBuilder */).put(kv[0], kv.length == 2 ? kv[1] : null);
/* 037 */       }
/* 038 */       project_value_1 = ((org.apache.spark.sql.catalyst.util.ArrayBasedMapBuilder) references[0] /* mapBuilder */).build();
/* 039 */
/* 040 */     }
/* 041 */     if (!project_isNull_1) {
/* 042 */       project_isNull_0 = false; // resultCode could change nullability.
/* 043 */
/* 044 */       final int project_length_0 = project_value_1.numElements();
/* 045 */       final ArrayData project_keys_0 = project_value_1.keyArray();
/* 046 */       final ArrayData project_values_0 = project_value_1.valueArray();
/* 047 */
/* 048 */       int project_index_0 = 0;
/* 049 */       boolean project_found_0 = false;
/* 050 */       while (project_index_0 < project_length_0 && !project_found_0) {
/* 051 */         final UTF8String project_key_0 = project_keys_0.getUTF8String(project_index_0);
/* 052 */         if (project_key_0.equals(((UTF8String) references[3] /* literal */))) {
/* 053 */           project_found_0 = true;
/* 054 */         } else {
/* 055 */           project_index_0++;
/* 056 */         }
/* 057 */       }
/* 058 */
/* 059 */       if (!project_found_0 || project_values_0.isNullAt(project_index_0)) {
/* 060 */         project_isNull_0 = true;
/* 061 */       } else {
/* 062 */         project_value_0 = project_values_0.getUTF8String(project_index_0);
/* 063 */       }
/* 064 */
/* 065 */     }
/* 066 */     boolean project_isNull_6 = true;
/* 067 */     UTF8String project_value_6 = null;
/* 068 */     boolean project_isNull_7 = true;
/* 069 */     MapData project_value_7 = null;
/* 070 */
/* 071 */     if (!project_exprIsNull_0_0) {
/* 072 */       project_isNull_7 = false; // resultCode could change nullability.
/* 073 */
/* 074 */       UTF8String[] project_kvs_1 = project_expr_0_0.split(((UTF8String) references[5] /* literal */), -1);
/* 075 */       for(UTF8String kvEntry: project_kvs_1) {
/* 076 */         UTF8String[] kv = kvEntry.split(((UTF8String) references[6] /* literal */), 2);
/* 077 */         ((org.apache.spark.sql.catalyst.util.ArrayBasedMapBuilder) references[4] /* mapBuilder */).put(kv[0], kv.length == 2 ? kv[1] : null);
/* 078 */       }
/* 079 */       project_value_7 = ((org.apache.spark.sql.catalyst.util.ArrayBasedMapBuilder) references[4] /* mapBuilder */).build();
/* 080 */
/* 081 */     }
/* 082 */     if (!project_isNull_7) {
/* 083 */       project_isNull_6 = false; // resultCode could change nullability.
/* 084 */
/* 085 */       final int project_length_1 = project_value_7.numElements();
/* 086 */       final ArrayData project_keys_1 = project_value_7.keyArray();
/* 087 */       final ArrayData project_values_1 = project_value_7.valueArray();
/* 088 */
/* 089 */       int project_index_1 = 0;
/* 090 */       boolean project_found_1 = false;
/* 091 */       while (project_index_1 < project_length_1 && !project_found_1) {
/* 092 */         final UTF8String project_key_1 = project_keys_1.getUTF8String(project_index_1);
/* 093 */         if (project_key_1.equals(((UTF8String) references[7] /* literal */))) {
/* 094 */           project_found_1 = true;
/* 095 */         } else {
/* 096 */           project_index_1++;
/* 097 */         }
/* 098 */       }
/* 099 */
/* 100 */       if (!project_found_1 || project_values_1.isNullAt(project_index_1)) {
/* 101 */         project_isNull_6 = true;
/* 102 */       } else {
/* 103 */         project_value_6 = project_values_1.getUTF8String(project_index_1);
/* 104 */       }
/* 105 */
/* 106 */     }
/* 107 */     boolean project_isNull_12 = true;
/* 108 */     UTF8String project_value_12 = null;
/* 109 */     boolean project_isNull_13 = true;
/* 110 */     MapData project_value_13 = null;
/* 111 */
/* 112 */     if (!project_exprIsNull_0_0) {
/* 113 */       project_isNull_13 = false; // resultCode could change nullability.
/* 114 */
/* 115 */       UTF8String[] project_kvs_2 = project_expr_0_0.split(((UTF8String) references[9] /* literal */), -1);
/* 116 */       for(UTF8String kvEntry: project_kvs_2) {
/* 117 */         UTF8String[] kv = kvEntry.split(((UTF8String) references[10] /* literal */), 2);
/* 118 */         ((org.apache.spark.sql.catalyst.util.ArrayBasedMapBuilder) references[8] /* mapBuilder */).put(kv[0], kv.length == 2 ? kv[1] : null);
/* 119 */       }
/* 120 */       project_value_13 = ((org.apache.spark.sql.catalyst.util.ArrayBasedMapBuilder) references[8] /* mapBuilder */).build();
/* 121 */
/* 122 */     }
...
```
After this change:

L27-40 evaluates the common map variable.

```
/* 024 */   private void project_doConsume_0(InternalRow inputadapter_row_0, UTF8String project_expr_0_0, boolean project_exprIsNull_0_0) throws java.io.IOException {
/* 025 */     // common sub-expressions
/* 026 */
/* 027 */     boolean project_isNull_0 = true;
/* 028 */     MapData project_value_0 = null;
/* 029 */
/* 030 */     if (!project_exprIsNull_0_0) {
/* 031 */       project_isNull_0 = false; // resultCode could change nullability.
/* 032 */
/* 033 */       UTF8String[] project_kvs_0 = project_expr_0_0.split(((UTF8String) references[1] /* literal */), -1);
/* 034 */       for(UTF8String kvEntry: project_kvs_0) {
/* 035 */         UTF8String[] kv = kvEntry.split(((UTF8String) references[2] /* literal */), 2);
/* 036 */         ((org.apache.spark.sql.catalyst.util.ArrayBasedMapBuilder) references[0] /* mapBuilder */).put(kv[0], kv.length == 2 ? kv[1] : null);
/* 037 */       }
/* 038 */       project_value_0 = ((org.apache.spark.sql.catalyst.util.ArrayBasedMapBuilder) references[0] /* mapBuilder */).build();
/* 039 */
/* 040 */     }
/* 041 */
/* 042 */     boolean project_isNull_4 = true;
/* 043 */     UTF8String project_value_4 = null;
/* 044 */
/* 045 */     if (!project_isNull_0) {
/* 046 */       project_isNull_4 = false; // resultCode could change nullability.
/* 047 */
/* 048 */       final int project_length_0 = project_value_0.numElements();
/* 049 */       final ArrayData project_keys_0 = project_value_0.keyArray();
/* 050 */       final ArrayData project_values_0 = project_value_0.valueArray();
/* 051 */
/* 052 */       int project_index_0 = 0;
/* 053 */       boolean project_found_0 = false;
/* 054 */       while (project_index_0 < project_length_0 && !project_found_0) {
/* 055 */         final UTF8String project_key_0 = project_keys_0.getUTF8String(project_index_0);
/* 056 */         if (project_key_0.equals(((UTF8String) references[3] /* literal */))) {
/* 057 */           project_found_0 = true;
/* 058 */         } else {
/* 059 */           project_index_0++;
/* 060 */         }
/* 061 */       }
/* 062 */
/* 063 */       if (!project_found_0 || project_values_0.isNullAt(project_index_0)) {
/* 064 */         project_isNull_4 = true;
/* 065 */       } else {
/* 066 */         project_value_4 = project_values_0.getUTF8String(project_index_0);
/* 067 */       }
/* 068 */
/* 069 */     }
/* 070 */     boolean project_isNull_6 = true;
/* 071 */     UTF8String project_value_6 = null;
/* 072 */
/* 073 */     if (!project_isNull_0) {
/* 074 */       project_isNull_6 = false; // resultCode could change nullability.
/* 075 */
/* 076 */       final int project_length_1 = project_value_0.numElements();
/* 077 */       final ArrayData project_keys_1 = project_value_0.keyArray();
/* 078 */       final ArrayData project_values_1 = project_value_0.valueArray();
/* 079 */
/* 080 */       int project_index_1 = 0;
/* 081 */       boolean project_found_1 = false;
/* 082 */       while (project_index_1 < project_length_1 && !project_found_1) {
/* 083 */         final UTF8String project_key_1 = project_keys_1.getUTF8String(project_index_1);
/* 084 */         if (project_key_1.equals(((UTF8String) references[4] /* literal */))) {
/* 085 */           project_found_1 = true;
/* 086 */         } else {
/* 087 */           project_index_1++;
/* 088 */         }
/* 089 */       }
/* 090 */
/* 091 */       if (!project_found_1 || project_values_1.isNullAt(project_index_1)) {
/* 092 */         project_isNull_6 = true;
/* 093 */       } else {
/* 094 */         project_value_6 = project_values_1.getUTF8String(project_index_1);
/* 095 */       }
/* 096 */
/* 097 */     }
/* 098 */     boolean project_isNull_8 = true;
/* 099 */     UTF8String project_value_8 = null;
/* 100 */
...
```

When the code is split into separate methods:

```
/* 026 */   private void project_doConsume_0(InternalRow inputadapter_row_0, UTF8String project_expr_0_0, boolean project_exprIsNull_0_0) throws java.io.IOException {
/* 027 */     // common sub-expressions
/* 028 */
/* 029 */     MapData project_subExprValue_0 = project_subExpr_0(project_exprIsNull_0_0, project_expr_0_0);
/* 030 */
...
/* 140 */   private MapData project_subExpr_0(boolean project_exprIsNull_0_0, org.apache.spark.unsafe.types.UTF8String project_expr_0_0) {
/* 141 */     boolean project_isNull_0 = true;
/* 142 */     MapData project_value_0 = null;
/* 143 */
/* 144 */     if (!project_exprIsNull_0_0) {
/* 145 */       project_isNull_0 = false; // resultCode could change nullability.
/* 146 */
/* 147 */       UTF8String[] project_kvs_0 = project_expr_0_0.split(((UTF8String) references[1] /* literal */), -1);
/* 148 */       for(UTF8String kvEntry: project_kvs_0) {
/* 149 */         UTF8String[] kv = kvEntry.split(((UTF8String) references[2] /* literal */), 2);
/* 150 */         ((org.apache.spark.sql.catalyst.util.ArrayBasedMapBuilder) references[0] /* mapBuilder */).put(kv[0], kv.length == 2 ? kv[1] : null);
/* 151 */       }
/* 152 */       project_value_0 = ((org.apache.spark.sql.catalyst.util.ArrayBasedMapBuilder) references[0] /* mapBuilder */).build();
/* 153 */
/* 154 */     }
/* 155 */     project_subExprIsNull_0 = project_isNull_0;
/* 156 */     return project_value_0;
/* 157 */   }
```

### Why are the changes needed?

Users occasionally write repeated expressions in a projection. It is also possible that the query optimizer optimizes a query so that the same expression is evaluated many times in a Project. Currently, ProjectExec does not support subexpression elimination in whole-stage codegen. We can support it to reduce redundant evaluation.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

`spark.sql.subexpressionElimination.enabled` is enabled by default, so all existing tests should pass with this change.

Closes #29975 from viirya/SPARK-33092.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-10-12 16:54:21 +09:00
Yuming Wang 543d59dfbf [SPARK-33107][BUILD][FOLLOW-UP] Remove com.twitter:parquet-hadoop-bundle:1.6.0 and orc.classifier
### What changes were proposed in this pull request?

This PR removes `com.twitter:parquet-hadoop-bundle:1.6.0` and `orc.classifier`.

### Why are the changes needed?

To make the code clearer and more readable.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing test.

Closes #30005 from wangyum/SPARK-33107.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-10-11 21:54:56 -07:00
Gabor Somogyi 4af1ac9384 [SPARK-32047][SQL] Add JDBC connection provider disable possibility
### What changes were proposed in this pull request?
At the moment there is no way to turn off JDBC authentication providers that exist on the classpath. This can be problematic because the providers are loaded with the service loader. In this PR I've added the `spark.sql.sources.disabledJdbcConnProviderList` configuration option (default: empty).
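
A minimal usage sketch (the provider names below are placeholders; the option can equally be passed via `--conf` at submit time):
```scala
import org.apache.spark.sql.SparkSession

// Hypothetical provider names; any provider listed here is ignored
// even if its implementation is present on the classpath.
val spark = SparkSession.builder()
  .appName("disable-jdbc-providers-example")
  .config("spark.sql.sources.disabledJdbcConnProviderList", "db2,mssql")
  .getOrCreate()
```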

### Why are the changes needed?
There is currently no way to turn off JDBC authentication providers.

### Does this PR introduce _any_ user-facing change?
Yes, it introduces a new configuration option.

### How was this patch tested?
* Existing + newly added unit tests.
* Existing integration tests.

Closes #29964 from gaborgsomogyi/SPARK-32047.

Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-10-12 12:24:54 +09:00
Gabor Somogyi 1e63dcc8f0 [SPARK-33102][SQL] Use stringToSeq on SQL list typed parameters
### What changes were proposed in this pull request?
While implementing the JDBC provider disable functionality, it came up [here](https://github.com/apache/spark/pull/29964#discussion_r501786746) that `Utils.stringToSeq` must be used when handling String-list-typed SQL parameters. In this PR I've fixed the problematic parameters.
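
Roughly what the parsing boils down to (the helper below only mimics `Utils.stringToSeq`; it is not the Spark implementation, and the values are illustrative):
```scala
// Split a comma-separated SQL parameter into trimmed, non-empty entries,
// instead of treating the whole string as a single value.
def parseList(str: String): Seq[String] =
  str.split(",").map(_.trim).filter(_.nonEmpty).toSeq

parseList("db2, mssql,")   // Seq("db2", "mssql")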

### Why are the changes needed?
`Utils.stringToSeq` must be used when handling String-list-typed SQL parameters.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing unit tests.

Closes #29989 from gaborgsomogyi/SPARK-33102.

Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-10-10 13:53:09 +09:00
HyukjinKwon 2e07ed3041 [SPARK-33082][SPARK-20202][BUILD][SQL][FOLLOW-UP] Remove Hive 1.2 workarounds and Hive 1.2 profile in Jenkins script
### What changes were proposed in this pull request?

This PR removes the leftover of Hive 1.2 workarounds and Hive 1.2 profile in Jenkins script.

- `test-hive1.2` title is not used anymore in Jenkins
- Remove some comments related to Hive 1.2
- Remove unused code related to Hive 1.2 in `OrcFilters.scala`
- Test `spark.sql.hive.convertMetastoreOrc` disabled case for the tests added at SPARK-19809 and SPARK-22267

### Why are the changes needed?

To remove unused code & improve test coverage

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

Manually ran the unit tests. It will also be tested by CI in this PR.

Closes #29973 from HyukjinKwon/SPARK-33082-SPARK-20202.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-10-09 03:04:26 -07:00
Jungtaek Lim (HeartSaVioR) edb140eb5c [SPARK-32896][SS] Add DataStreamWriter.table API
### What changes were proposed in this pull request?

This PR proposes to add `DataStreamWriter.table` to specify the output "table" to write from the streaming query.

### Why are the changes needed?

For now, there's no way to write to a table (especially a catalog table) even if the table is capable of handling streaming writes, so even with Spark 3, writing to a catalog table via Structured Streaming has to go through `DataStreamWriter.format(provider)` and hope the provider can handle it the same way we do with a catalog table.

With the new API, we can directly point to a catalog table which supports streaming writes. Some usages are covered with tests; simply put, end users can do the following:

```scala
// assuming `testcat` is a custom catalog, and `ns` is a namespace in the catalog
spark.sql("CREATE TABLE testcat.ns.table1 (id bigint, data string) USING foo")

val query = inputDF
      .writeStream
      .table("testcat.ns.table1")
      .option(...)
      .start()
```

### Does this PR introduce _any_ user-facing change?

Yes, as this adds a new public API in DataStreamWriter. This doesn't introduce a backward-incompatible change.

### How was this patch tested?

New unit tests.

Closes #29767 from HeartSaVioR/SPARK-32896.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-10-09 03:01:54 -07:00
ulysses a9077299d7 [SPARK-32743][SQL] Add distinct info at UnresolvedFunction toString
### What changes were proposed in this pull request?

Add distinct info at `UnresolvedFunction.toString`.

### Why are the changes needed?

Make `UnresolvedFunction` info complete.

```
create table test (c1 int, c2 int);
explain extended select sum(distinct c1) from test;

-- before this pr
== Parsed Logical Plan ==
'Project [unresolvedalias('sum('c1), None)]
+- 'UnresolvedRelation [test]

-- after this pr
== Parsed Logical Plan ==
'Project [unresolvedalias('sum(distinct 'c1), None)]
+- 'UnresolvedRelation [test]
```

### Does this PR introduce _any_ user-facing change?

Yes, the distinct info now appears in the parsed plan.

### How was this patch tested?

manual test.

Closes #29586 from ulysses-you/SPARK-32743.

Authored-by: ulysses <youxiduo@weidian.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-10-09 09:25:22 +09:00
Max Gekk c5f6af9f17 [SPARK-33094][SQL] Make ORC format propagate Hadoop config from DS options to underlying HDFS file system
### What changes were proposed in this pull request?
Propagate ORC options to Hadoop configs in Hive `OrcFileFormat` and in the regular ORC datasource.

### Why are the changes needed?
There is a bug where, when running:
```scala
spark.read.format("orc").options(conf).load(path)
```
the underlying file system does not receive the conf options.

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
Added UT to `OrcSourceSuite`.

Closes #29976 from MaxGekk/orc-option-propagation.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-10-08 11:59:30 -07:00
HyukjinKwon 5effa8ea26 [SPARK-33091][SQL] Avoid using map instead of foreach to avoid potential side effect at callers of OrcUtils.readCatalystSchema
### What changes were proposed in this pull request?

This is a kind of a followup of SPARK-32646. New JIRA was filed to control the fixed versions properly.

When you use `map`, it might be lazily evaluated and not executed. To avoid this, we should use `foreach` instead. See also SPARK-16694. The current code does not appear to cause any bug for now, but it is best to fix it to avoid potential issues.

### Why are the changes needed?

To avoid potential issues from `map` being lazy and not executed.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Ran related tests. CI in this PR should verify.

Closes #29974 from HyukjinKwon/SPARK-32646.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-10-08 16:29:15 +09:00
Max Gekk 7d6e3fb998 [SPARK-33074][SQL] Classify dialect exceptions in JDBC v2 Table Catalog
### What changes were proposed in this pull request?
1. Add new method to the `JdbcDialect` class - `classifyException()`. It converts dialect specific exception to Spark's `AnalysisException` or its sub-classes.
2. Replace H2 exception  `org.h2.jdbc.JdbcSQLException` in `JDBCTableCatalogSuite` by `AnalysisException`.
3. Add `H2Dialect`

### Why are the changes needed?
Currently the JDBC v2 Table Catalog implementation throws dialect-specific exceptions and ignores the exceptions defined in the `TableCatalog` interface. This PR adds a new method for converting dialect-specific exceptions, and assumes that follow-up PRs will implement `classifyException()` in the concrete dialects.
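
As a library-agnostic sketch of the idea (the exception type and error handling below are illustrative assumptions, not the Spark API): a dialect-side hook inspects a vendor-specific exception and rethrows it as one well-known type callers can rely on.
```scala
import java.sql.SQLException

// Illustrative only: a single exception type used regardless of the dialect.
final class UnifiedCatalogException(message: String, cause: Throwable)
  extends RuntimeException(message, cause)

object ExampleDialect {
  // Mirrors the spirit of classifyException(): translate vendor errors into one type.
  def classifyException(message: String, e: Throwable): UnifiedCatalogException = e match {
    case sql: SQLException =>
      new UnifiedCatalogException(s"$message (SQLSTATE=${sql.getSQLState})", sql)
    case other =>
      new UnifiedCatalogException(message, other)
  }
}
```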

### Does this PR introduce _any_ user-facing change?
Yes.

### How was this patch tested?
By running existing test suites `JDBCTableCatalogSuite` and `JDBCV2Suite`.

Closes #29952 from MaxGekk/jdbcv2-classify-exception.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-10-08 05:28:33 +00:00
Terry Kim 1c781a4354 [SPARK-32282][SQL] Improve EnsureRquirement.reorderJoinKeys to handle more scenarios such as PartitioningCollection
### What changes were proposed in this pull request?

This PR proposes to improve `EnsureRequirements.reorderJoinKeys` to handle the following scenarios:
1. If the keys cannot be reordered to match the left-side `HashPartitioning`, consider the right-side `HashPartitioning`.
2. Handle `PartitioningCollection`, which may contain `HashPartitioning`

### Why are the changes needed?

1. For scenario 1), the current behavior matches either the left-side `HashPartitioning` or the right-side `HashPartitioning`. This means that if both sides are `HashPartitioning`, it will try to match only the left side.
The following will not consider the right-side `HashPartitioning`:
```
val df1 = (0 until 10).map(i => (i % 5, i % 13)).toDF("i1", "j1")
val df2 = (0 until 10).map(i => (i % 7, i % 11)).toDF("i2", "j2")
df1.write.format("parquet").bucketBy(4, "i1", "j1").saveAsTable("t1")df2.write.format("parquet").bucketBy(4, "i2", "j2").saveAsTable("t2")
val t1 = spark.table("t1")
val t2 = spark.table("t2")
val join = t1.join(t2, t1("i1") === t2("j2") && t1("i1") === t2("i2"))
 join.explain

== Physical Plan ==
*(5) SortMergeJoin [i1#26, i1#26], [j2#31, i2#30], Inner
:- *(2) Sort [i1#26 ASC NULLS FIRST, i1#26 ASC NULLS FIRST], false, 0
:  +- Exchange hashpartitioning(i1#26, i1#26, 4), true, [id=#69]
:     +- *(1) Project [i1#26, j1#27]
:        +- *(1) Filter isnotnull(i1#26)
:           +- *(1) ColumnarToRow
:              +- FileScan parquet default.t1[i1#26,j1#27] Batched: true, DataFilters: [isnotnull(i1#26)], Format: Parquet, Location: InMemoryFileIndex[..., PartitionFilters: [], PushedFilters: [IsNotNull(i1)], ReadSchema: struct<i1:int,j1:int>, SelectedBucketsCount: 4 out of 4
+- *(4) Sort [j2#31 ASC NULLS FIRST, i2#30 ASC NULLS FIRST], false, 0.
   +- Exchange hashpartitioning(j2#31, i2#30, 4), true, [id=#79].       <===== This can be removed
      +- *(3) Project [i2#30, j2#31]
         +- *(3) Filter (((j2#31 = i2#30) AND isnotnull(j2#31)) AND isnotnull(i2#30))
            +- *(3) ColumnarToRow
               +- FileScan parquet default.t2[i2#30,j2#31] Batched: true, DataFilters: [(j2#31 = i2#30), isnotnull(j2#31), isnotnull(i2#30)], Format: Parquet, Location: InMemoryFileIndex[..., PartitionFilters: [], PushedFilters: [IsNotNull(j2), IsNotNull(i2)], ReadSchema: struct<i2:int,j2:int>, SelectedBucketsCount: 4 out of 4

```

2. For scenario 2), the current behavior does not handle `PartitioningCollection`:
```
val df1 = (0 until 100).map(i => (i % 5, i % 13)).toDF("i1", "j1")
val df2 = (0 until 100).map(i => (i % 7, i % 11)).toDF("i2", "j2")
val df3 = (0 until 100).map(i => (i % 5, i % 13)).toDF("i3", "j3")
val join = df1.join(df2, df1("i1") === df2("i2") && df1("j1") === df2("j2")) // PartitioningCollection
val join2 = join.join(df3, join("j1") === df3("j3") && join("i1") === df3("i3"))
join2.explain

== Physical Plan ==
*(9) SortMergeJoin [j1#8, i1#7], [j3#30, i3#29], Inner
:- *(6) Sort [j1#8 ASC NULLS FIRST, i1#7 ASC NULLS FIRST], false, 0.       <===== This can be removed
:  +- Exchange hashpartitioning(j1#8, i1#7, 5), true, [id=#58]             <===== This can be removed
:     +- *(5) SortMergeJoin [i1#7, j1#8], [i2#18, j2#19], Inner
:        :- *(2) Sort [i1#7 ASC NULLS FIRST, j1#8 ASC NULLS FIRST], false, 0
:        :  +- Exchange hashpartitioning(i1#7, j1#8, 5), true, [id=#45]
:        :     +- *(1) Project [_1#2 AS i1#7, _2#3 AS j1#8]
:        :        +- *(1) LocalTableScan [_1#2, _2#3]
:        +- *(4) Sort [i2#18 ASC NULLS FIRST, j2#19 ASC NULLS FIRST], false, 0
:           +- Exchange hashpartitioning(i2#18, j2#19, 5), true, [id=#51]
:              +- *(3) Project [_1#13 AS i2#18, _2#14 AS j2#19]
:                 +- *(3) LocalTableScan [_1#13, _2#14]
+- *(8) Sort [j3#30 ASC NULLS FIRST, i3#29 ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(j3#30, i3#29, 5), true, [id=#64]
      +- *(7) Project [_1#24 AS i3#29, _2#25 AS j3#30]
         +- *(7) LocalTableScan [_1#24, _2#25]
```
### Does this PR introduce _any_ user-facing change?

Yes, now from the above examples, the shuffle/sort nodes pointed by `This can be removed` are now removed:
1. Scenario 1):
```
== Physical Plan ==
*(4) SortMergeJoin [i1#26, i1#26], [i2#30, j2#31], Inner
:- *(2) Sort [i1#26 ASC NULLS FIRST, i1#26 ASC NULLS FIRST], false, 0
:  +- Exchange hashpartitioning(i1#26, i1#26, 4), true, [id=#67]
:     +- *(1) Project [i1#26, j1#27]
:        +- *(1) Filter isnotnull(i1#26)
:           +- *(1) ColumnarToRow
:              +- FileScan parquet default.t1[i1#26,j1#27] Batched: true, DataFilters: [isnotnull(i1#26)], Format: Parquet, Location: InMemoryFileIndex[..., PartitionFilters: [], PushedFilters: [IsNotNull(i1)], ReadSchema: struct<i1:int,j1:int>, SelectedBucketsCount: 4 out of 4
+- *(3) Sort [i2#30 ASC NULLS FIRST, j2#31 ASC NULLS FIRST], false, 0
   +- *(3) Project [i2#30, j2#31]
      +- *(3) Filter (((j2#31 = i2#30) AND isnotnull(j2#31)) AND isnotnull(i2#30))
         +- *(3) ColumnarToRow
            +- FileScan parquet default.t2[i2#30,j2#31] Batched: true, DataFilters: [(j2#31 = i2#30), isnotnull(j2#31), isnotnull(i2#30)], Format: Parquet, Location: InMemoryFileIndex[..., PartitionFilters: [], PushedFilters: [IsNotNull(j2), IsNotNull(i2)], ReadSchema: struct<i2:int,j2:int>, SelectedBucketsCount: 4 out of 4
```
2. Scenario 2):
```
== Physical Plan ==
*(8) SortMergeJoin [i1#7, j1#8], [i3#29, j3#30], Inner
:- *(5) SortMergeJoin [i1#7, j1#8], [i2#18, j2#19], Inner
:  :- *(2) Sort [i1#7 ASC NULLS FIRST, j1#8 ASC NULLS FIRST], false, 0
:  :  +- Exchange hashpartitioning(i1#7, j1#8, 5), true, [id=#43]
:  :     +- *(1) Project [_1#2 AS i1#7, _2#3 AS j1#8]
:  :        +- *(1) LocalTableScan [_1#2, _2#3]
:  +- *(4) Sort [i2#18 ASC NULLS FIRST, j2#19 ASC NULLS FIRST], false, 0
:     +- Exchange hashpartitioning(i2#18, j2#19, 5), true, [id=#49]
:        +- *(3) Project [_1#13 AS i2#18, _2#14 AS j2#19]
:           +- *(3) LocalTableScan [_1#13, _2#14]
+- *(7) Sort [i3#29 ASC NULLS FIRST, j3#30 ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(i3#29, j3#30, 5), true, [id=#58]
      +- *(6) Project [_1#24 AS i3#29, _2#25 AS j3#30]
         +- *(6) LocalTableScan [_1#24, _2#25]
```

### How was this patch tested?

Added tests.

Closes #29074 from imback82/reorder_keys.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-10-08 04:58:41 +00:00
Karen Feng 39510b0e9b [SPARK-32793][SQL] Add raise_error function, adds error message parameter to assert_true
## What changes were proposed in this pull request?

Adds a SQL function `raise_error` which underlies the refactored `assert_true` function. `assert_true` now also (optionally) accepts a custom error message field.
`raise_error` is exposed in SQL, Python, Scala, and R.
`assert_true` was previously only exposed in SQL; it is now also exposed in Python, Scala, and R.
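
A short usage sketch (the predicates and messages are made up for illustration):
```scala
// assert_true returns NULL when the condition holds and fails with the custom
// message otherwise; raise_error always fails with the given message.
spark.sql("SELECT assert_true(1 < 2, 'expected 1 < 2')").show()  // passes, value is NULL
spark.sql("SELECT assert_true(2 < 1, 'expected 2 < 1')").show()  // fails with the custom message
spark.sql("SELECT raise_error('something went wrong')").show()   // always fails
```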

### Why are the changes needed?

Improves usability of `assert_true` by clarifying error messaging, and adds the useful helper function `raise_error`.

### Does this PR introduce _any_ user-facing change?

Yes:
- Adds `raise_error` function to the SQL, Python, Scala, and R APIs.
- Exposes the `assert_true` function in the Python, Scala, and R APIs (previously it was SQL-only).

### How was this patch tested?

Adds unit tests in SQL, Python, Scala, and R for `assert_true` and `raise_error`.

Closes #29947 from karenfeng/spark-32793.

Lead-authored-by: Karen Feng <karen.feng@databricks.com>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-10-08 12:05:39 +09:00
Max Gekk 23afc930ae [SPARK-26499][SQL][FOLLOWUP] Print the loading provider exception starting from the INFO level
### What changes were proposed in this pull request?
1. Don't print the exception in the error message while loading a built-in provider.
2. Print the exception starting from the INFO level.

At log levels less verbose than INFO, the output is:
```
17:48:32.342 ERROR org.apache.spark.sql.execution.datasources.jdbc.connection.ConnectionProvider: Failed to load built in provider.
```
and starting from the INFO level:
```
17:48:32.342 ERROR org.apache.spark.sql.execution.datasources.jdbc.connection.ConnectionProvider: Failed to load built in provider.
17:48:32.342 INFO org.apache.spark.sql.execution.datasources.jdbc.connection.ConnectionProvider: Loading of the provider failed with the exception:
java.util.ServiceConfigurationError: org.apache.spark.sql.jdbc.JdbcConnectionProvider: Provider org.apache.spark.sql.execution.datasources.jdbc.connection.IntentionallyFaultyConnectionProvider could not be instantiated
	at java.util.ServiceLoader.fail(ServiceLoader.java:232)
	at java.util.ServiceLoader.access$100(ServiceLoader.java:185)
	at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:384)
	at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:404)
	at java.util.ServiceLoader$1.next(ServiceLoader.java:480)
	at org.apache.spark.sql.execution.datasources.jdbc.connection.ConnectionProvider$.loadProviders(ConnectionProvider.scala:41)
```

### Why are the changes needed?
To avoid "noise" in logs while running tests. Currently, logs are blown up:
```
org.apache.spark.sql.execution.datasources.jdbc.connection.ConnectionProvider: Loading of the provider failed with the exception:
java.util.ServiceConfigurationError: org.apache.spark.sql.jdbc.JdbcConnectionProvider: Provider org.apache.spark.sql.execution.datasources.jdbc.connection.IntentionallyFaultyConnectionProvider could not be instantiated
	at java.util.ServiceLoader.fail(ServiceLoader.java:232)
	at java.util.ServiceLoader.access$100(ServiceLoader.java:185)
	at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:384)
	at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:404)
	at java.util.ServiceLoader$1.next(ServiceLoader.java:480)
	at org.apache.spark.sql.execution.datasources.jdbc.connection.ConnectionProvider$.loadProviders(ConnectionProvider.scala:41)
...
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.IllegalArgumentException: Intentional Exception
	at org.apache.spark.sql.execution.datasources.jdbc.connection.IntentionallyFaultyConnectionProvider.<init>(IntentionallyFaultyConnectionProvider.scala:26)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
	at java.lang.Class.newInstance(Class.java:442)
	at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:380)
```

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running:
```
$ build/sbt "sql/test:testOnly org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCTableCatalogSuite"
```

Closes #29968 from MaxGekk/gaborgsomogyi-SPARK-32001-followup.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-10-07 13:50:15 -07:00
Dongjoon Hyun a127387a53 [SPARK-33082][SQL] Remove hive-1.2 workaround code
### What changes were proposed in this pull request?

This PR removes old Hive-1.2 profile related workaround code.

### Why are the changes needed?

To simplify the code.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CI.

Closes #29961 from dongjoon-hyun/SPARK-HIVE12.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-10-07 12:27:23 -07:00
Terry Kim 7e99fcd64e [SPARK-33004][SQL] Migrate DESCRIBE column to use UnresolvedTableOrView to resolve the identifier
### What changes were proposed in this pull request?

This PR proposes to migrate `DESCRIBE tbl colname` to use `UnresolvedTableOrView` to resolve the table/view identifier. This allows consistent resolution rules (temp view first, etc.) to be applied for both v1/v2 commands. More info about the consistent resolution rule proposal can be found in [JIRA](https://issues.apache.org/jira/browse/SPARK-29900) or [proposal doc](https://docs.google.com/document/d/1hvLjGA8y_W_hhilpngXVub1Ebv8RsMap986nENCFnrg/edit?usp=sharing).

### Why are the changes needed?

The current behavior is not consistent between v1 and v2 commands when resolving a temp view.
In v2, the `t` in the following example is resolved to a table:
```scala
sql("CREATE TABLE testcat.ns.t (id bigint) USING foo")
sql("CREATE TEMPORARY VIEW t AS SELECT 2 as i")
sql("USE testcat.ns")
sql("DESCRIBE t i") // 't' is resolved to testcat.ns.t

Describing columns is not supported for v2 tables.;
org.apache.spark.sql.AnalysisException: Describing columns is not supported for v2 tables.;
```
whereas in v1, the `t` is resolved to a temp view:
```scala
sql("CREATE DATABASE test")
sql("CREATE TABLE spark_catalog.test.t (id bigint) USING csv")
sql("CREATE TEMPORARY VIEW t AS SELECT 2 as i")
sql("USE spark_catalog.test")
sql("DESCRIBE t i").show // 't' is resolved to a temp view

+---------+----------+
|info_name|info_value|
+---------+----------+
| col_name|         i|
|data_type|       int|
|  comment|      NULL|
+---------+----------+
```

### Does this PR introduce _any_ user-facing change?

After this PR, `DESCRIBE t i` is resolved to a temp view `t` instead of `testcat.ns.t`.

### How was this patch tested?

Added a new test

Closes #29880 from imback82/describe_column_consistent.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-10-07 06:33:20 +00:00
Max Gekk aea78d2c8c [SPARK-33034][SQL] Support ALTER TABLE in JDBC v2 Table Catalog: add, update type and nullability of columns (Oracle dialect)
### What changes were proposed in this pull request?
1. Override the default SQL strings in the Oracle Dialect for:
    - ALTER TABLE ADD COLUMN
    - ALTER TABLE UPDATE COLUMN TYPE
    - ALTER TABLE UPDATE COLUMN NULLABILITY
2. Add new docker integration test suite `jdbc/v2/OracleIntegrationSuite.scala`

### Why are the changes needed?
In SPARK-24907, we implemented the JDBC v2 Table Catalog, but it doesn't support some `ALTER TABLE` statements at the moment. This PR adds support for Oracle-specific `ALTER TABLE`.
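
For illustration, the Oracle-flavoured DDL the dialect has to emit might look like the following; the table and column names are made up, and the statement text follows Oracle's documented syntax rather than being copied from this PR:
```scala
// Standard SQL uses ALTER COLUMN ... TYPE / DROP NOT NULL, while Oracle expects MODIFY.
val addColumn        = "ALTER TABLE employees ADD (bonus NUMBER(10))"
val updateColumnType = "ALTER TABLE employees MODIFY (bonus BINARY_DOUBLE)"
val dropNotNull      = "ALTER TABLE employees MODIFY (bonus NULL)"
val setNotNull       = "ALTER TABLE employees MODIFY (bonus NOT NULL)"
```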

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
By running new integration test suite:
```
$ ./build/sbt -Pdocker-integration-tests "test-only *.OracleIntegrationSuite"
```

Closes #29912 from MaxGekk/jdbcv2-oracle-alter-table.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-10-07 04:48:57 +00:00
Max Gekk 584f90c82e [SPARK-33067][SQL][TESTS][FOLLOWUP] Check error messages in JDBCTableCatalogSuite
### What changes were proposed in this pull request?
Get error messages from the expected exceptions, and check that they are reasonable.

### Why are the changes needed?
To improve tests by expecting particular error messages.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running `JDBCTableCatalogSuite`.

Closes #29957 from MaxGekk/jdbcv2-negative-tests-followup.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-10-07 09:29:30 +09:00
fqaiser94@gmail.com 2793347972 [SPARK-32511][SQL] Add dropFields method to Column class
### What changes were proposed in this pull request?

1. Refactored `WithFields` Expression to make it more extensible (now `UpdateFields`).
2. Added a new `dropFields` method to the `Column` class. This method should allow users to drop a `StructField` in a `StructType` column (with similar semantics to the `drop` method on `Dataset`).

### Why are the changes needed?

Often Spark users have to work with deeply nested data e.g. to fix a data quality issue with an existing `StructField`. To do this with the existing Spark APIs, users have to rebuild the entire struct column.

For example, let's say you have the following deeply nested data structure which has a data quality issue (`5` is missing):
```
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val data = spark.createDataFrame(sc.parallelize(
      Seq(Row(Row(Row(1, 2, 3), Row(Row(4, null, 6), Row(7, 8, 9), Row(10, 11, 12)), Row(13, 14, 15))))),
      StructType(Seq(
        StructField("a", StructType(Seq(
          StructField("a", StructType(Seq(
            StructField("a", IntegerType),
            StructField("b", IntegerType),
            StructField("c", IntegerType)))),
          StructField("b", StructType(Seq(
            StructField("a", StructType(Seq(
              StructField("a", IntegerType),
              StructField("b", IntegerType),
              StructField("c", IntegerType)))),
            StructField("b", StructType(Seq(
              StructField("a", IntegerType),
              StructField("b", IntegerType),
              StructField("c", IntegerType)))),
            StructField("c", StructType(Seq(
              StructField("a", IntegerType),
              StructField("b", IntegerType),
              StructField("c", IntegerType))))
          ))),
          StructField("c", StructType(Seq(
            StructField("a", IntegerType),
            StructField("b", IntegerType),
            StructField("c", IntegerType))))
        )))))).cache

data.show(false)
+---------------------------------+
|a                                |
+---------------------------------+
|[[1, 2, 3], [[4,, 6], [7, 8, 9]]]|
+---------------------------------+
```
Currently, to drop the missing value users would have to do something like this:
```
val result = data.withColumn("a",
  struct(
    $"a.a",
    struct(
      struct(
        $"a.b.a.a",
        $"a.b.a.c"
      ).as("a"),
      $"a.b.b",
      $"a.b.c"
    ).as("b"),
    $"a.c"
  ))

result.show(false)
+---------------------------------------------------------------+
|a                                                              |
+---------------------------------------------------------------+
|[[1, 2, 3], [[4, 6], [7, 8, 9], [10, 11, 12]], [13, 14, 15]]|
+---------------------------------------------------------------+
```
As you can see above, with the existing methods users must call the `struct` function and list all fields, including fields they don't want to change. This is not ideal as:
>this leads to complex, fragile code that cannot survive schema evolution.
[SPARK-16483](https://issues.apache.org/jira/browse/SPARK-16483)

In contrast, with the method added in this PR, a user could simply do something like this to get the same result:
```
val result = data.withColumn("a", 'a.dropFields("b.a.b"))
result.show(false)
+---------------------------------------------------------------+
|a                                                              |
+---------------------------------------------------------------+
|[[1, 2, 3], [[4, 6], [7, 8, 9], [10, 11, 12]], [13, 14, 15]]|
+---------------------------------------------------------------+

```

This is the second of maybe 3 methods that could be added to the `Column` class to make it easier to manipulate nested data.
Other methods under discussion in [SPARK-22231](https://issues.apache.org/jira/browse/SPARK-22231) include `withFieldRenamed`.
However, this should be added in a separate PR.

### Does this PR introduce _any_ user-facing change?

The documentation for the `Column.withField` method has changed to include an additional note about how to write optimized queries when adding multiple nested Columns directly.

### How was this patch tested?

New unit tests were added. Jenkins must pass them.

### Related JIRAs:
More discussion on this topic can be found here:
- https://issues.apache.org/jira/browse/SPARK-22231
- https://issues.apache.org/jira/browse/SPARK-16483

Closes #29795 from fqaiser94/SPARK-32511-dropFields-second-try.

Authored-by: fqaiser94@gmail.com <fqaiser94@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-10-06 08:53:30 +00:00
Max Gekk 9870cf9c08 [SPARK-33067][SQL][TESTS] Add negative checks to JDBC v2 Table Catalog tests
### What changes were proposed in this pull request?
Add checks for the cases when JDBC v2 Table Catalog commands fail.

### Why are the changes needed?
To improve test coverage.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running `JDBCTableCatalogSuite`.

Closes #29945 from MaxGekk/jdbcv2-negative-tests.

Lead-authored-by: Max Gekk <max.gekk@gmail.com>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-10-06 13:01:57 +09:00
Dongjoon Hyun 008a2ad1f8 [SPARK-20202][BUILD][SQL] Remove references to org.spark-project.hive (Hive 1.2.1)
### What changes were proposed in this pull request?

As of today,
- SPARK-30034 Apache Spark 3.0.0 switched its default Hive execution engine from Hive 1.2 to Hive 2.3. This removes the direct dependency on the forked Hive 1.2.1 in the Maven repository.
- SPARK-32981 Apache Spark 3.1.0(`master` branch) removed Hive 1.2 related artifacts from Apache Spark binary distributions.

This PR(SPARK-20202) aims to remove the following usage of unofficial Apache Hive fork completely from Apache Spark master for Apache Spark 3.1.0.
```
<hive.group>org.spark-project.hive</hive.group>
<hive.version>1.2.1.spark2</hive.version>
```

For the forked Hive 1.2.1.spark2 users, Apache Spark 2.4(LTS) and 3.0 (~ 2021.12) will provide it.

### Why are the changes needed?

- First, Apache Spark community should not use the unofficial forked release of another Apache project.
- Second, Apache Hive 1.2.1 was released on 2015-06-26 and the forked Hive `1.2.1.spark2` exposed many unfixable bugs in Apache Spark because the fork is not maintained at all. Apache Hive 2.3.0 was released on 2017-07-19 and has been used with fewer bugs compared with `1.2.1.spark2`. Many bugs still exist in the `hive-1.2` profile, and new Apache Spark unit tests have so far been added behind the `HiveUtils.isHive23` condition.

### Does this PR introduce _any_ user-facing change?

No. This is a dev-only change. PRBuilder will not accept `[test-hive1.2]` on master and `branch-3.1`.

### How was this patch tested?

1. SBT/Hadoop 3.2/Hive 2.3 (https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129366)
2. SBT/Hadoop 2.7/Hive 2.3 (https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129382)
3. SBT/Hadoop 3.2/Hive 1.2 (This has not been supported already due to Hive 1.2 doesn't work with Hadoop 3.2.)
4. SBT/Hadoop 2.7/Hive 1.2 (https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129383, This is rejected)

Closes #29936 from dongjoon-hyun/SPARK-REMOVE-HIVE1.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-10-05 15:29:56 -07:00
allisonwang-db 14aeab3b27 [SPARK-33038][SQL] Combine AQE initial and current plan string when two plans are the same
### What changes were proposed in this pull request?
This PR combines the current plan and the initial plan in the AQE query plan string when the two plans are the same. It also removes the `== Current Plan ==` and `== Initial Plan ==` headers:

Before
```scala
AdaptiveSparkPlan isFinalPlan=false
+- == Current Plan ==
   SortMergeJoin [key#13], [a#23], Inner
   :- Sort [key#13 ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(key#13, 5), true, [id=#94]
            ...
+- == Initial Plan ==
   SortMergeJoin [key#13], [a#23], Inner
   :- Sort [key#13 ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(key#13, 5), true, [id=#94]
            ...
```
After
```scala
AdaptiveSparkPlan isFinalPlan=false
+- SortMergeJoin [key#13], [a#23], Inner
   :- Sort [key#13 ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(key#13, 5), true, [id=#94]
            ...
```
For SQL `EXPLAIN` output:
Before
```scala
AdaptiveSparkPlan (8)
+- == Current Plan ==
   Sort (7)
   +- Exchange (6)
      ...
+- == Initial Plan ==
   Sort (7)
   +- Exchange (6)
      ...
```
After
```scala
AdaptiveSparkPlan (8)
+- Sort (7)
   +- Exchange (6)
      ...
```

### Why are the changes needed?
To simplify the AQE plan string by removing the redundant plan information.

### Does this PR introduce _any_ user-facing change?
Yes.

### How was this patch tested?
Modified the existing unit test.

Closes #29915 from allisonwang-db/aqe-explain.

Authored-by: allisonwang-db <66282705+allisonwang-db@users.noreply.github.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
2020-10-05 09:30:27 -07:00
Yuming Wang 9996e252ad [SPARK-33026][SQL] Add numRows to metric of BroadcastExchangeExec
### What changes were proposed in this pull request?

This pr adds `numRows` to the metric and runtimeStatistics of `BroadcastExchangeExec`.

### Why are the changes needed?

[`JoinEstimation.estimateInnerOuterJoin`](d6a68e0b67/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/JoinEstimation.scala (L55-L156)) needs the row count. [ShuffleExchangeExec](1c6dff7b5f/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/ShuffleExchangeExec.scala (L127)) already provides the row count, but `BroadcastExchangeExec` is missing it.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test.

Closes #29904 from wangyum/SPARK-33026.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-10-01 23:01:31 -07:00
Gabor Somogyi 991f7e81d4 [SPARK-32001][SQL] Create JDBC authentication provider developer API
### What changes were proposed in this pull request?
At the moment only the built-in JDBC connection providers can be used, but there is a need to support additional databases and use-cases. In this PR I'm proposing a new developer API named `JdbcConnectionProvider`. To show how an external JDBC connection provider can be implemented I've created an example [here](https://github.com/gaborgsomogyi/spark-jdbc-connection-provider).

The PR contains the following changes:
* Added connection provider developer API
* Made the JDBC connection providers' constructors no-arg, which is needed to load them with the service loader
* Connection providers are now loaded with the service loader
* Added tests to load providers independently
* Moved `SecurityConfigurationLock` into a central place because other areas will change global JVM security config
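
A small discovery-side sketch, assuming only what is stated above (no-arg constructors, the JDK service loader, and the `org.apache.spark.sql.jdbc.JdbcConnectionProvider` developer API); an external provider registered under `META-INF/services/org.apache.spark.sql.jdbc.JdbcConnectionProvider` shows up like this:
```scala
import java.util.ServiceLoader
import org.apache.spark.sql.jdbc.JdbcConnectionProvider

// List every provider visible on the classpath; each implementation needs a
// no-arg constructor because the ServiceLoader instantiates it reflectively.
val it = ServiceLoader.load(classOf[JdbcConnectionProvider]).iterator()
while (it.hasNext) {
  println(it.next().getClass.getName)
}
```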

### Why are the changes needed?
Currently there is no way to plug in custom authentication.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
* Existing + additional unit tests
* Docker integration tests
* Tested manually the newly created external JDBC connection provider

Closes #29024 from gaborgsomogyi/SPARK-32001.

Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-10-02 13:04:40 +09:00
Cheng Su d6f3138352 [SPARK-32859][SQL] Introduce physical rule to decide bucketing dynamically
### What changes were proposed in this pull request?

This PR adds support for deciding on bucketed table scans dynamically based on the actual query plan. Currently bucketing is enabled by default (`spark.sql.sources.bucketing.enabled`=true), so for all bucketed tables in the query plan we will use a bucketed table scan (all input files for a bucket are read by the same task). This has the drawback that when the bucketed table scan provides no benefit (no join/group-by/etc. in the query), it still restricts the number of tasks to the number of buckets and might hurt parallelism.

The feature is to add a physical plan rule right after `EnsureRequirements`:

The rule goes through the plan nodes. For all operators which have an "interesting partition" (i.e., require `ClusteredDistribution` or `HashClusteredDistribution`), check if the sub-plan for the operator has `Exchange` and a bucketed table scan (and only allow certain operators in the plan, i.e. `Scan/Filter/Project/Sort/PartialAgg/etc.`; see details in `DisableUnnecessaryBucketedScan.disableBucketWithInterestingPartition`). If yes, disable the bucketed table scan in the sub-plan. In addition, the bucketed table scan is disabled if there is an operator with an interesting partition along the sub-plan.

The reason the algorithm works is that if there's a shuffle between the bucketed table scan and the operator with an interesting partition, the bucketed table scan's partitioning will be destroyed by the shuffle operator in the middle, so the bucketed table scan is definitely not needed.

The idea of "interesting partition" is inspired from "interesting order" in "Access Path Selection in a Relational Database Management System"(http://www.inf.ed.ac.uk/teaching/courses/adbs/AccessPath.pdf), after discussion with cloud-fan .

### Why are the changes needed?

To avoid unnecessary bucketed scan in the query, and this is prerequisite for https://github.com/apache/spark/pull/29625 (decide bucketed sorted scan dynamically will be added later in that PR).

### Does this PR introduce _any_ user-facing change?

A new config `spark.sql.sources.bucketing.autoBucketedScan.enabled` is introduced, set to false by default (the rule is disabled by default as it can regress cached bucketed table queries; see the discussion in https://github.com/apache/spark/pull/29804#issuecomment-701151447). Users can opt in or out by enabling or disabling the config; as we found in production, some users rely on the assumption that the number of tasks equals the number of buckets when reading a bucketed table, in order to precisely control the number of tasks. This is a bad assumption, but it does happen on our side, so we leave a config here to allow them to opt out of the feature.
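
A usage sketch for opting in (the config name is the one introduced above; per the description it is disabled by default):
```scala
// Turn the rule on for the current session; set it back to "false" to opt out again.
spark.conf.set("spark.sql.sources.bucketing.autoBucketedScan.enabled", "true")
```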

### How was this patch tested?

Added unit tests in `DisableUnnecessaryBucketedScanSuite.scala`

Closes #29804 from c21/bucket-rule.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-10-02 09:01:15 +09:00
ulysses e62d24717e [SPARK-32585][SQL] Support scala enumeration in ScalaReflection
### What changes were proposed in this pull request?

Add code in `ScalaReflection` to support scala enumeration and make enumeration type as string type in Spark.

### Why are the changes needed?

We support Java enums but fail with Scala enumerations; it's better to keep the behavior consistent.

Here is an example.

```
package test

object TestEnum extends Enumeration {
  type TestEnum = Value
  val E1, E2, E3 = Value
}
import TestEnum._
case class TestClass(i: Int,  e: TestEnum) {
}

import test._
Seq(TestClass(1, TestEnum.E1)).toDS
```

Before this PR
```
Exception in thread "main" java.lang.UnsupportedOperationException: No Encoder found for test.TestEnum.TestEnum
- field (class: "scala.Enumeration.Value", name: "e")
- root class: "test.TestClass"
  at org.apache.spark.sql.catalyst.ScalaReflection$.$anonfun$serializerFor$1(ScalaReflection.scala:567)
  at scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:69)
  at org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects(ScalaReflection.scala:882)
  at org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects$(ScalaReflection.scala:881)
```

After this PR
`org.apache.spark.sql.Dataset[test.TestClass] = [i: int, e: string]`

### Does this PR introduce _any_ user-facing change?

Yes, users can create a Dataset from a case class that includes a Scala enumeration field.

### How was this patch tested?

Add test.

Closes #29403 from ulysses-you/SPARK-32585.

Authored-by: ulysses <youxiduo@weidian.com>
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
2020-10-01 15:58:01 -04:00
Max Gekk 5651284c3b [SPARK-32992][SQL] Map Oracle's ROWID type to StringType in read via JDBC
### What changes were proposed in this pull request?
Convert the `ROWID` type in the Oracle JDBC dialect to Catalyst's `StringType`. The doc for Oracle 19c says explicitly that the type must be string: https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/Data-Types.html#GUID-AEF1FE4C-2DE5-4BE7-BB53-83AD8F1E34EF

### Why are the changes needed?
To avoid the exception shown in https://stackoverflow.com/questions/52244492/spark-jdbc-dataframereader-fails-to-read-oracle-table-with-datatype-as-rowid

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
N/A

Closes #29884 from MaxGekk/jdbc-oracle-rowid-string.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-10-01 14:50:32 +09:00
Takeshi Yamamuro 3a299aa648 [SPARK-32741][SQL] Check if the same ExprId refers to the unique attribute in logical plans
### What changes were proposed in this pull request?

Some plan transformations (e.g., `RemoveNoopOperators`) implicitly assume that the same `ExprId` refers to a unique attribute. However, `RuleExecutor` does not check this integrity between logical plan transformations, so this PR adds this check in `isPlanIntegral` of `Analyzer`/`Optimizer`.

This PR comes from the talk with cloud-fan viirya in https://github.com/apache/spark/pull/29485#discussion_r475346278

### Why are the changes needed?

For better logical plan integrity checking.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #29585 from maropu/PlanIntegrityTest.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-09-30 21:37:29 +09:00
tanel.kiis@gmail.com 90e86f6fac [SPARK-32970][SPARK-32019][SQL][TEST] Reduce the runtime of an UT for
### What changes were proposed in this pull request?

The UT for SPARK-32019 (#28853) tries to write about 16GB of data to disk. We must change the value of `spark.sql.files.maxPartitionBytes` to a smaller value to check the correct behavior with less data. By default it is `128MB`.
The other parameters in this UT are also changed to smaller values to keep the behavior the same.

### Why are the changes needed?

The runtime of this one UT can be over 7 minutes on Jenkins. After the change it is a few seconds.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing UT

Closes #29842 from tanelk/SPARK-32970.

Authored-by: tanel.kiis@gmail.com <tanel.kiis@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-09-29 16:51:44 +09:00
Max Gekk 1b60ff5afe [MINOR][DOCS] Document when current_date and current_timestamp are evaluated
### What changes were proposed in this pull request?
Explicitly document that `current_date` and `current_timestamp` are evaluated at the start of query evaluation, and that all calls of `current_date`/`current_timestamp` within the same query return the same value.

### Why are the changes needed?
Users could expect that `current_date` and `current_timestamp` return the current date/timestamp at the moment of query execution but in fact the functions are folded by the optimizer at the start of query evaluation:
0df8dd6073/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/finishAnalysis.scala (L71-L91)
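
A small illustration of the folding behaviour (the column names are arbitrary):
```scala
// Both calls are folded to the same instant at the start of query evaluation,
// so the equality filter keeps every row.
val n = spark.range(3)
  .selectExpr("current_timestamp() AS t1", "current_timestamp() AS t2")
  .where("t1 = t2")
  .count()   // 3
```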

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
by running `./dev/scalastyle`.

Closes #29892 from MaxGekk/doc-current_date.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-09-29 05:20:12 +00:00
gengjiaan a53fc9b7ae [SPARK-27951][SQL][FOLLOWUP] Improve the window function nth_value
### What changes were proposed in this pull request?
https://github.com/apache/spark/pull/29604 supports the ANSI SQL NTH_VALUE.
We should override the `prettyName` and `sql`.

### Why are the changes needed?
Make the name of nth_value correct.
To show the ignoreNulls parameter correctly.

### Does this PR introduce _any_ user-facing change?
'No'.

### How was this patch tested?
Jenkins test.

Closes #29886 from beliefer/improve-nth_value.

Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: beliefer <beliefer@163.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-09-29 09:54:43 +09:00
Yuanjian Li 9e6882feca [SPARK-32885][SS] Add DataStreamReader.table API
### What changes were proposed in this pull request?
This pr aims to add a new `table` API in DataStreamReader, which is similar to the table API in DataFrameReader.

### Why are the changes needed?
Users can directly use this API to get a Streaming DataFrame on a table. Below is a simple example:

Application 1 for initializing and starting the streaming job:

```
val path = "/home/yuanjian.li/runtime/to_be_deleted"
val tblName = "my_table"

// Write some data to `my_table`
spark.range(3).write.format("parquet").option("path", path).saveAsTable(tblName)

// Read the table as a streaming source, write result to destination directory
val table = spark.readStream.table(tblName)
table.writeStream.format("parquet").option("checkpointLocation", "/home/yuanjian.li/runtime/to_be_deleted_ck").start("/home/yuanjian.li/runtime/to_be_deleted_2")
```

Application 2 for appending new data:

```
// Append new data into the path
spark.range(5).write.format("parquet").option("path", "/home/yuanjian.li/runtime/to_be_deleted").mode("append").save()
```

Check result:
```
// The destination directory should contain all written data
spark.read.parquet("/home/yuanjian.li/runtime/to_be_deleted_2").show()
```

### Does this PR introduce _any_ user-facing change?
Yes, a new API added.

### How was this patch tested?
New UT added and integrated testing.

Closes #29756 from xuanyuanking/SPARK-32885.

Authored-by: Yuanjian Li <yuanjian.li@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-09-25 06:50:24 +00:00
Terry Kim e9c98c910a [SPARK-32990][SQL] Migrate REFRESH TABLE to use UnresolvedTableOrView to resolve the identifier
### What changes were proposed in this pull request?

This PR proposes to migrate `REFRESH TABLE` to use `UnresolvedTableOrView` to resolve the table/view identifier. This allows consistent resolution rules (temp view first, etc.) to be applied for both v1/v2 commands. More info about the consistent resolution rule proposal can be found in [JIRA](https://issues.apache.org/jira/browse/SPARK-29900) or [proposal doc](https://docs.google.com/document/d/1hvLjGA8y_W_hhilpngXVub1Ebv8RsMap986nENCFnrg/edit?usp=sharing).

### Why are the changes needed?

The current behavior is not consistent between v1 and v2 commands when resolving a temp view.
In v2, the `t` in the following example is resolved to a table:
```scala
sql("CREATE TABLE testcat.ns.t (id bigint) USING foo")
sql("CREATE TEMPORARY VIEW t AS SELECT 2")
sql("USE testcat.ns")
sql("REFRESH TABLE t") // 't' is resolved to testcat.ns.t
```
whereas in v1, the `t` is resolved to a temp view:
```scala
sql("CREATE DATABASE test")
sql("CREATE TABLE spark_catalog.test.t (id bigint) USING csv")
sql("CREATE TEMPORARY VIEW t AS SELECT 2")
sql("USE spark_catalog.test")
sql("REFRESH TABLE t") // 't' is resolved to a temp view
```

### Does this PR introduce _any_ user-facing change?

After this PR, `REFRESH TABLE t` is resolved to a temp view `t` instead of `testcat.ns.t`.

### How was this patch tested?

Added a new test

Closes #29866 from imback82/refresh_table_consistent.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-09-25 04:29:09 +00:00
Chao Sun 8ccfbc114e [SPARK-32381][CORE][SQL] Move and refactor parallel listing & non-location sensitive listing to core
### What changes were proposed in this pull request?

This moves and refactors the parallel listing utilities from `InMemoryFileIndex` to Spark core so they can be reused by modules besides SQL. In the process, this also did some cleanups/refactorings:

- Created a `HadoopFSUtils` class under core
- Moved `InMemoryFileIndex.bulkListLeafFiles` into `HadoopFSUtils.parallelListLeafFiles`. It now depends on a `SparkContext` instead of `SparkSession` in SQL. Also added a few parameters which used to be read from `SparkSession.conf`: `ignoreMissingFiles`, `ignoreLocality`, `parallelismThreshold`, `parallelismMax ` and `filterFun` (for additional filtering support but we may be able to merge this with `filter` parameter in future).
- Moved `InMemoryFileIndex.listLeafFiles` into `HadoopFSUtils.listLeafFiles` with similar changes above.

### Why are the changes needed?

Currently the locality-aware parallel listing mechanism only applies to `InMemoryFileIndex`. By moving this to core, we can potentially reuse the same mechanism for other code paths as well.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Since this is mostly a refactoring, it relies on existing unit tests such as those for `InMemoryFileIndex`.

Closes #29471 from sunchao/SPARK-32381.

Lead-authored-by: Chao Sun <sunchao@apache.org>
Co-authored-by: Holden Karau <hkarau@apple.com>
Co-authored-by: Chao Sun <sunchao@uber.com>
Signed-off-by: Holden Karau <hkarau@apple.com>
2020-09-24 10:58:52 -07:00
Russell Spitzer b3f0087e39 [SPARK-32977][SQL][DOCS] Fix JavaDoc on Default Save Mode
### What changes were proposed in this pull request?

The default is always `ErrorIfExists`, regardless of the DataSource version. Fixing the JavaDoc to reflect this.
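
A quick sketch of what that default means in practice (the output path below is just a placeholder):
```scala
// With no explicit mode, the write uses SaveMode.ErrorIfExists for both DSv1 and DSv2.
val df = spark.range(10).toDF("id")
df.write.parquet("/tmp/save-mode-example")                    // first write succeeds
// df.write.parquet("/tmp/save-mode-example")                 // second write fails: path exists
df.write.mode("overwrite").parquet("/tmp/save-mode-example")  // explicit mode to replace
```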

### Why are the changes needed?

To fix documentation

### Does this PR introduce _any_ user-facing change?

Doc change.

### How was this patch tested?

Manual.

Closes #29853 from RussellSpitzer/SPARK-32977.

Authored-by: Russell Spitzer <russell.spitzer@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-09-23 20:02:20 -07:00
Michael Munday faeb71b39d [SPARK-32950][SQL] Remove unnecessary big-endian code paths
### What changes were proposed in this pull request?
Remove unnecessary code.

### Why are the changes needed?

General housekeeping. Might be a slight performance improvement, especially on big-endian systems.

There is no need for separate code paths for big- and little-endian
platforms in putDoubles and putFloats anymore (since PR #24861). On
all platforms values are encoded in native byte order and can just
be copied directly.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing tests.

Closes #29815 from mundaym/clean-putfloats.

Authored-by: Michael Munday <mike.munday@ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-09-23 12:38:06 -05:00
Terry Kim 21b7479797 [SPARK-32959][SQL][TEST] Fix an invalid test in DataSourceV2SQLSuite
### What changes were proposed in this pull request?

This PR addresses two issues related to the `Relation: view text` test in `DataSourceV2SQLSuite`.

1. The test has the following block:
```scala
withView("view1") { v1: String =>
  sql(...)
}
```
Since `withView`'s signature is `withView(v: String*)(f: => Unit): Unit`, the `f` that gets evaluated is `v1: String => sql(...)`, which merely defines the anonymous function and does _not_ execute it (see the sketch after this list).

2. Once the test is fixed to run, it actually fails. The reason is that the v2 session catalog implementation used in tests does not correctly handle `V1Table` for views in `loadTable`. This results in views being resolved to `ResolvedTable` instead of `ResolvedView`, causing the test failure: f1dc479d39/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala (L1007-L1011)
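The pitfall in item 1 can be reproduced outside Spark. Below is a minimal, self-contained sketch where `withView` is a hypothetical stand-in for the real test helper; it shows that a function literal passed to a by-name `Unit` parameter is only constructed, never invoked.

```scala
// Minimal sketch (hypothetical stand-in for the real test helper): a by-name Unit
// parameter just evaluates the expression it is given; a function literal evaluates
// to a Function1 value, so nothing ever calls it.
object ByNamePitfall {
  def withView(v: String*)(f: => Unit): Unit = {
    f // evaluates the block; a lambda here is merely constructed and discarded
    println(s"dropped view(s): ${v.mkString(", ")}")
  }

  def main(args: Array[String]): Unit = {
    withView("view1") { v1: String =>   // builds a String => Unit, never invoked
      println(s"never printed: $v1")
    }
    withView("view1") {                 // correct usage: the block body actually runs
      println("this runs")
    }
  }
}
```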

### Why are the changes needed?

Fixing a bug in test.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing test.

Closes #29811 from imback82/fix_minor_test.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-09-23 05:49:45 +00:00
tanel.kiis@gmail.com acfee3c8b1 [SPARK-32870][DOCS][SQL] Make sure that all expressions have their ExpressionDescription filled
### What changes were proposed in this pull request?

Made sure that all the expressions in the `FunctionRegistry` have the fields `usage`, `examples` and `since` filled in their `ExpressionDescription`. Added a UT to `ExpressionInfoSuite` to make sure that all new expressions will also fill those fields.

### Why are the changes needed?

Documentation improvement

### Does this PR introduce _any_ user-facing change?

Better generated SQL built in functions documentation

### How was this patch tested?

Checked the fix versions in the following JIRAs:
SPARK-1251 - UnaryMinus, Add, Subtract, Multiply, Divide, Remainder, Explode, Not, In, And, Or, Equals, LessThan, LessThanOrEqual, GreaterThan, GreaterThanOrEqual, If, Cast
SPARK-2053 - CaseWhen
SPARK-2665 - EqualNullSafe
SPARK-3176 - Abs
SPARK-6542 - CreateStruct
SPARK-7135 - MonotonicallyIncreasingID
SPARK-7152 - SparkPartitionID
SPARK-7295 - bitwiseAND, bitwiseOR, bitwiseXOR, bitwiseNOT
SPARK-8005 - InputFileName
SPARK-8203 - Greatest
SPARK-8204 - Least
SPARK-8220 - UnaryPositive
SPARK-8221 - Pmod
SPARK-8230 - Size
SPARK-8231 - ArrayContains
SPARK-8232 - SortArray
SPARK-8234 - md5
SPARK-8235 - sha1
SPARK-8236 - crc32
SPARK-8237 - sha2
SPARK-8240 - Concat
SPARK-8246 - GetJsonObject
SPARK-8407 - CreateNamedStruct
SPARK-9617 - JsonTuple
SPARK-10810 - CurrentDatabase
SPARK-12480 - Murmur3Hash
SPARK-14061 - CreateMap
SPARK-14160 - TimeWindow
SPARK-14580 - AssertTrue
SPARK-16274 - XPathBoolean
SPARK-16278 - MapKeys
SPARK-16279 - MapValues
SPARK-16284 - CallMethodViaReflection
SPARK-16286 - Stack
SPARK-16288 - Inline
SPARK-16289 - PosExplode
SPARK-16318 - XPathShort, XPathInt, XPathLong, XPathFloat, XPathDouble, XPathString, XPathList
SPARK-16730 - Cast aliases
SPARK-17495 - HiveHash
SPARK-18702 - InputFileBlockStart, InputFileBlockLength
SPARK-20910 - UUID

Closes #29743 from tanelk/SPARK-32870.

Authored-by: tanel.kiis@gmail.com <tanel.kiis@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-09-23 10:18:38 +09:00
Max Gekk b53da23a28 [MINOR][SQL] Improve examples for percentile_approx()
### What changes were proposed in this pull request?
In the PR, I propose to replace the current examples for `percentile_approx()`, which use **only one** input value, with an example that uses **multiple values** in the input column.

### Why are the changes needed?
The current examples are pretty trivial and don't demonstrate the function's behaviour on a sequence of values.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
- by running `ExpressionInfoSuite`
- `./dev/scalastyle`

Closes #29841 from MaxGekk/example-percentile_approx.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-09-23 09:41:38 +09:00
Max Gekk 7c14f177eb [SPARK-32306][SQL][DOCS] Clarify the result of percentile_approx()
### What changes were proposed in this pull request?
A more precise description of the result of the `percentile_approx()` function and its synonym `approx_percentile()`. The proposed sentence clarifies that the function returns **one of the elements** (or an array of elements) from the input column.

### Why are the changes needed?
To improve Spark docs and avoid misunderstanding of the function behavior.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
`./dev/scalastyle`

Closes #29835 from MaxGekk/doc-percentile_approx.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2020-09-22 12:45:19 -07:00
Wenchen Fan fba5736c50 [SPARK-32757][SQL][FOLLOWUP] Preserve the attribute name as possible as we scan in SubqueryBroadcastExec
### What changes were proposed in this pull request?

This is a minor followup of https://github.com/apache/spark/pull/29601, to preserve the attribute name in `SubqueryBroadcastExec.output`.

### Why are the changes needed?

During explain, it's better to see the original column name instead of always "key".

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

existing tests.

Closes #29839 from cloud-fan/followup2.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-09-22 11:05:35 -07:00
Wenchen Fan 6145621495 [SPARK-32659][SQL][FOLLOWUP] Broadcast Array instead of Set in InSubqueryExec
### What changes were proposed in this pull request?

This is a followup of https://github.com/apache/spark/pull/29475.

This PR updates the code to broadcast the Array instead of the Set, which was the behavior before #29475.

### Why are the changes needed?

The size of a Set can be much bigger than that of an Array. It's safer to keep the behavior the same as before and build the set on the executor side.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

existing tests

Closes #29838 from cloud-fan/followup.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-09-22 08:49:58 -07:00
Peter Toth f03c03576a [SPARK-32951][SQL] Foldable propagation from Aggregate
### What changes were proposed in this pull request?
This PR adds foldable propagation from `Aggregate` as per: https://github.com/apache/spark/pull/29771#discussion_r490412031

### Why are the changes needed?
This is an improvement as `Aggregate`'s `aggregateExpressions` can contain foldables that can be propagated up.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
New UT.

Closes #29816 from peter-toth/SPARK-32951-foldable-propagation-from-aggregate.

Authored-by: Peter Toth <peter.toth@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-09-21 21:43:17 -07:00
zero323 7fb9f6884f [SPARK-32799][R][SQL] Add allowMissingColumns to SparkR unionByName
### What changes were proposed in this pull request?

Add optional `allowMissingColumns` argument to SparkR `unionByName`.

### Why are the changes needed?

Feature parity.

### Does this PR introduce _any_ user-facing change?

`unionByName` supports `allowMissingColumns`.

### How was this patch tested?

Existing unit tests. New unit tests targeting this feature.

Closes #29813 from zero323/SPARK-32799.

Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-09-21 09:39:34 +09:00
yangjie01 2128c4f14b [SPARK-32808][SQL] Pass all test of sql/core module in Scala 2.13
### What changes were proposed in this pull request?

After https://github.com/apache/spark/pull/29660 and https://github.com/apache/spark/pull/29689 there are 13 remaining failed cases in the sql/core module with Scala 2.13.

The reason for the remaining failed cases is that the optimization result of `CostBasedJoinReorder` may differ for the same input between Scala 2.12 and Scala 2.13 when there is more than one candidate plan with the same cost.

This PR provides a way to make the optimization result as deterministic as possible so that all remaining failed cases of the `sql/core` module pass in Scala 2.13. The main changes of this PR are as follows:

- Changed to use `LinkedHashMap` instead of `Map` to store `foundPlans` in the `JoinReorderDP.search` method, to ensure the same iteration order for the same insertion order, because the iteration order of `Map` behaves differently under Scala 2.12 and 2.13 (see the sketch after this list)

- Fixed `StarJoinCostBasedReorderSuite`, which was affected by the above change

- Regenerated golden files affected by the above change.
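A minimal sketch (plain Scala, not the actual `JoinReorderDP` code) of the determinism argument: an insertion-ordered `LinkedHashMap` always iterates in a fixed order, while a plain hash-based `Map` gives no such guarantee, and the order it happens to produce can differ between Scala 2.12 and 2.13.

```scala
import scala.collection.mutable

// Illustration only: iteration order of a hash-based Map is unspecified and can differ
// between Scala versions, while LinkedHashMap always iterates in insertion order.
object IterationOrder {
  def main(args: Array[String]): Unit = {
    val entries = Seq("planA" -> 1, "planB" -> 2, "planC" -> 3, "planD" -> 4, "planE" -> 5)

    val hashed = Map(entries: _*)                    // order not guaranteed
    val linked = mutable.LinkedHashMap(entries: _*)  // insertion order preserved

    println(hashed.keys.mkString(", "))  // may print in any order, version-dependent
    println(linked.keys.mkString(", "))  // always: planA, planB, planC, planD, planE
  }
}
```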

### Why are the changes needed?
We need to support a Scala 2.13 build.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

- Scala 2.12: Pass the Jenkins or GitHub Action

- Scala 2.13: All tests passed.

Do the following:

```
dev/change-scala-version.sh 2.13
mvn clean install -DskipTests  -pl sql/core -Pscala-2.13 -am
mvn test -pl sql/core -Pscala-2.13
```

**Before**
```
Tests: succeeded 8485, failed 13, canceled 1, ignored 52, pending 0
*** 13 TESTS FAILED ***

```

**After**

```
Tests: succeeded 8498, failed 0, canceled 1, ignored 52, pending 0
All tests passed.
```

Closes #29711 from LuciferYang/SPARK-32808-3.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-09-18 10:38:30 -05:00
gengjiaan 8b09536cdf [SPARK-27951][SQL] Support ANSI SQL NTH_VALUE window function
### What changes were proposed in this pull request?
The `NTH_VALUE` function is part of ANSI SQL.
For example:
```
CREATE TEMPORARY TABLE empsalary (
    depname varchar,
    empno bigint,
    salary int,
    enroll_date date
);

INSERT INTO empsalary VALUES
('develop', 10, 5200, '2007-08-01'),
('sales', 1, 5000, '2006-10-01'),
('personnel', 5, 3500, '2007-12-10'),
('sales', 4, 4800, '2007-08-08'),
('personnel', 2, 3900, '2006-12-23'),
('develop', 7, 4200, '2008-01-01'),
('develop', 9, 4500, '2008-01-01'),
('sales', 3, 4800, '2007-08-01'),
('develop', 8, 6000, '2006-10-01'),
('develop', 11, 5200, '2007-08-15');

select first_value(salary) over(order by salary range between 1000 preceding and 1000 following),
	lead(salary) over(order by salary range between 1000 preceding and 1000 following),
	nth_value(salary, 1) over(order by salary range between 1000 preceding and 1000 following),
	salary from empsalary;
 first_value | lead | nth_value | salary
-------------+------+-----------+--------
        3500 | 3900 |      3500 |   3500
        3500 | 4200 |      3500 |   3900
        3500 | 4500 |      3500 |   4200
        3500 | 4800 |      3500 |   4500
        3900 | 4800 |      3900 |   4800
        3900 | 5000 |      3900 |   4800
        4200 | 5200 |      4200 |   5000
        4200 | 5200 |      4200 |   5200
        4200 | 6000 |      4200 |   5200
        5000 |      |      5000 |   6000
(10 rows)
```

Several mainstream databases support this syntax (a Spark usage sketch follows this list).

**PostgreSQL:**
https://www.postgresql.org/docs/8.4/functions-window.html

**Vertica:**
https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/Functions/Analytic/NTH_VALUEAnalytic.htm?tocpath=SQL%20Reference%20Manual%7CSQL%20Functions%7CAnalytic%20Functions%7C_____23

**Oracle:**
https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/NTH_VALUE.html#GUID-F8A0E88C-67E5-4AA6-9515-95D03A7F9EA0

**Redshift**
https://docs.aws.amazon.com/redshift/latest/dg/r_WF_NTH.html

**Presto**
https://prestodb.io/docs/current/functions/window.html

**MySQL**
https://www.mysqltutorial.org/mysql-window-functions/mysql-nth_value-function/
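As a rough illustration, assuming the function becomes available as `NTH_VALUE` in Spark SQL after this change, a caller might exercise it as below; the data and column names are made up, mirroring the `empsalary` example above, and the explicit frame is only there to make the "second value per partition" intent obvious.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical usage sketch of NTH_VALUE from Spark SQL after this change.
object NthValueSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("nth_value").getOrCreate()
    import spark.implicits._

    Seq(("develop", 6000), ("develop", 5200), ("sales", 5000), ("sales", 4800))
      .toDF("depname", "salary")
      .createOrReplaceTempView("empsalary")

    spark.sql(
      """SELECT depname, salary,
        |       NTH_VALUE(salary, 2) OVER (
        |         PARTITION BY depname ORDER BY salary DESC
        |         ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) AS second_highest
        |FROM empsalary""".stripMargin).show()

    spark.stop()
  }
}
```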

### Why are the changes needed?
The `NTH_VALUE` function is part of ANSI SQL and is very useful.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing and new UTs.

Closes #29604 from beliefer/support-nth_value.

Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: beliefer <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-09-18 07:06:38 +00:00
Takeshi Yamamuro b49aaa33e1 [SPARK-32906][SQL] Struct field names should not change after normalizing floats
### What changes were proposed in this pull request?

This PR intends to fix a minor bug when normalizing floats for struct types:
```
scala> import org.apache.spark.sql.execution.aggregate.HashAggregateExec
scala> val df = Seq(Tuple1(Tuple1(-0.0d)), Tuple1(Tuple1(0.0d))).toDF("k")
scala> val agg = df.distinct()
scala> agg.explain()
== Physical Plan ==
*(2) HashAggregate(keys=[k#40], functions=[])
+- Exchange hashpartitioning(k#40, 200), true, [id=#62]
   +- *(1) HashAggregate(keys=[knownfloatingpointnormalized(if (isnull(k#40)) null else named_struct(col1, knownfloatingpointnormalized(normalizenanandzero(k#40._1)))) AS k#40], functions=[])
      +- *(1) LocalTableScan [k#40]

scala> val aggOutput = agg.queryExecution.sparkPlan.collect { case a: HashAggregateExec => a.output.head }
scala> aggOutput.foreach { attr => println(attr.prettyJson) }
### Final Aggregate ###
[ {
  "class" : "org.apache.spark.sql.catalyst.expressions.AttributeReference",
  "num-children" : 0,
  "name" : "k",
  "dataType" : {
    "type" : "struct",
    "fields" : [ {
      "name" : "_1",
                ^^^
      "type" : "double",
      "nullable" : false,
      "metadata" : { }
    } ]
  },
  "nullable" : true,
  "metadata" : { },
  "exprId" : {
    "product-class" : "org.apache.spark.sql.catalyst.expressions.ExprId",
    "id" : 40,
    "jvmId" : "a824e83f-933e-4b85-a1ff-577b5a0e2366"
  },
  "qualifier" : [ ]
} ]

### Partial Aggregate ###
[ {
  "class" : "org.apache.spark.sql.catalyst.expressions.AttributeReference",
  "num-children" : 0,
  "name" : "k",
  "dataType" : {
    "type" : "struct",
    "fields" : [ {
      "name" : "col1",
                ^^^^
      "type" : "double",
      "nullable" : true,
      "metadata" : { }
    } ]
  },
  "nullable" : true,
  "metadata" : { },
  "exprId" : {
    "product-class" : "org.apache.spark.sql.catalyst.expressions.ExprId",
    "id" : 40,
    "jvmId" : "a824e83f-933e-4b85-a1ff-577b5a0e2366"
  },
  "qualifier" : [ ]
} ]
```

### Why are the changes needed?

bugfix.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added tests.

Closes #29780 from maropu/FixBugInNormalizedFloatingNumbers.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2020-09-17 22:07:47 -07:00
Max Gekk 75dd86400c [SPARK-32908][SQL] Fix target error calculation in percentile_approx()
### What changes were proposed in this pull request?
1. Change the target error calculation according to the paper [Space-Efficient Online Computation of Quantile Summaries](http://infolab.stanford.edu/~datar/courses/cs361a/papers/quantiles.pdf). It says that the error is `e = max(gi, deltai)/2` (see page 59). There is also a clear explanation at [ε-approximate quantiles](http://www.mathcs.emory.edu/~cheung/Courses/584/Syllabus/08-Quantile/Greenwald.html#proofprop1).
2. Added a test to check different accuracies.
3. Added an input CSV file `percentile_approx-input.csv.bz2` to the resource folder `sql/catalyst/src/main/resources` for the test.

### Why are the changes needed?
To fix incorrect percentile calculation, see an example in SPARK-32908.

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
- By running existing tests in `QuantileSummariesSuite` and in `ApproximatePercentileQuerySuite`.
- Added new test `SPARK-32908: maximum target error in percentile_approx` to `ApproximatePercentileQuerySuite`.

Closes #29784 from MaxGekk/fix-percentile_approx-2.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-09-18 10:47:06 +09:00
Takeshi Yamamuro 68e0d5f296 [SPARK-32902][SQL] Logging plan changes for AQE
### What changes were proposed in this pull request?

Recently, we added code to log plan changes in the preparation phase in `QueryExecution` for execution (https://github.com/apache/spark/pull/29544). This PR intends to apply the same fix for logging plan changes in AQE.

### Why are the changes needed?

Easy debugging for AQE plans

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added unit tests.

Closes #29774 from maropu/PlanChangeLogForAQE.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-09-18 08:29:29 +09:00
Peter Toth 4ced58862c [SPARK-32635][SQL] Fix foldable propagation
### What changes were proposed in this pull request?
This PR rewrites the `FoldablePropagation` rule to replace attribute references in a node with foldables coming only from the node's children.

Before this PR, in the case of this example (with the setting `spark.sql.optimizer.excludedRules=org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation`):
```scala
val a = Seq("1").toDF("col1").withColumn("col2", lit("1"))
val b = Seq("2").toDF("col1").withColumn("col2", lit("2"))
val aub = a.union(b)
val c = aub.filter($"col1" === "2").cache()
val d = Seq("2").toDF( "col4")
val r = d.join(aub, $"col2" === $"col4").select("col4")
val l = c.select("col2")
val df = l.join(r, $"col2" === $"col4", "LeftOuter")
df.show()
```
foldable propagation happens incorrectly:
```
 Join LeftOuter, (col2#6 = col4#34)                                                              Join LeftOuter, (col2#6 = col4#34)
!:- Project [col2#6]                                                                             :- Project [1 AS col2#6]
 :  +- InMemoryRelation [col1#4, col2#6], StorageLevel(disk, memory, deserialized, 1 replicas)   :  +- InMemoryRelation [col1#4, col2#6], StorageLevel(disk, memory, deserialized, 1 replicas)
 :        +- Union                                                                               :        +- Union
 :           :- *(1) Project [value#1 AS col1#4, 1 AS col2#6]                                    :           :- *(1) Project [value#1 AS col1#4, 1 AS col2#6]
 :           :  +- *(1) Filter (isnotnull(value#1) AND (value#1 = 2))                            :           :  +- *(1) Filter (isnotnull(value#1) AND (value#1 = 2))
 :           :     +- *(1) LocalTableScan [value#1]                                              :           :     +- *(1) LocalTableScan [value#1]
 :           +- *(2) Project [value#10 AS col1#13, 2 AS col2#15]                                 :           +- *(2) Project [value#10 AS col1#13, 2 AS col2#15]
 :              +- *(2) Filter (isnotnull(value#10) AND (value#10 = 2))                          :              +- *(2) Filter (isnotnull(value#10) AND (value#10 = 2))
 :                 +- *(2) LocalTableScan [value#10]                                             :                 +- *(2) LocalTableScan [value#10]
 +- Project [col4#34]                                                                            +- Project [col4#34]
    +- Join Inner, (col2#6 = col4#34)                                                               +- Join Inner, (col2#6 = col4#34)
       :- Project [value#31 AS col4#34]                                                                :- Project [value#31 AS col4#34]
       :  +- LocalRelation [value#31]                                                                  :  +- LocalRelation [value#31]
       +- Project [col2#6]                                                                             +- Project [col2#6]
          +- Union false, false                                                                           +- Union false, false
             :- Project [1 AS col2#6]                                                                        :- Project [1 AS col2#6]
             :  +- LocalRelation [value#1]                                                                   :  +- LocalRelation [value#1]
             +- Project [2 AS col2#15]                                                                       +- Project [2 AS col2#15]
                +- LocalRelation [value#10]                                                                     +- LocalRelation [value#10]

```
and so the result is wrong:
```
+----+----+
|col2|col4|
+----+----+
|   1|null|
+----+----+
```

After this PR foldable propagation will not happen incorrectly and the result is correct:
```
+----+----+
|col2|col4|
+----+----+
|   2|   2|
+----+----+
```

### Why are the changes needed?
To fix a correctness issue.

### Does this PR introduce _any_ user-facing change?
Yes, fixes a correctness issue.

### How was this patch tested?
Existing and new UTs.

Closes #29771 from peter-toth/SPARK-32635-fix-foldable-propagation.

Authored-by: Peter Toth <peter.toth@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-09-18 08:17:23 +09:00
jzc ea3b979e95 [SPARK-32889][SQL] orc table column name supports special characters
### What changes were proposed in this pull request?
Make ORC table column names support special characters like `$`.

### Why are the changes needed?
Special characters like `$` are allowed in ORC table column names by Hive.
But an error is thrown when executing the command "CREATE TABLE tbl(`$` INT, b INT) USING orc" in Spark, which is not compatible with Hive (a sketch of the scenario after this change follows the stack trace).

```
Column name "$" contains invalid character(s). Please use alias to rename it.;Column name "$" contains invalid character(s). Please use alias to rename it.;org.apache.spark.sql.AnalysisException: Column name "$" contains invalid character(s). Please use alias to rename it.;
at org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$.checkFieldName(OrcFileFormat.scala:51)
at org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$.$anonfun$checkFieldNames$1(OrcFileFormat.scala:59)
at org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$.$anonfun$checkFieldNames$1$adapted(OrcFileFormat.scala:59)
at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:38)
```
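A rough sketch of the scenario after this change; the table name and data are illustrative, and it assumes the default session catalog with a local warehouse directory.

```scala
import org.apache.spark.sql.SparkSession

// Illustrative sketch: with this change, a column literally named "$" can be written
// to and read from an ORC table, matching Hive's behaviour.
object OrcSpecialCharColumn {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("orc-special-char").getOrCreate()

    spark.sql("CREATE TABLE tbl (`$` INT, b INT) USING orc")
    spark.sql("INSERT INTO tbl VALUES (1, 2)")
    spark.sql("SELECT `$`, b FROM tbl").show()

    spark.stop()
  }
}
```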

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Add unit test

Closes #29761 from jzc928/orcColSpecialChar.

Authored-by: jzc <jzc@jzcMacBookPro.local>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-09-17 14:50:47 -07:00
yangjie01 5817c584b8 [SPARK-32909][SQL] Pass all sql/hive-thriftserver module UTs in Scala 2.13
### What changes were proposed in this pull request?

This PR fixes failed and aborted cases in the sql/hive-thriftserver module in Scala 2.13. The main changes of this PR are as follows:

- Use `s.c.Seq` instead of `Seq` in `HiveResult` because the input type may be `mutable.ArraySeq`, but `Seq` represents `immutable.Seq` in Scala 2.13.

- Reset the classLoader after `HiveMetastoreLazyInitializationSuite` completes, because the context class loader is a `NonClosableMutableURLClassLoader` while `HiveMetastoreLazyInitializationSuite` runs, and it propagates to `HiveThriftServer2ListenerSuite` and triggers the following problems in Scala 2.13:

```
HiveThriftServer2ListenerSuite:
*** RUN ABORTED ***
  java.lang.LinkageError: loader constraint violation: loader (instance of net/bytebuddy/dynamic/loading/MultipleParentClassLoader) previously initiated loading for a different type with name "org/apache/hive/service/ServiceStateChangeListener"
  at org.mockito.codegen.HiveThriftServer2$MockitoMock$1850222569.<clinit>(Unknown Source)
  at sun.reflect.GeneratedSerializationConstructorAccessor530.newInstance(Unknown Source)
  at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
  at org.objenesis.instantiator.sun.SunReflectionFactoryInstantiator.newInstance(SunReflectionFactoryInstantiator.java:48)
  at org.objenesis.ObjenesisBase.newInstance(ObjenesisBase.java:73)
  at org.mockito.internal.creation.instance.ObjenesisInstantiator.newInstance(ObjenesisInstantiator.java:19)
  at org.mockito.internal.creation.bytebuddy.SubclassByteBuddyMockMaker.createMock(SubclassByteBuddyMockMaker.java:47)
  at org.mockito.internal.creation.bytebuddy.ByteBuddyMockMaker.createMock(ByteBuddyMockMaker.java:25)
  at org.mockito.internal.util.MockUtil.createMock(MockUtil.java:35)
  at org.mockito.internal.MockitoCore.mock(MockitoCore.java:63)
  ...
```

After this PR, `HiveThriftServer2Suites` and `HiveThriftServer2ListenerSuite` are fixed and all 461 tests pass.

### Why are the changes needed?
We need to support a Scala 2.13 build.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
- Scala 2.12: Pass the Jenkins or GitHub Action

- Scala 2.13: All tests passed.

Do the following:

```
dev/change-scala-version.sh 2.13
mvn clean install -DskipTests -pl sql/hive-thriftserver -am -Phive-thriftserver -Pscala-2.13
mvn test -pl sql/hive-thriftserver -Phive -Phive-thriftserver -Pscala-2.13
```

**Before**

```
HiveThriftServer2ListenerSuite:
*** RUN ABORTED ***
```

**After**

```
Tests: succeeded 461, failed 0, canceled 0, ignored 17, pending 0
All tests passed.
```

Closes #29783 from LuciferYang/sql-thriftserver-tests.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-09-17 14:35:01 -07:00
Chao Sun 482a79a5e3 [SPARK-24994][SQL][FOLLOW-UP] Handle foldable, timezone and cleanup
### What changes were proposed in this pull request?

This is a follow-up on #29565, and addresses a few issues in the last PR:
- style issue pointed out by [this comment](https://github.com/apache/spark/pull/29565#discussion_r487646749)
- skip optimization when `fromExp` is foldable (per [this comment](https://github.com/apache/spark/pull/29565#discussion_r487646973)) as there could be a more efficient rule to apply for this case.
- pass timezone info to the generated cast on the literal value
- a bunch of cleanups and test improvements

Originally I plan to handle this when implementing [SPARK-32858](https://issues.apache.org/jira/browse/SPARK-32858) but now think it's better to isolate these changes from that.

### Why are the changes needed?

To fix a few left over issues in the above PR.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Added a test for the foldable case. Otherwise relying on existing tests.

Closes #29775 from sunchao/SPARK-24994-followup.

Authored-by: Chao Sun <sunchao@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-09-17 07:50:39 -07:00
sychen 92b75dc260 [SPARK-32508][SQL] Disallow empty part col values in partition spec before static partition writing
### What changes were proposed in this pull request?
When writing to a static partition, check in advance whether any partition field value is empty.

### Why are the changes needed?
Currently, when writing to a static partition whose partition field value is empty, an error is reported only after all tasks have completed.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added a UT.

Closes #29316 from cxzl25/SPARK-32508.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-09-17 06:50:30 +00:00
Jungtaek Lim (HeartSaVioR) d936cb328d [SPARK-26425][SS] Add more constraint checks to avoid checkpoint corruption
### What changes were proposed in this pull request?

Credits to tdas who reported and described the fix to [SPARK-26425](https://issues.apache.org/jira/browse/SPARK-26425). I just followed the description of the issue.

This patch adds more checks on the commit log as well as the file streaming source so that multiple concurrent runs of a streaming query don't mess up the state of the query/checkpoint. This patch addresses two different spots which have slightly different issues:

1. FileStreamSource.fetchMaxOffset()

In structured streaming, we don't allow multiple streaming queries to run with the same checkpoint (including concurrent runs of the same query), so a query should fail if it fails to write the metadata of a specific batch ID because the same batch ID has been written by another run.

2. commit log

As described in the JIRA issue, an assertion is already applied to the `offsetLog` for the same reason.

8167714cab/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/MicroBatchExecution.scala (L394-L402)

This patch applied the same for commit log.

### Why are the changes needed?

This prevents the inconsistent behavior on streaming query and lets query fail instead.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

N/A, as the change is simple and obvious, and it's really hard to artificially reproduce the issue.

Closes #25965 from HeartSaVioR/SPARK-26425.

Lead-authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com>
Co-authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
2020-09-17 09:01:06 +09:00
Linhong Liu 40ef5c91ad [SPARK-32816][SQL] Fix analyzer bug when aggregating multiple distinct DECIMAL columns
### What changes were proposed in this pull request?
This PR fixes a conflict between `RewriteDistinctAggregates` and `DecimalAggregates`.
In some cases, `DecimalAggregates` will wrap the decimal column to `UnscaledValue` using
different rules for different aggregates.

This means, same distinct column with different aggregates will change to different distinct columns
after `DecimalAggregates`. For example:
`avg(distinct decimal_col), sum(distinct decimal_col)` may change to
`avg(distinct UnscaledValue(decimal_col)), sum(distinct decimal_col)`

We assume after `RewriteDistinctAggregates`, there will be at most one distinct column in aggregates,
but `DecimalAggregates` breaks this assumption. To fix this, we have to switch the order of these two
rules.

### Why are the changes needed?
bug fix

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
added test cases

Closes #29673 from linhongliu-db/SPARK-32816.

Authored-by: Linhong Liu <linhong.liu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-09-16 16:53:25 +00:00
Liang-Chi Hsieh 550c1c9cfb [SPARK-32888][DOCS] Add user document about header flag and RDD as path for reading CSV
### What changes were proposed in this pull request?

This proposes to enhance the user documentation of the API for loading a Dataset of strings storing CSV rows. If the header option is set to true, the API will remove all lines identical to the header (a minimal sketch follows).
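A minimal sketch of the documented behaviour, assuming made-up data:

```scala
import org.apache.spark.sql.SparkSession

// Sketch of the behaviour this doc change describes: when the input is a Dataset[String]
// of CSV rows and header is true, every line identical to the header is removed.
object CsvHeaderFromDataset {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("csv-header").getOrCreate()
    import spark.implicits._

    val csvLines = Seq("id,name", "1,alice", "id,name", "2,bob").toDS()
    // Both "id,name" lines disappear: the first is used as the header, the second
    // is removed because it matches the header.
    spark.read.option("header", "true").csv(csvLines).show()

    spark.stop()
  }
}
```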

### Why are the changes needed?

This behavior can confuse users. We should explicitly document it.

### Does this PR introduce _any_ user-facing change?

No. Only doc change.

### How was this patch tested?

Only doc change.

Closes #29765 from viirya/SPARK-32888.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-09-16 20:16:15 +09:00
allisonwang-db 2e3aa2f023 [SPARK-32861][SQL] GenerateExec should require column ordering
### What changes were proposed in this pull request?
This PR updates the `RemoveRedundantProjects` rule to make `GenerateExec` require column ordering.

### Why are the changes needed?
`GenerateExec` was originally considered a node that does not require column ordering. However, `GenerateExec` binds its input rows directly to its `requiredChildOutput` without using the child's output schema.
In `doExecute()`:
```scala
val proj = UnsafeProjection.create(output, output)
```
In `doConsume()`:
```scala
val values = if (requiredChildOutput.nonEmpty) {
  input
} else {
  Seq.empty
}
```
In this case, changing input column ordering will result in `GenerateExec` binding the wrong schema to the input columns. For example, if we do not require child columns to be ordered, the `requiredChildOutput` [a, b, c] will directly bind to the schema of the input columns [c, b, a], which is incorrect:
```
GenerateExec explode(array(a, b, c)), [a, b, c], false, [d]
  HashAggregate(keys=[a, b, c], functions=[], output=[c, b, a])
    ...
```

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Unit test

Closes #29734 from allisonwang-db/generator.

Authored-by: allisonwang-db <66282705+allisonwang-db@users.noreply.github.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-09-16 06:05:35 +00:00
HyukjinKwon b46c7302db [SPARK-32704][SQL][TESTS][FOLLOW-UP] Check any physical rule instead of a specific rule in the test
### What changes were proposed in this pull request?

This PR only checks whether any physical rule runs instead of checking for a specific rule. This is rather just a trivial fix to make the tests more robust.

In fact, I faced a test failure from an in-house fork that applies a different physical rule that makes `CollapseCodegenStages` ineffective.

### Why are the changes needed?

To make the test more robust against unrelated changes.

### Does this PR introduce _any_ user-facing change?

No, test-only

### How was this patch tested?

Manually tested. Jenkins tests should pass.

Closes #29766 from HyukjinKwon/SPARK-32704.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-09-16 12:06:57 +09:00
HyukjinKwon 108c4c8fdc [SPARK-32481][SQL][TESTS][FOLLOW-UP] Skip the test if trash directory cannot be created
### What changes were proposed in this pull request?

This PR skips the test if the trash directory cannot be created. It is possible that the trash directory cannot be created, for example, due to permissions, and then the test fails as below:

```
- SPARK-32481 Move data to trash on truncate table if enabled *** FAILED *** (154 milliseconds)
  fs.exists(trashPath) was false (DDLSuite.scala:3184)
  org.scalatest.exceptions.TestFailedException:
  at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:530)
  at org.scalatest.Assertions.newAssertionFailedException$(Assertions.scala:529)
  at org.scalatest.FunSuite.newAssertionFailedException(FunSuite.scala:1560)
  at org.scalatest.Assertions$AssertionsHelper.macroAssert(Assertions.scala:503)
```

### Why are the changes needed?

To make the tests pass independently.

### Does this PR introduce _any_ user-facing change?

No, test-only.

### How was this patch tested?

Manually tested.

Closes #29759 from HyukjinKwon/SPARK-32481.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-09-16 08:11:43 +09:00
ulysses 888b343587 [SPARK-32827][SQL] Add spark.sql.maxMetadataStringLength config
### What changes were proposed in this pull request?

Add a new config, `spark.sql.maxMetadataStringLength`. This config aims to limit the length of metadata values, e.g. file locations.

### Why are the changes needed?

Some metadata values have been abbreviated with `...` when I tried to add a test in `SQLQueryTestSuite`. We need to replace such values with `notIncludedMsg`. That abbreviation means we can't replace a value like the location by `className`, since the `className` itself has been abbreviated. (A usage sketch of the new config follows the example below.)

Here is a case:
```
CREATE table  explain_temp1 (key int, val int) USING PARQUET;

EXPLAIN EXTENDED SELECT sum(distinct val) FROM explain_temp1;

-- ignore parsed,analyzed,optimized
-- The output like
== Physical Plan ==
*HashAggregate(keys=[], functions=[sum(distinct cast(val#x as bigint)#xL)], output=[sum(DISTINCT val)#xL])
+- Exchange SinglePartition, true, [id=#x]
   +- *HashAggregate(keys=[], functions=[partial_sum(distinct cast(val#x as bigint)#xL)], output=[sum#xL])
      +- *HashAggregate(keys=[cast(val#x as bigint)#xL], functions=[], output=[cast(val#x as bigint)#xL])
         +- Exchange hashpartitioning(cast(val#x as bigint)#xL, 4), true, [id=#x]
            +- *HashAggregate(keys=[cast(val#x as bigint) AS cast(val#x as bigint)#xL], functions=[], output=[cast(val#x as bigint)#xL])
               +- *ColumnarToRow
                  +- FileScan parquet default.explain_temp1[val#x] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/home/runner/work/spark/spark/sql/core/spark-warehouse/org.apache.spark.sq...], PartitionFilters: ...
```
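A rough usage sketch of the new config; the value 500 is an arbitrary assumption, and setting it via `spark.conf.set` at runtime is also an assumption.

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: raise the limit so long metadata such as file locations are not
// abbreviated with "..." in EXPLAIN output.
object MaxMetadataStringLength {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("metadata-length").getOrCreate()

    spark.conf.set("spark.sql.maxMetadataStringLength", "500")  // assumed runtime-settable
    spark.sql("CREATE TABLE explain_temp1 (key INT, val INT) USING PARQUET")
    spark.sql("EXPLAIN EXTENDED SELECT sum(DISTINCT val) FROM explain_temp1").show(false)

    spark.stop()
  }
}
```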

### Does this PR introduce _any_ user-facing change?

No, a new config.

### How was this patch tested?

new test.

Closes #29688 from ulysses-you/SPARK-32827.

Authored-by: ulysses <youxiduo@weidian.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-09-15 14:11:30 +00:00
herman c8baab1a1f [SPARK-32879][SQL] Refactor SparkSession initial options
### What changes were proposed in this pull request?
This PR refactors the way we propagate the options from the `SparkSession.Builder` to the `SessionState`. This is currently done via a mutable map inside the SparkSession, and these settings are then applied **after** the session is created. This is a bit confusing when you expect something to be set when constructing the `SessionState`. This PR passes the options as a constructor parameter to the `SessionStateBuilder`, which will set the options when the configuration is created.

### Why are the changes needed?
It makes it easier to reason about the configurations set in a SessionState than before. We recently had an incident where someone was using `SparkSessionExtensions` to create a planner rule that relied on a conf to be set. While this is in itself probably incorrect usage, it still illustrated this somewhat funky behavior.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing tests.

Closes #29752 from hvanhovell/SPARK-32879.

Authored-by: herman <herman@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-09-15 06:24:54 +00:00
Dongjoon Hyun d8a0d85692 [SPARK-32884][TESTS] Mark TPCDSQuery*Suite as ExtendedSQLTest
### What changes were proposed in this pull request?

This PR aims to mark the following suite as `ExtendedSQLTest` to reduce GitHub Action test time.
- TPCDSQuerySuite
- TPCDSQueryANSISuite
- TPCDSQueryWithStatsSuite

### Why are the changes needed?

Currently, the longest GitHub Action task is `Build and test / Build modules: sql - other tests` with `1h 57m 10s`, while `Build and test / Build modules: sql - slow tests` takes `42m 20s`. With this PR, we can move the workload from the `other tests` task to the `slow tests` task and reduce the total waiting time by about 7 to 8 minutes.

### Does this PR introduce _any_ user-facing change?

No. This is a test-only change.

### How was this patch tested?

Pass the GitHub Action with the reduced running time.

Closes #29755 from dongjoon-hyun/SPARK-SLOWTEST.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-09-15 14:38:01 +09:00
Kousuke Saruta 4fac6d501a [SPARK-32871][BUILD] Append toMap to Map#filterKeys if the result of filter is concatenated with another Map for Scala 2.13
### What changes were proposed in this pull request?

This PR appends `toMap` to `Map` instances created with `filterKeys` if such a map is to be concatenated with another map.

### Why are the changes needed?

As of Scala 2.13, `Map#filterKeys` returns a `MapView`, not the original `Map` type.
This can cause a compile error like the one below; a minimal sketch of the fix follows the error.
```
/sql/DataFrameReader.scala:279: type mismatch;
[error]  found   : Iterable[(String, String)]
[error]  required: java.util.Map[String,String]
[error] Error occurred in an application involving default arguments.
[error]       val dsOptions = new CaseInsensitiveStringMap(finalOptions.asJava)
```
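A minimal sketch of the pattern being fixed; the option names are made up.

```scala
// Sketch only: in Scala 2.13, filterKeys returns a MapView, so appending .toMap restores
// a strict Map before concatenating it with another Map or passing it where a Map is needed.
object FilterKeysToMap {
  def main(args: Array[String]): Unit = {
    val extraOptions = Map("path" -> "/tmp/data", "header" -> "true")  // hypothetical options
    val withoutPath  = extraOptions.filterKeys(_ != "path").toMap      // .toMap is the fix
    val finalOptions = withoutPath ++ Map("inferSchema" -> "true")
    println(finalOptions)
  }
}
```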

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Compile passed with the following command.
`build/mvn -Pscala-2.13 -Phive -Phive-thriftserver -Pyarn -Pkubernetes -DskipTests test-compile`

Closes #29742 from sarutak/fix-filterKeys-issue.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-09-15 09:27:47 +09:00
Yuanjian Li 5e825482d7 [SPARK-32844][SQL] Make DataFrameReader.table take the specified options for datasource v1
### What changes were proposed in this pull request?
Make `DataFrameReader.table` take the specified options for datasource v1.
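A hypothetical usage sketch of what this enables; the table name and option below are illustrative.

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: with this change, options passed to DataFrameReader are also forwarded
// when reading a catalog table backed by a v1 data source.
object ReadTableWithOptions {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("table-options").getOrCreate()

    spark.range(10).write.saveAsTable("my_parquet_table")
    spark.read
      .option("mergeSchema", "true")   // hypothetical option; now reaches the v1 relation
      .table("my_parquet_table")
      .show()

    spark.stop()
  }
}
```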

### Why are the changes needed?
Keep the behavior of v1/v2 data sources consistent; the v2 fix was done in SPARK-32592.

### Does this PR introduce _any_ user-facing change?
Yes. `DataFrameReader.table` will take the specified options. Also, if the same key and value exist in both the specified options and the table properties, an exception will be thrown.

### How was this patch tested?
New UT added.

Closes #29712 from xuanyuanking/SPARK-32844.

Authored-by: Yuanjian Li <yuanjian.li@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-09-14 09:20:24 +00:00
Cheng Su 978f531010 [SPARK-32854][SS] Minor code and doc improvement for stream-stream join
### What changes were proposed in this pull request?

Several minor code and documentation improvements for stream-stream join. Specifically:

* Remove extending from `SparkPlan`, as extending from `BinaryExecNode` is enough.
* Return `left/right.outputPartitioning` for `Left/RightOuter` in `outputPartitioning`, as the `PartitioningCollection` wrapper is unnecessary (similar to batch joins `ShuffledHashJoinExec`, `SortMergeJoinExec`).
*  Avoid per-row check for join type (https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingSymmetricHashJoinExec.scala#L486-L492), by creating the method before the loop of reading rows (`generateFilteredJoinedRow` in `storeAndJoinWithOtherSide`). Similar optimization (i.e. create auxiliary method/variable per different join type before the iterator of input rows) has been done in batch join world (`SortMergeJoinExec`, `ShuffledHashJoinExec`).
* Minor fix for comment/indentation for better readability.

### Why are the changes needed?

Minor optimization to avoid unnecessary per-row work (this can probably be optimized away by the compiler, but we can do a better job and avoid it in the first place), plus other comment/indentation fixes for better code readability for future developers.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests in `StreamingJoinSuite.scala` as no new logic is introduced.

Closes #29724 from c21/streaming.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-09-14 08:49:51 +00:00
Kousuke Saruta b121f0d459 [SPARK-32873][BUILD] Fix code which causes error when build with sbt and Scala 2.13
### What changes were proposed in this pull request?

This PR fixes code which causes errors when building with sbt and Scala 2.13, like the following.
```
[error] [warn] /home/kou/work/oss/spark-scala-2.13/external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/KafkaRDD.scala:251: method with a single empty parameter list overrides method without any parameter list
[error] [warn]   override def hasNext(): Boolean = requestOffset < part.untilOffset
[error] [warn]
[error] [warn] /home/kou/work/oss/spark-scala-2.13/external/kafka-0-10/src/main/scala/org/apache/spark/streaming/kafka010/KafkaRDD.scala:294: method with a single empty parameter list overrides method without any parameter list
[error] [warn]   override def hasNext(): Boolean = okNext
```

More specifically, this PR fixes the following (a minimal sketch of the first two fixes follows the list):

* Methods which have an empty parameter list and override a method which has no parameter list.
```
override def hasNext(): Boolean = okNext
```

* Methods which have no parameter list and override a method which has an empty parameter list.
```
      override def next: (Int, Double) = {
```

* Infix operator expressions where the operator wraps onto the following line.
```
    3L * math.min(k, numFeatures) * math.min(k, numFeatures)
    3L * math.min(k, numFeatures) * math.min(k, numFeatures) +
    + math.max(math.max(k, numFeatures), 4L * math.min(k, numFeatures)
      math.max(math.max(k, numFeatures), 4L * math.min(k, numFeatures) *
    * math.min(k, numFeatures) + 4L * math.min(k, numFeatures))
```
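A minimal sketch of the first two kinds of fixes; the iterator below is illustrative, not Spark code.

```scala
// Illustration only: in Scala 2.13, an override must match the parameter lists of the
// method it overrides. Iterator.hasNext has no parameter list, while Iterator.next() has
// an empty one, so the overrides must be written accordingly.
class SingleElementIterator(value: Int) extends Iterator[Int] {
  private var consumed = false
  override def hasNext: Boolean = !consumed              // no "()": matches def hasNext: Boolean
  override def next(): Int = { consumed = true; value }  // "()": matches def next(): A
}

object SingleElementIterator {
  def main(args: Array[String]): Unit =
    new SingleElementIterator(42).foreach(println)
}
```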

### Why are the changes needed?

For building Spark with sbt and Scala 2.13.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

After this change and #29742 applied, compile passed with the following command.
```
build/sbt -Pscala-2.13  -Phive -Phive-thriftserver -Pyarn -Pkubernetes compile test:compile
```

Closes #29745 from sarutak/fix-code-for-sbt-and-spark-2.13.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-09-14 15:34:58 +09:00
Chao Sun a6d6ea3efe [SPARK-32802][SQL] Avoid using SpecificInternalRow in RunLengthEncoding#Encoder
### What changes were proposed in this pull request?

Currently `RunLengthEncoding#Encoder` uses `SpecificInternalRow` as a holder for the current value when calculating compression stats and doing the actual compression. It calls `ColumnType.copyField` and `ColumnType.getField` on the internal row, which incurs extra cost compared to directly operating on the internal type. This proposes to replace the `SpecificInternalRow` with `T#InternalType` to avoid the extra cost.

### Why are the changes needed?

Operating on `SpecificInternalRow` carries a certain cost and negatively impacts performance when using `RunLengthEncoding` for compression.

With the change I see some improvements through `CompressionSchemeBenchmark`:

```diff
 Intel(R) Core(TM) i9-9880H CPU  2.30GHz
 BOOLEAN Encode:                           Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------------------------------
-PassThrough(1.000)                                    1              1           0      51957.0           0.0       1.0X
-RunLengthEncoding(2.502)                            549            555           9        122.2           8.2       0.0X
-BooleanBitSet(0.125)                                296            301           3        226.6           4.4       0.0X
+PassThrough(1.000)                                    2              2           0      42985.4           0.0       1.0X
+RunLengthEncoding(2.517)                            487            500          10        137.7           7.3       0.0X
+BooleanBitSet(0.125)                                348            353           4        192.8           5.2       0.0X

 OpenJDK 64-Bit Server VM 11.0.8+10-LTS on Mac OS X 10.15.5
 Intel(R) Core(TM) i9-9880H CPU  2.30GHz
 SHORT Encode (Lower Skew):                Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------------------------------
-PassThrough(1.000)                                    3              3           0      22779.9           0.0       1.0X
-RunLengthEncoding(1.520)                           1186           1192           9         56.6          17.7       0.0X
+PassThrough(1.000)                                    3              4           0      21216.6           0.0       1.0X
+RunLengthEncoding(1.493)                            882            931          50         76.1          13.1       0.0X

 OpenJDK 64-Bit Server VM 11.0.8+10-LTS on Mac OS X 10.15.5
 Intel(R) Core(TM) i9-9880H CPU  2.30GHz
 SHORT Encode (Higher Skew):               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------------------------------
-PassThrough(1.000)                                    3              4           0      21352.2           0.0       1.0X
-RunLengthEncoding(2.009)                           1173           1175           3         57.2          17.5       0.0X
+PassThrough(1.000)                                    3              3           0      22388.6           0.0       1.0X
+RunLengthEncoding(2.015)                            924            941          23         72.6          13.8       0.0X

 OpenJDK 64-Bit Server VM 11.0.8+10-LTS on Mac OS X 10.15.5
 Intel(R) Core(TM) i9-9880H CPU  2.30GHz
 INT Encode (Lower Skew):                  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------------------------------
-PassThrough(1.000)                                    9             10           1       7410.1           0.1       1.0X
-RunLengthEncoding(1.000)                           1499           1502           4         44.8          22.3       0.0X
-DictionaryEncoding(0.500)                           621            630          11        108.0           9.3       0.0X
-IntDelta(0.250)                                     134            149          10        502.0           2.0       0.1X
+PassThrough(1.000)                                    9             10           1       7575.9           0.1       1.0X
+RunLengthEncoding(1.002)                            952            966          12         70.5          14.2       0.0X
+DictionaryEncoding(0.500)                           561            567           6        119.7           8.4       0.0X
+IntDelta(0.250)                                     129            134           3        521.9           1.9       0.1X

 OpenJDK 64-Bit Server VM 11.0.8+10-LTS on Mac OS X 10.15.5
 Intel(R) Core(TM) i9-9880H CPU  2.30GHz
 INT Encode (Higher Skew):                 Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------------------------------
-PassThrough(1.000)                                    9             10           1       7668.3           0.1       1.0X
-RunLengthEncoding(1.332)                           1561           1685         175         43.0          23.3       0.0X
-DictionaryEncoding(0.501)                           616            642          21        108.9           9.2       0.0X
-IntDelta(0.250)                                     126            131           2        533.4           1.9       0.1X
+PassThrough(1.000)                                    9             10           1       7494.1           0.1       1.0X
+RunLengthEncoding(1.336)                            974            987          13         68.9          14.5       0.0X
+DictionaryEncoding(0.501)                           709            719          10         94.6          10.6       0.0X
+IntDelta(0.250)                                     127            132           4        528.4           1.9       0.1X

 OpenJDK 64-Bit Server VM 11.0.8+10-LTS on Mac OS X 10.15.5
 Intel(R) Core(TM) i9-9880H CPU  2.30GHz
 LONG Encode (Lower Skew):                 Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------------------------------
-PassThrough(1.000)                                   18             19           1       3803.0           0.3       1.0X
-RunLengthEncoding(0.754)                           1526           1540          20         44.0          22.7       0.0X
-DictionaryEncoding(0.250)                           735            759          33         91.3          11.0       0.0X
-LongDelta(0.125)                                    126            129           2        530.8           1.9       0.1X
+PassThrough(1.000)                                   19             21           1       3543.5           0.3       1.0X
+RunLengthEncoding(0.747)                           1049           1058          12         63.9          15.6       0.0X
+DictionaryEncoding(0.250)                           620            634          17        108.2           9.2       0.0X
+LongDelta(0.125)                                    129            132           2        520.1           1.9       0.1X

 OpenJDK 64-Bit Server VM 11.0.8+10-LTS on Mac OS X 10.15.5
 Intel(R) Core(TM) i9-9880H CPU  2.30GHz
 LONG Encode (Higher Skew):                Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------------------------------
-PassThrough(1.000)                                   18             20           1       3705.4           0.3       1.0X
-RunLengthEncoding(1.002)                           1665           1669           6         40.3          24.8       0.0X
-DictionaryEncoding(0.251)                           890            901          11         75.4          13.3       0.0X
-LongDelta(0.125)                                    125            130           3        537.2           1.9       0.1X
+PassThrough(1.000)                                   18             20           2       3726.8           0.3       1.0X
+RunLengthEncoding(0.999)                           1076           1077           2         62.4          16.0       0.0X
+DictionaryEncoding(0.251)                           904            919          19         74.3          13.5       0.0X
+LongDelta(0.125)                                    125            131           4        536.5           1.9       0.1X

 OpenJDK 64-Bit Server VM 11.0.8+10-LTS on Mac OS X 10.15.5
 Intel(R) Core(TM) i9-9880H CPU  2.30GHz
 STRING Encode:                            Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
 ------------------------------------------------------------------------------------------------------------------------
-PassThrough(1.000)                                   27             30           2       2497.1           0.4       1.0X
-RunLengthEncoding(0.892)                           3443           3587         204         19.5          51.3       0.0X
-DictionaryEncoding(0.167)                          2286           2290           6         29.4          34.1       0.0X
+PassThrough(1.000)                                   28             31           2       2430.2           0.4       1.0X
+RunLengthEncoding(0.889)                           1798           1800           3         37.3          26.8       0.0X
+DictionaryEncoding(0.167)                          1956           1959           4         34.3          29.1       0.0X
```

In the above diff, the new results are with the changes in this PR. It can be seen that encoding performance has improved quite a lot, especially for the string type.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Relies on existing unit tests.

Closes #29654 from sunchao/SPARK-32802.

Authored-by: Chao Sun <sunchao@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-09-12 22:19:30 -07:00
Chao Sun 3d08084022 [SPARK-24994][SQL] Add UnwrapCastInBinaryComparison optimizer to simplify literal types
### What changes were proposed in this pull request?

Currently, in cases like the following:
```sql
SELECT * FROM t WHERE age < 40
```
where `age` is of short type, Spark won't be able to simplify this and can only generate the filter `cast(age, int) < 40`. This won't get pushed down to data sources and therefore is not optimized.

This PR proposes an optimizer rule to improve this when the following constraints are satisfied:
 - the input expression is a binary comparison where one side is a cast operation and the other side is a literal.
 - both the cast's child expression and the literal are of integral type (i.e., byte, short, int or long)

When this is true, it tries to do several optimizations to either simplify the expression or move the cast to the literal side, so the resulting filter for the above case becomes `age < cast(40 as smallint)`. This is better since the cast can be optimized away later and the filter can be pushed down to data sources.

This PR follows a similar effort in Presto (https://prestosql.io/blog/2019/05/21/optimizing-the-casts-away.html). Here we only handle integral types but plan to extend to other types in follow-ups.
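A rough way to observe the effect; the path and names below are illustrative.

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: with the new rule, the pushed-down Parquet filter for the query below
// should reference the SHORT column directly (age < 40) instead of cast(age as int) < 40.
object UnwrapCastDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("unwrap-cast").getOrCreate()

    val path = "/tmp/unwrap_cast_demo"  // hypothetical path
    spark.range(100).selectExpr("cast(id as short) as age")
      .write.mode("overwrite").parquet(path)

    spark.read.parquet(path).filter("age < 40").explain(true)

    spark.stop()
  }
}
```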

### Why are the changes needed?

As mentioned in the previous section, when the cast is not optimized, the filter cannot be pushed down to data sources, which can lead to unnecessary IO and therefore longer job times and wasted resources. This helps to improve that.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added unit tests for both the optimizer rule and filter pushdown on datasource level for both Orc and Parquet.

Closes #29565 from sunchao/SPARK-24994.

Authored-by: Chao Sun <sunchao@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-09-12 21:34:35 -07:00
Takeshi Yamamuro 4269c2c252 [SPARK-32851][SQL][TEST] Tests should fail if errors happen when generating projection code
### What changes were proposed in this pull request?

This PR intends to set `CODEGEN_ONLY` at `CODEGEN_FACTORY_MODE` in test spark context so that tests can fail if errors happen when generating expr code.

### Why are the changes needed?

I noticed that the code generation of `SafeProjection` failed in an existing test (https://issues.apache.org/jira/browse/SPARK-32828), but the test still passed because `FALLBACK` was set at `CODEGEN_FACTORY_MODE` (by default) in `SharedSparkSession`. To become aware of such failures quickly, I think it's worth setting `CODEGEN_ONLY` at `CODEGEN_FACTORY_MODE`.
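For reference, a minimal sketch of how a session could be configured this way outside the test framework; the config key string is an assumption (the suite uses the `SQLConf` constant `CODEGEN_FACTORY_MODE` directly):

```scala
import org.apache.spark.sql.SparkSession

// Force codegen-only mode so that a failure in generating projection code
// surfaces as an error instead of silently falling back to interpreted mode.
val spark = SparkSession.builder()
  .master("local[1]")
  .config("spark.sql.codegen.factoryMode", "CODEGEN_ONLY") // assumed key
  .getOrCreate()
```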

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #29721 from maropu/ExprCodegenTest.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-09-12 08:42:07 +09:00
Dongjoon Hyun b4be6a6d12 [SPARK-32845][SS][TESTS] Add sinkParameter to check sink options robustly in DataStreamReaderWriterSuite
### What changes were proposed in this pull request?

This PR aims to add `sinkParameter`  to check sink options robustly and independently in DataStreamReaderWriterSuite

### Why are the changes needed?

`LastOptions.parameters` is designed to catch three cases: `sourceSchema`, `createSource`, `createSink`. However, `StreamingQuery.stop` invokes `queryExecutionThread.join`, `runStream`, `createSource` immediately and resets the options stored by `createSink`.

To catch the `createSink` options, the test suite currently uses a workaround pattern. However, we have sometimes observed flakiness in this pattern. If we track the `createSink` options separately, we don't need this workaround and can eliminate this flakiness.

```scala
val query = df.writeStream.
   ...
   .start()
assert(LastOptions.parameters(..))
query.stop()
```

### Does this PR introduce _any_ user-facing change?

No. This is a test-only change.

### How was this patch tested?

Pass the newly updated test case.

Closes #29730 from dongjoon-hyun/SPARK-32845.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-09-11 11:48:34 -07:00
Peter Toth 94cac5978c [SPARK-32730][SQL][FOLLOW-UP] Improve LeftAnti SortMergeJoin right side buffering
### What changes were proposed in this pull request?

This is a follow-up to https://github.com/apache/spark/pull/29572.

LeftAnti SortMergeJoin should not buffer all matching right side rows when bound condition is empty, this is unnecessary and can lead to performance degradation especially when spilling happens.

### Why are the changes needed?

Performance improvement.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New UT.

Closes #29727 from peter-toth/SPARK-32730-improve-leftsemi-sortmergejoin-followup.

Authored-by: Peter Toth <peter.toth@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-09-11 13:42:33 +00:00
Wenchen Fan 9f4f49cbaa [SPARK-32853][SQL] Consecutive save/load calls in DataFrame/StreamReader/Writer should not fail
### What changes were proposed in this pull request?

This is a followup of https://github.com/apache/spark/pull/29328

In https://github.com/apache/spark/pull/29328, we forbade the use case where the path option and the path parameter are both specified. However, it breaks some use cases:
```
val dfr =  spark.read.format(...).option(...)
dfr.load(path1).xxx
dfr.load(path2).xxx
```

The reason is that `load` has side effects: it sets the path option on the `DataFrameReader` instance. The next time you call `load`, Spark fails because both the path option and the path parameter are specified.

This PR removes the side effect of `save`/`load`/`start` so that they do not set the path option.
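A concrete variant of the snippet above (format, option, and paths are placeholders), which works again after this change:

```scala
// Reuse one DataFrameReader for several load() calls; load() no longer
// stores the path as an option on the reader.
val dfr = spark.read.format("parquet").option("mergeSchema", "true")
val df1 = dfr.load("/tmp/path1")
val df2 = dfr.load("/tmp/path2")
```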

### Why are the changes needed?

Recover the use cases described above.

### Does this PR introduce _any_ user-facing change?

Yes, some use cases fail before this PR, and can run successfully after this PR.

### How was this patch tested?

new tests

Closes #29723 from cloud-fan/df.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-09-11 06:15:58 -07:00
Wenchen Fan 328d81a2d1 [SPARK-32677][SQL][DOCS][MINOR] Improve code comment in CreateFunctionCommand
### What changes were proposed in this pull request?

We made a mistake in https://github.com/apache/spark/pull/29502, as there is no code comment to explain why we can't load the UDF class when creating functions. This PR improves the code comment.

### Why are the changes needed?

To avoid making the same mistake.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

N/A

Closes #29713 from cloud-fan/comment.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-09-11 09:22:56 +09:00
Kousuke Saruta 5f468cc21e [SPARK-32822][SQL] Change the number of partitions to zero when a range is empty with WholeStageCodegen disabled or falled back
### What changes were proposed in this pull request?

This PR changes the behavior of RangeExec when WholeStageCodegen is disabled or fallen back, so that the number of partitions becomes zero when a range is empty.

In the current master, if WholeStageCodegen takes effect, the number of partitions of an empty range is changed to zero.
```
spark.range(1, 1, 1, 1000).rdd.getNumPartitions
res0: Int = 0
```
But it isn't if WholeStageCodegen is disabled or fallen back.
```
spark.conf.set("spark.sql.codegen.wholeStage", false)
spark.range(1, 1, 1, 1000).rdd.getNumPartitions
res2: Int = 1000
```

### Why are the changes needed?

To achieve better performance even when WholeStageCodegen is disabled or fallen back.

### Does this PR introduce _any_ user-facing change?

Yes. The number of partitions returned by `getNumPartitions` for an empty range changes when WholeStageCodegen is disabled.

### How was this patch tested?

New test.

Closes #29681 from sarutak/zero-size-range.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-09-11 09:11:35 +09:00
gengjiaan a22871f50a [SPARK-32777][SQL] Aggregation support aggregate function with multiple foldable expressions
### What changes were proposed in this pull request?
Spark SQL has a bug, shown below:

```
spark.sql(
  " SELECT COUNT(DISTINCT 2), COUNT(DISTINCT 2, 3)")
  .show()
+-----------------+--------------------+
|count(DISTINCT 2)|count(DISTINCT 2, 3)|
+-----------------+--------------------+
|                1|                   1|
+-----------------+--------------------+

spark.sql(
  " SELECT COUNT(DISTINCT 2), COUNT(DISTINCT 3, 2)")
  .show()
+-----------------+--------------------+
|count(DISTINCT 2)|count(DISTINCT 3, 2)|
+-----------------+--------------------+
|                1|                   0|
+-----------------+--------------------+
```
The first query is correct, but the second query is not.
The root cause is that the second query is rewritten by `RewriteDistinctAggregates`, which expands the output but loses the literal 2.

### Why are the changes needed?
Fix a bug.
`SELECT COUNT(DISTINCT 2), COUNT(DISTINCT 3, 2)` should return `1, 1`

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
New UT

Closes #29626 from beliefer/support-multiple-foldable-distinct-expressions.

Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: beliefer <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-09-10 11:25:32 +00:00
Kent Yao 5669b212ec [SPARK-32840][SQL] Invalid interval value can happen to be just adhesive with the unit
### What changes were proposed in this pull request?
In this PR, we add a check on the STRING-form interval value ahead of parsing multi-unit intervals, and fail directly if the interval value contains alphabetic characters, to prevent correctness issues like `interval '1 day 2' day` = `3 days`.

### Why are the changes needed?

Fix a correctness issue.

### Does this PR introduce _any_ user-facing change?

Yes. In Spark 3.0.0, `interval '1 day 2' day` = `3 days`, but now it fails with a ParseException.

### How was this patch tested?

add a test.

Closes #29708 from yaooqinn/SPARK-32840.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-09-10 11:20:05 +00:00
Takeshi Yamamuro 7eb76d6988 [SPARK-32828][SQL] Cast from a derived user-defined type to a base type
### What changes were proposed in this pull request?

This PR intends to fix an existing bug in `UserDefinedTypeSuite`, shown below:
```
[info] - SPARK-19311: UDFs disregard UDT type hierarchy (931 milliseconds)
16:22:35.936 WARN org.apache.spark.sql.catalyst.expressions.SafeProjection: Expr codegen error and falling back to interpreter mode
org.apache.spark.SparkException: Cannot cast org.apache.spark.sql.ExampleSubTypeUDT46b1771f to org.apache.spark.sql.ExampleBaseTypeUDT31e8d979.
	at org.apache.spark.sql.catalyst.expressions.CastBase.nullSafeCastFunction(Cast.scala:891)
	at org.apache.spark.sql.catalyst.expressions.CastBase.doGenCode(Cast.scala:852)
	at org.apache.spark.sql.catalyst.expressions.Expression.$anonfun$genCode$3(Expression.scala:147)
    ...
```

### Why are the changes needed?

bugfix

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added unit tests.

Closes #29691 from maropu/FixUdtBug.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-09-10 19:19:26 +09:00
Jungtaek Lim (HeartSaVioR) 8f61005723 [SPARK-32456][SS][FOLLOWUP] Update doc to note about using SQL statement with streaming Dataset
### What changes were proposed in this pull request?

This patch proposes to update the doc (both SS guide doc and Dataset dropDuplicates method doc) to leave a note to check on using SQL statements with streaming Dataset.

Once end users create a temp view based on a streaming Dataset, they may stop thinking about "streaming" and do whatever they would do with a batch query. In many cases this works, but not so smoothly when streaming aggregation is involved: they still need to be concerned about maintaining the state store.
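As a hedged illustration of this note (the rate source and column names are only for demonstration), SQL over such a temp view still yields a streaming Dataset with the usual state-store obligations:

```scala
// A temp view backed by a streaming Dataset keeps its streaming nature.
val events = spark.readStream.format("rate").option("rowsPerSecond", "10").load()
events.createOrReplaceTempView("events")

// This is a streaming aggregation: it must be started with writeStream and it
// maintains state in the state store across micro-batches.
val counts = spark.sql(
  "SELECT value % 10 AS bucket, count(*) AS cnt FROM events GROUP BY value % 10")
assert(counts.isStreaming)
```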

### Why are the changes needed?

Although SPARK-32456 fixed the weird error message, as a side effect some operations became enabled on streaming workloads via SQL statements, which is error-prone if end users aren't aware of what they're doing.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Only doc change.

Closes #29461 from HeartSaVioR/SPARK-32456-FOLLOWUP-DOC.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-09-10 08:10:32 +00:00
Dongjoon Hyun 2f85f9516c [SPARK-32832][SS] Use CaseInsensitiveMap for DataStreamReader/Writer options
### What changes were proposed in this pull request?

This PR aims to fix non-deterministic behavior of DataStreamReader/Writer options like the following.
```scala
scala> spark.readStream.format("parquet").option("paTh", "1").option("PATH", "2").option("Path", "3").option("patH", "4").option("path", "5").load()
org.apache.spark.sql.AnalysisException: Path does not exist: 1;
```

### Why are the changes needed?

This will make the behavior deterministic.

### Does this PR introduce _any_ user-facing change?

Yes, but the previous behavior was non-deterministic.

### How was this patch tested?

Pass the newly added test cases.

Closes #29702 from dongjoon-hyun/SPARK-32832.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-09-09 23:41:32 -07:00
Jungtaek Lim (HeartSaVioR) db89b0e1b8 [SPARK-32831][SS] Refactor SupportsStreamingUpdate to represent actual meaning of the behavior
### What changes were proposed in this pull request?

This PR renames `SupportsStreamingUpdate` to `SupportsStreamingUpdateAsAppend` as the new interface name represents the actual behavior clearer. This PR also removes the `update()` method (so the interface is more likely a marker), as the implementations of `SupportsStreamingUpdateAsAppend` should support append mode by default, hence no need to trigger some flag on it.

### Why are the changes needed?

SupportsStreamingUpdate was intended to revive the functionality of the streaming update output mode for internal data sources, but despite the name, that interface isn't really used to do actual updates on the sink; all sinks implement this interface to do append, so strictly speaking, it just supports update as append. Renaming the interface makes this clear.

### Does this PR introduce _any_ user-facing change?

No, as the class is only for internal data sources.

### How was this patch tested?

Jenkins test will follow.

Closes #29693 from HeartSaVioR/SPARK-32831.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
2020-09-10 15:33:18 +09:00
Liang-Chi Hsieh add267c4de [SPARK-32819][SQL] ignoreNullability parameter should be effective recursively
### What changes were proposed in this pull request?

This patch proposes to check `ignoreNullability` parameter recursively in `equalsStructurally` method.

### Why are the changes needed?

`equalsStructurally` is used to check type equality. We can optionally ask it to ignore the nullability check. But the parameter `ignoreNullability` is not passed recursively down to nested types, so it produces a weird error like:

```
data type mismatch: argument 3 requires array<array<string>> type, however ... is of array<array<string>> type.
```

when running the query `select aggregate(split('abcdefgh',''), array(array('')), (acc, x) -> array(array( x ) ) )`.

### Does this PR introduce _any_ user-facing change?

Yes, fixed a bug when running user query.

### How was this patch tested?

Unit tests.

Closes #29698 from viirya/SPARK-32819.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-09-10 02:53:22 +00:00
Dongjoon Hyun 06a994517f [SPARK-32836][SS][TESTS] Fix DataStreamReaderWriterSuite to check writer options correctly
### What changes were proposed in this pull request?

This PR aims to fix the test coverage at `DataStreamReaderWriterSuite`.

### Why are the changes needed?

Currently, the test case checks `DataStreamReader` options instead of `DataStreamWriter` options.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the revised test case.

Closes #29701 from dongjoon-hyun/SPARK-32836.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-09-09 19:46:55 -07:00
Terry Kim ab2fa881ed [SPARK-32516][SQL][FOLLOWUP] Remove unnecessary check if path string is empty for DataFrameWriter.save(), DataStreamReader.load() and DataStreamWriter.start()
### What changes were proposed in this pull request?

This PR is a follow up to https://github.com/apache/spark/pull/29543#discussion_r485409606, which correctly points out that the check for the empty string is not necessary.

### Why are the changes needed?

The unnecessary check actually could cause more confusion.

For example,
```scala
scala> Seq(1).toDF.write.option("path", "/tmp/path1").parquet("")
java.lang.IllegalArgumentException: Can not create a Path from an empty string
  at org.apache.hadoop.fs.Path.checkPathArg(Path.java:168)
```
even when the `path` option is available. This PR fixes this confusion.

### Does this PR introduce _any_ user-facing change?

Yes, now the above example prints the consistent exception message whether the path parameter value is empty or not.
```scala
scala> Seq(1).toDF.write.option("path", "/tmp/path1").parquet("")
org.apache.spark.sql.AnalysisException: There is a 'path' option set and save() is called with a path parameter. Either remove the path option, or call save() without the parameter. To ignore this check, set 'spark.sql.legacy.pathOptionBehavior.enabled' to 'true'.;
  at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:290)
  at org.apache.spark.sql.DataFrameWriter.parquet(DataFrameWriter.scala:856)
  ... 47 elided
```

### How was this patch tested?

Added unit tests.

Closes #29697 from imback82/SPARK-32516-followup.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-09-10 01:48:51 +00:00
Wenchen Fan f7995c576a Revert "[SPARK-32677][SQL] Load function resource before create"
This reverts commit 05fcf26b79.
2020-09-09 18:15:22 +00:00
Tathagata Das e4237bbda6 [SPARK-32794][SS] Fixed rare corner case error in micro-batch engine with some stateful queries + no-data-batches + V1 sources
### What changes were proposed in this pull request?
Make MicroBatchExecution explicitly call `getBatch` when the start and end offsets are the same.

### Why are the changes needed?

Structured Streaming micro-batch engine has the contract with V1 data sources that, after a restart, it will call `source.getBatch()` on the last batch attempted before the restart. However, a very rare combination of sequences violates this contract. It occurs only when
- The streaming query has specific types of stateful operations with watermarks (e.g., aggregation in append, mapGroupsWithState with timeouts).
    - These queries can execute a batch even without new data when the previous updates the watermark and the stateful ops are such that the new watermark can cause new output/cleanup. Such batches are called no-data-batches.
- The last batch before termination was an incomplete no-data-batch. Upon restart, the micro-batch engine fails to call `source.getBatch` when attempting to re-execute the incomplete no-data-batch.

This occurs because no-data-batches have the same start and end offsets, and when a batch is executed, if the start and end offsets are the same then calling `source.getBatch` is skipped, as it is assumed the generated plan will be empty. This only affects V1 data sources like Delta and Autoloader which rely on this invariant to detect in the source whether the query is being started from scratch or restarted.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

New unit test with a mock v1 source that fails without the fix.

Closes #29651 from tdas/SPARK-32794.

Authored-by: Tathagata Das <tathagata.das1565@gmail.com>
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
2020-09-09 13:35:51 -04:00
yangjie01 513d51a2c5 [SPARK-32808][SQL] Fix some test cases of sql/core module in scala 2.13
### What changes were proposed in this pull request?
The purpose of this PR is to partially resolve [SPARK-32808](https://issues.apache.org/jira/browse/SPARK-32808); a total of 26 failed test cases were fixed. The related suites are as follows:

- `StreamingAggregationSuite` related test cases (2 FAILED -> Pass)

- `GeneratorFunctionSuite` related test cases (2 FAILED -> Pass)

- `UDFSuite` related test cases (2 FAILED -> Pass)

- `SQLQueryTestSuite` related test cases (5 FAILED -> Pass)

- `WholeStageCodegenSuite` related test cases (1 FAILED -> Pass)

- `DataFrameSuite` related test cases (3 FAILED -> Pass)

- `OrcV1QuerySuite\OrcV2QuerySuite` related test cases (4 FAILED -> Pass)

- `ExpressionsSchemaSuite` related test cases (1 FAILED -> Pass)

- `DataFrameStatSuite` related test cases (1 FAILED -> Pass)

- `JsonV1Suite\JsonV2Suite\JsonLegacyTimeParserSuite` related test cases (6 FAILED -> Pass)

The main changes of this PR are as follows:

- Fix Scala 2.13 compilation problems in `ShuffleBlockFetcherIterator` and `Analyzer`

- Specify `Seq` as `scala.collection.Seq` in `objects.scala` and `GenericArrayData` because the internally used `Seq` may be a `mutable.ArraySeq` and it is not easy to call `.toSeq`

- Specify `Seq` as `scala.collection.Seq` when we call `Row.getAs[Seq]` and `Row.get(i).asInstanceOf[Seq]` because the data may be a `mutable.ArraySeq`, while `Seq` is `immutable.Seq` in Scala 2.13

- Use a compatible way to let the `+` and `-` methods of `Decimal` have the same behavior in Scala 2.12 and Scala 2.13

- Call `toList` in the `RelationalGroupedDataset.toDF` method when `groupingExprs` is of `Stream` type because a `Stream` can't be serialized in Scala 2.13

- Add a manual sort to `classFunsMap` in `ExpressionsSchemaSuite` because `Iterable.groupBy` in Scala 2.13 has a different result from `TraversableLike.groupBy` in Scala 2.12

### Why are the changes needed?
We need to support a Scala 2.13 build.

### Does this PR introduce _any_ user-facing change?

Callers should specify `Seq` as `scala.collection.Seq` when calling `Row.getAs[Seq]` and `Row.get(i).asInstanceOf[Seq]`, because the data may be a `mutable.ArraySeq` while `Seq` is `immutable.Seq` in Scala 2.13, as sketched below.
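A small, hypothetical illustration of this note (not taken from the PR):

```scala
import org.apache.spark.sql.Row

val row = Row(Seq(1, 2, 3))

// Works under both Scala 2.12 and 2.13, because scala.collection.Seq is the
// common supertype of the mutable and immutable sequences Spark may return:
val ok: scala.collection.Seq[Int] = row.getAs[scala.collection.Seq[Int]](0)

// Under Scala 2.13, asking for the default (immutable) Seq may fail at runtime
// if the underlying value is a mutable.ArraySeq:
// val risky: Seq[Int] = row.getAs[Seq[Int]](0)
```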

### How was this patch tested?

- Scala 2.12: Pass the Jenkins or GitHub Action

- Scala 2.13: Do the following:

```
dev/change-scala-version.sh 2.13
mvn clean install -DskipTests  -pl sql/core -Pscala-2.13 -am
mvn test -pl sql/core -Pscala-2.13
```

**Before**
```
Tests: succeeded 8166, failed 319, canceled 1, ignored 52, pending 0
*** 319 TESTS FAILED ***

```

**After**

```
Tests: succeeded 8204, failed 286, canceled 1, ignored 52, pending 0
*** 286 TESTS FAILED ***

```

Closes #29660 from LuciferYang/SPARK-32808.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-09-09 08:53:44 -05:00
Liang-Chi Hsieh de0dc52a84 [SPARK-32813][SQL] Get default config of ParquetSource vectorized reader if no active SparkSession
### What changes were proposed in this pull request?

If no active SparkSession is available, let `FileSourceScanExec.needsUnsafeRowConversion` look at the default SQL config of the Parquet source vectorized reader instead of failing the query execution.

### Why are the changes needed?

Fix a bug where, if no active SparkSession is available, a file-based data source scan for the Parquet source throws an exception.

### Does this PR introduce _any_ user-facing change?

Yes, this change fixes the bug.

### How was this patch tested?

Unit test.

Closes #29667 from viirya/SPARK-32813.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-09-09 12:23:05 +09:00
Max Gekk adc8d687ce [SPARK-32810][SQL][TESTS][FOLLOWUP] Check path globbing in JSON/CSV datasources v1 and v2
### What changes were proposed in this pull request?
In the PR, I propose to move the test `SPARK-32810: CSV and JSON data sources should be able to read files with escaped glob metacharacter in the paths` from `DataFrameReaderWriterSuite` to `CSVSuite` and to `JsonSuite`. This will allow running the same test in `CSVv1Suite`/`CSVv2Suite` and in `JsonV1Suite`/`JsonV2Suite`.

### Why are the changes needed?
To improve test coverage by checking JSON/CSV datasources v1 and v2.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running affected test suites:
```
$ build/sbt "sql/test:testOnly org.apache.spark.sql.execution.datasources.csv.*"
$ build/sbt "sql/test:testOnly org.apache.spark.sql.execution.datasources.json.*"
```

Closes #29684 from MaxGekk/globbing-paths-when-inferring-schema-dsv2.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-09-09 10:29:58 +09:00
manuzhang 96ff87dce8 [SPARK-32753][SQL][FOLLOWUP] Fix indentation and clean up view in test
### What changes were proposed in this pull request?
Fix indentation and clean up view in the test added by https://github.com/apache/spark/pull/29593.

### Why are the changes needed?
Address review comments in https://github.com/apache/spark/pull/29665.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Updated test.

Closes #29682 from manuzhang/spark-32753-followup.

Authored-by: manuzhang <owenzhang1990@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-09-09 10:20:21 +09:00
Zhenhua Wang e7d9a24565 [SPARK-32817][SQL] DPP throws error when broadcast side is empty
### What changes were proposed in this pull request?

In `SubqueryBroadcastExec.relationFuture`, if the `broadcastRelation` is an `EmptyHashedRelation`, then `broadcastRelation.keys()` will throw `UnsupportedOperationException`.

### Why are the changes needed?

To fix a bug.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added a new test.

Closes #29671 from wzhfy/dpp_empty_broadcast.

Authored-by: Zhenhua Wang <wzh_zju@163.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-09-08 21:36:21 +09:00
sychen bd3dc2f54d [SPARK-31511][FOLLOW-UP][TEST][SQL] Make BytesToBytesMap iterators thread-safe
### What changes were proposed in this pull request?
Before SPARK-31511 was fixed, `BytesToBytesMap`'s iterator() was not thread-safe and could cause data inaccuracy.
We need to add a unit test.

### Why are the changes needed?
Increase test coverage to ensure that iterator() is thread-safe.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
add ut

Closes #29669 from cxzl25/SPARK-31511-test.

Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-09-08 11:54:04 +00:00
Zhenhua Wang 55d38a479b [SPARK-32748][SQL] Revert "Support local property propagation in SubqueryBroadcastExec"
### What changes were proposed in this pull request?

This reverts commit 04f7f6dac0 due to the discussion in [comment](https://github.com/apache/spark/pull/29589#discussion_r484657207).

### Why are the changes needed?

Based on  the discussion in [comment](https://github.com/apache/spark/pull/29589#discussion_r484657207), propagation for thread local properties in `SubqueryBroadcastExec` is not necessary, since they will be propagated by broadcast exchange threads anyway.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Also revert the added test.

Closes #29674 from wzhfy/revert_dpp_thread_local.

Authored-by: Zhenhua Wang <wzh_zju@163.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-09-08 20:20:16 +09:00
Wenchen Fan 4144b6da52 [SPARK-32764][SQL] -0.0 should be equal to 0.0
### What changes were proposed in this pull request?

This is a Spark 3.0 regression introduced by https://github.com/apache/spark/pull/26761. We missed a corner case: `java.lang.Double.compare` treats 0.0 and -0.0 as different, which breaks SQL semantics.

This PR adds back `OrderingUtil` to provide custom compare methods that take care of 0.0 vs -0.0.
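A quick, self-contained demonstration of the underlying issue (not from the PR):

```scala
// java.lang.Double.compare orders -0.0 before 0.0, which disagrees with
// SQL semantics (and with plain ==) where the two values are equal.
println(java.lang.Double.compare(0.0, -0.0)) // 1
println(0.0 == -0.0)                         // true
println(0.0 > -0.0)                          // false
```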

### Why are the changes needed?

Fix a correctness bug.

### Does this PR introduce _any_ user-facing change?

Yes, now `SELECT 0.0 > -0.0` correctly returns false, as in Spark 2.x.

### How was this patch tested?

new tests

Closes #29647 from cloud-fan/float.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-09-07 20:43:43 -07:00
Max Gekk 954cd9feaa [SPARK-32810][SQL] CSV/JSON data sources should avoid globbing paths when inferring schema
### What changes were proposed in this pull request?
In the PR, I propose to fix an issue with the CSV and JSON data sources in Spark SQL when both of the following are true:
* no user specified schema
* some file paths contain escaped glob metacharacters, such as `[``]`, `{``}`, `*` etc.

### Why are the changes needed?
To fix the issue where the following two queries try to read from the paths `[abc].csv` and `[abc].json`:
```scala
spark.read.csv("""/tmp/\[abc\].csv""").show
spark.read.json("""/tmp/\[abc\].json""").show
```
but would end up hitting an exception:
```
org.apache.spark.sql.AnalysisException: Path does not exist: file:/tmp/[abc].csv;
  at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$checkAndGlobPathIfNecessary$1(DataSource.scala:722)
  at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:244)
  at scala.collection.immutable.List.foreach(List.scala:392)
```

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
Added new test cases in `DataFrameReaderWriterSuite`.

Closes #29659 from MaxGekk/globbing-paths-when-inferring-schema.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-09-08 09:42:59 +09:00
manuzhang c43460cf82 [SPARK-32753][SQL] Only copy tags to node with no tags
### What changes were proposed in this pull request?
Only copy tags to nodes with no tags when transforming plans.

### Why are the changes needed?
cloud-fan [made a good point](https://github.com/apache/spark/pull/29593#discussion_r482013121) that it doesn't make sense to append tags to existing nodes when nodes are removed. That can cause bugs such as duplicate rows when deduplicating and repartitioning by the same column with AQE.

```
spark.range(10).union(spark.range(10)).createOrReplaceTempView("v1")
val df = spark.sql("select id from v1 group by id distribute by id")
println(df.collect().toArray.mkString(","))
println(df.queryExecution.executedPlan)

// With AQE
[4],[0],[3],[2],[1],[7],[6],[8],[5],[9],[4],[0],[3],[2],[1],[7],[6],[8],[5],[9]
AdaptiveSparkPlan(isFinalPlan=true)
+- CustomShuffleReader local
   +- ShuffleQueryStage 0
      +- Exchange hashpartitioning(id#183L, 10), true
         +- *(3) HashAggregate(keys=[id#183L], functions=[], output=[id#183L])
            +- Union
               :- *(1) Range (0, 10, step=1, splits=2)
               +- *(2) Range (0, 10, step=1, splits=2)

// Without AQE
[4],[7],[0],[6],[8],[3],[2],[5],[1],[9]
*(4) HashAggregate(keys=[id#206L], functions=[], output=[id#206L])
+- Exchange hashpartitioning(id#206L, 10), true
   +- *(3) HashAggregate(keys=[id#206L], functions=[], output=[id#206L])
      +- Union
         :- *(1) Range (0, 10, step=1, splits=2)
         +- *(2) Range (0, 10, step=1, splits=2)
```

It's too expensive to detect node removal, so we compromise by only copying tags to nodes with no tags.

### Does this PR introduce _any_ user-facing change?
Yes. Fix a bug.

### How was this patch tested?
Add test.

Closes #29593 from manuzhang/spark-32753.

Authored-by: manuzhang <owenzhang1990@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-09-07 16:08:57 +00:00
Zhenhua Wang 04f7f6dac0 [SPARK-32748][SQL] Support local property propagation in SubqueryBroadcastExec
### What changes were proposed in this pull request?

Since [SPARK-22590](2854091d12), local property propagation is supported through `SQLExecution.withThreadLocalCaptured` in both `BroadcastExchangeExec` and `SubqueryExec` when computing `relationFuture`. This PR adds the same support in `SubqueryBroadcastExec`.

### Why are the changes needed?

Local property propagation is missing in `SubqueryBroadcastExec`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Add a new test.

Closes #29589 from wzhfy/thread_local.

Authored-by: Zhenhua Wang <wzh_zju@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-09-07 06:26:14 +00:00
ulysses 05fcf26b79 [SPARK-32677][SQL] Load function resource before create
### What changes were proposed in this pull request?

Change the `CreateFunctionCommand` code to add a class check before creating the function.

### Why are the changes needed?

We have different behavior between creating a permanent function and a temporary function when the function class is invalid, e.g.:
```
create function f as 'test.non.exists.udf';
-- Time taken: 0.104 seconds

create temporary function f as 'test.non.exists.udf'
-- Error in query: Can not load class 'test.non.exists.udf' when registering the function 'f', please make sure it is on the classpath;
```

And Hive also fails both of them.

### Does this PR introduce _any_ user-facing change?

Yes, users will get an exception when creating an invalid UDF.

### How was this patch tested?

New test.

Closes #29502 from ulysses-you/function.

Authored-by: ulysses <youxiduo@weidian.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-09-07 06:00:23 +00:00
Kent Yao de44e9cfa0 [SPARK-32785][SQL] Interval with dangling parts should not results null
### What changes were proposed in this pull request?

This is a bugfix for incomplete interval values, e.g. interval '1', interval '1 day 2'. Currently these cases result in null, but we should actually fail them with an IllegalArgumentException.

### Why are the changes needed?

correctness

### Does this PR introduce _any_ user-facing change?

Yes, incomplete intervals now throw an exception.

#### before
```
bin/spark-sql -S -e "select interval '1', interval '+', interval '1 day -'"

NULL NULL NULL
```
#### after

```
-- !query
select interval '1'
-- !query schema
struct<>
-- !query output
org.apache.spark.sql.catalyst.parser.ParseException

Cannot parse the INTERVAL value: 1(line 1, pos 7)

== SQL ==
select interval '1'
```

### How was this patch tested?

unit tests added

Closes #29635 from yaooqinn/SPARK-32785.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-09-07 05:11:30 +00:00
Eren Avsarogullari f5360e761e [SPARK-32548][SQL] - Add Application attemptId support to SQL Rest API
### What changes were proposed in this pull request?
Currently, Spark's public REST APIs support the application attemptId, except for the SQL API. This causes a `no such app: application_X` issue when the application has an `attemptId` (e.g., YARN cluster mode).

Please find below the existing REST endpoints and the endpoints that require attemptId support.
```
// Existing Rest Endpoints
applications/{appId}/sql
applications/{appId}/sql/{executionId}

// Rest Endpoints required support
applications/{appId}/{attemptId}/sql
applications/{appId}/{attemptId}/sql/{executionId}
```
Also fixing following compile warning on `SqlResourceSuite`:
```
[WARNING] [Warn] ~/spark/sql/core/src/test/scala/org/apache/spark/status/api/v1/sql/SqlResourceSuite.scala:67: Reference to uninitialized value edges
```
### Why are the changes needed?
This causes a `no such app: application_X` issue when the application has an `attemptId`.

### Does this PR introduce _any_ user-facing change?
Not yet, because the SQL REST API is planned to be released with `Spark 3.1`.

### How was this patch tested?
1. New unit tests are added for the existing REST endpoints. `attemptId` does not seem to be set in local mode, only in YARN cluster mode, so a test for the `attemptId` case could not be added (suggestions are welcome).
2. Also, the patch has been tested manually through both the Spark Core and History Server REST APIs.

Closes #29364 from erenavsarogullari/SPARK-32548.

Authored-by: Eren Avsarogullari <erenavsarogullari@gmail.com>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
2020-09-06 19:23:12 +08:00
Yuming Wang 0b3bb45b89 [SPARK-32791][SQL] Non-partitioned table metric should not have dynamic partition pruning time
### What changes were proposed in this pull request?

This PR makes sure the metrics of a non-partitioned table do not include dynamic partition pruning time.

### Why are the changes needed?

It is useless for non-partitioned tables.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual test

Before this pr:
![image](https://user-images.githubusercontent.com/5399861/92141803-87fed380-ee45-11ea-9784-09625b246fea.png)
After this pr:
![image](https://user-images.githubusercontent.com/5399861/92141774-7c131180-ee45-11ea-8a9e-6775c592f496.png)

Closes #29641 from wangyum/SPARK-32791.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Yuming Wang <yumwang@ebay.com>
2020-09-05 23:49:17 +08:00
yangjie 1de272f98d [SPARK-32762][SQL][TEST] Enhance the verification of ExpressionsSchemaSuite to sql-expression-schema.md
### What changes were proposed in this pull request?
`sql-expression-schema.md` is automatically generated by `ExpressionsSchemaSuite`, but only the expression entries are checked in `ExpressionsSchemaSuite`. So if we manually modify the contents of the file, `ExpressionsSchemaSuite` does not necessarily guarantee its correctness. For example, [Spark-24884](https://github.com/apache/spark/pull/27507) added `regexp_extract_all` expression support and manually modified `sql-expression-schema.md`, but did not change the `Number of queries` content, causing file content inconsistency.

Some additional checks have been added to `ExpressionsSchemaSuite` to improve the correctness guarantee of `sql-expression-schema.md`, as follows:

- `Number of queries` should equal the size of the `expressions entries` in `sql-expression-schema.md`

- `Number of expressions that missing example` should equal the size of `Expressions missing examples` in `sql-expression-schema.md`

- `MissExamples` from the test case should be the same as `expectedMissingExamples` from `sql-expression-schema.md`

### Why are the changes needed?
Ensure the correctness of `sql-expression-schema.md` content.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Enhanced ExpressionsSchemaSuite

Closes #29608 from LuciferYang/sql-expression-schema.

Authored-by: yangjie <yangjie@MacintoshdeMacBook-Pro.local>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-09-04 09:40:35 +09:00
Zhenhua Wang e693df2a07 [SPARK-32786][SQL][TEST] Improve performance for some slow DPP tests
### What changes were proposed in this pull request?

The whole `DynamicPartitionPruningSuite` takes about 2 min on my laptop (with AE either on or off). The slowest tests are `test("simple inner join triggers DPP with mock-up tables")` and `test("cleanup any DPP filter that isn't pushed down due to expression id clashes")`, which together take about 1 min.

We can reuse existing test tables or use smaller tables to reduce the cost. After that, the two tests take only about 1 sec in total, leading to a 2x speedup for the suite.

### Why are the changes needed?

To speed up the DPP test suites.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Modified two existing tests.

Closes #29636 from wzhfy/improve_dpp_test.

Authored-by: Zhenhua Wang <wzh_zju@163.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-09-04 09:33:20 +09:00
Wenchen Fan 76330e0295 [SPARK-32788][SQL] non-partitioned table scan should not have partition filter
### What changes were proposed in this pull request?

This PR fixes a bug in `FileSourceStrategy`, which generates partition filters even if the table is not partitioned. This can confuse `FileSourceScanExec`, which mistakenly thinks the table is partitioned and tries to update the `numPartitions` metrics, causing a failure. We should not generate partition filters for a non-partitioned table.

### Why are the changes needed?

The bug was exposed by https://github.com/apache/spark/pull/29436.

### Does this PR introduce _any_ user-facing change?

Yes, fix a bug.

### How was this patch tested?

new test

Closes #29637 from cloud-fan/refactor.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Yuming Wang <yumwang@ebay.com>
2020-09-03 23:49:17 +08:00
Takeshi Yamamuro a6114d8fb8 [SPARK-32638][SQL] Corrects references when adding aliases in WidenSetOperationTypes
### What changes were proposed in this pull request?

This PR intends to fix a bug where references can be missing when adding aliases to widen data types in `WidenSetOperationTypes`. For example,
```
CREATE OR REPLACE TEMPORARY VIEW t3 AS VALUES (decimal(1)) tbl(v);
SELECT t.v FROM (
  SELECT v FROM t3
  UNION ALL
  SELECT v + v AS v FROM t3
) t;

org.apache.spark.sql.AnalysisException: Resolved attribute(s) v#1 missing from v#3 in operator !Project [v#1]. Attribute(s) with the same name appear in the operation: v. Please check if the right attribute(s) are used.;;
!Project [v#1]  <------ the reference got missing
+- SubqueryAlias t
   +- Union
      :- Project [cast(v#1 as decimal(11,0)) AS v#3]
      :  +- Project [v#1]
      :     +- SubqueryAlias t3
      :        +- SubqueryAlias tbl
      :           +- LocalRelation [v#1]
      +- Project [v#2]
         +- Project [CheckOverflow((promote_precision(cast(v#1 as decimal(11,0))) + promote_precision(cast(v#1 as decimal(11,0)))), DecimalType(11,0), true) AS v#2]
            +- SubqueryAlias t3
               +- SubqueryAlias tbl
                  +- LocalRelation [v#1]
```
In this case, `WidenSetOperationTypes` added the alias `cast(v#1 as decimal(11,0)) AS v#3`, and then the reference in the top `Project` went missing. This PR corrects the reference (the `exprId` and the widened `dataType`) after adding aliases in the rule.

### Why are the changes needed?

bugfixes

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Added unit tests

Closes #29485 from maropu/SPARK-32638.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-09-03 14:48:26 +00:00
Peter Toth ffd5227543 [SPARK-32730][SQL] Improve LeftSemi and Existence SortMergeJoin right side buffering
### What changes were proposed in this pull request?

LeftSemi and Existence SortMergeJoin should not buffer all matching right side rows when bound condition is empty, this is unnecessary and can lead to performance degradation especially when spilling happens.

### Why are the changes needed?

Performance improvement.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New UT and TPCDS benchmarks.

Closes #29572 from peter-toth/SPARK-32730-improve-leftsemi-sortmergejoin.

Authored-by: Peter Toth <peter.toth@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-09-03 14:17:34 +00:00
Ali Afroozeh 0a6043f683 [SPARK-32755][SQL] Maintain the order of expressions in AttributeSet and ExpressionSet
### What changes were proposed in this pull request?
This PR changes `AttributeSet` and `ExpressionSet` to maintain the insertion order of their elements. More specifically, we:
- change the underlying data structure of `AttributeSet` from `HashSet` to `LinkedHashSet` to maintain the insertion order (see the sketch after this list).
- `ExpressionSet` already uses a list to keep track of the expressions; however, since it extends Scala's immutable.Set class, operations such as map and flatMap are delegated to the immutable.Set itself. This means that the result of these operations is no longer an instance of ExpressionSet; rather it's an implementation picked up by the parent class. We remove this inheritance from `immutable.Set` and implement the needed methods directly. ExpressionSet has very specific semantics and it does not make sense to extend `immutable.Set` anyway.
- change the `PlanStabilitySuite` to not sort the attributes, to be able to catch changes in the order of expressions in different runs.
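Below is a minimal, plain-Scala sketch (not Spark code) of the data-structure difference the first bullet relies on:

```scala
import scala.collection.mutable

// LinkedHashSet preserves insertion order; HashSet iteration order depends on hashing.
val hashed = mutable.HashSet("c", "a", "b")
val linked = mutable.LinkedHashSet("c", "a", "b")

println(hashed.toSeq) // not guaranteed to be c, a, b
println(linked.toSeq) // List(c, a, b) -- insertion order preserved
```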

### Why are the changes needed?
Expression identity is based on the `ExprId`, which is an auto-incremented number. This means that the same query can yield query plans with different expression ids in different runs. `AttributeSet` and `ExpressionSet` internally use a `HashSet` as the underlying data structure, and therefore cannot guarantee a fixed element order across runs. This can be problematic in cases where we want to check for plan changes across runs.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Passes `PlanStabilitySuite` after regenerating the golden files.

Closes #29598 from dbaliafroozeh/FixOrderOfExpressions.

Authored-by: Ali Afroozeh <ali.afroozeh@databricks.com>
Signed-off-by: herman <herman@databricks.com>
2020-09-03 13:56:03 +02:00
Yuanjian Li 95f1e9549b [SPARK-32782][SS] Refactor StreamingRelationV2 and move it to catalyst
### What changes were proposed in this pull request?
Move StreamingRelationV2 to the catalyst module and bind with the Table interface.

### Why are the changes needed?
Currently, StreamingRelationV2 is bound to TableProvider. Since the V2 relation is not bound to a `DataSource`, to make it more flexible and give it better extensibility, it should be moved to the catalyst module and bound to the Table interface. We did a similar thing for DataSourceV2Relation.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing UT.

Closes #29633 from xuanyuanking/SPARK-32782.

Authored-by: Yuanjian Li <yuanjian.li@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-09-03 16:04:36 +09:00
Kent Yao 1fba286407 [SPARK-32781][SQL] Non-ASCII characters are mistakenly omitted in the middle of intervals
### What changes were proposed in this pull request?

This PR makes interval value parsing fail when the value contains non-ASCII characters, which are silently omitted right now.

e.g. the case below should be invalid

```
select interval 'interval中文 1 day'
```

### Why are the changes needed?

This is a bugfix: interval parsing should fail when the value contains invalid characters.

### Does this PR introduce _any_ user-facing change?

yes,

#### before

`select interval 'interval中文 1 day'` resulted in 1 day; now it fails with

```
org.apache.spark.sql.catalyst.parser.ParseException

Cannot parse the INTERVAL value: interval中文 1 day
```

### How was this patch tested?

new tests

Closes #29632 from yaooqinn/SPARK-32781.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-09-03 04:56:40 +00:00
angerszhu 5e6173ebef [SPARK-31670][SQL] Trim unnecessary Struct field alias in Aggregate/GroupingSets
### What changes were proposed in this pull request?
A struct field that appears both in GROUP BY and in an aggregate expression with CUBE/ROLLUP/GROUPING SETS fails during analysis.

```
test("SPARK-31670") {
  withTable("t1") {
      sql(
        """
          |CREATE TEMPORARY VIEW t(a, b, c) AS
          |SELECT * FROM VALUES
          |('A', 1, NAMED_STRUCT('row_id', 1, 'json_string', '{"i": 1}')),
          |('A', 2, NAMED_STRUCT('row_id', 2, 'json_string', '{"i": 1}')),
          |('A', 2, NAMED_STRUCT('row_id', 2, 'json_string', '{"i": 2}')),
          |('B', 1, NAMED_STRUCT('row_id', 3, 'json_string', '{"i": 1}')),
          |('C', 3, NAMED_STRUCT('row_id', 4, 'json_string', '{"i": 1}'))
        """.stripMargin)

      checkAnswer(
        sql(
          """
            |SELECT a, c.json_string, SUM(b)
            |FROM t
            |GROUP BY a, c.json_string
            |WITH CUBE
            |""".stripMargin),
        Row("A", "{\"i\": 1}", 3) :: Row("A", "{\"i\": 2}", 2) :: Row("A", null, 5) ::
          Row("B", "{\"i\": 1}", 1) :: Row("B", null, 1) ::
          Row("C", "{\"i\": 1}", 3) :: Row("C", null, 3) ::
          Row(null, "{\"i\": 1}", 7) :: Row(null, "{\"i\": 2}", 2) :: Row(null, null, 9) :: Nil)

  }
}
```
Error 
```
[info] - SPARK-31670 *** FAILED *** (2 seconds, 857 milliseconds)
[info]   Failed to analyze query: org.apache.spark.sql.AnalysisException: expression 't.`c`' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;;
[info]   Aggregate [a#247, json_string#248, spark_grouping_id#246L], [a#247, c#223.json_string AS json_string#241, sum(cast(b#222 as bigint)) AS sum(b)#243L]
[info]   +- Expand [List(a#221, b#222, c#223, a#244, json_string#245, 0), List(a#221, b#222, c#223, a#244, null, 1), List(a#221, b#222, c#223, null, json_string#245, 2), List(a#221, b#222, c#223, null, null, 3)], [a#221, b#222, c#223, a#247, json_string#248, spark_grouping_id#246L]
[info]      +- Project [a#221, b#222, c#223, a#221 AS a#244, c#223.json_string AS json_string#245]
[info]         +- SubqueryAlias t
[info]            +- Project [col1#218 AS a#221, col2#219 AS b#222, col3#220 AS c#223]
[info]               +- Project [col1#218, col2#219, col3#220]
[info]                  +- LocalRelation [col1#218, col2#219, col3#220]
[info]
```
When a struct-type field is resolved, it is constructed with an Alias. When the struct field is in GROUP BY with CUBE/ROLLUP etc., the struct field in the grouping expressions and in the aggregate expressions will be resolved with different exprIds, as below
```
'Aggregate [cube(a#221, c#223.json_string AS json_string#240)], [a#221, c#223.json_string AS json_string#241, sum(cast(b#222 as bigint)) AS sum(b)#243L]
+- SubqueryAlias t
   +- Project [col1#218 AS a#221, col2#219 AS b#222, col3#220 AS c#223]
      +- Project [col1#218, col2#219, col3#220]
         +- LocalRelation [col1#218, col2#219, col3#220]
```
This makes `ResolveGroupingAnalytics.constructAggregateExprs()` fail to replace the aggregate expression with the expanded grouping expression attribute, since their exprIds are not the same, and then the error happens.

### Why are the changes needed?
Fix an analysis bug.

### Does this PR introduce _any_ user-facing change?
NO

### How was this patch tested?
Added UT

Closes #28490 from AngersZhuuuu/SPARK-31670.

Lead-authored-by: angerszhu <angers.zhu@gmail.com>
Co-authored-by: AngersZhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-09-02 13:49:09 +00:00
Zhenhua Wang 03afbc8820 [SPARK-32739][SQL] Support prune right for left semi join in DPP
### What changes were proposed in this pull request?

Currently in DPP, a left semi join can only prune the left side; this PR makes it also support pruning the right side.

### Why are the changes needed?

A minor improvement for DPP.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Add a test case.

Closes #29582 from wzhfy/dpp_support_leftsemi_pruneRight.

Authored-by: Zhenhua Wang <wzh_zju@163.com>
Signed-off-by: Yuming Wang <yumwang@ebay.com>
2020-09-02 21:34:49 +08:00
Karol Chmist 7511e43c50 [SPARK-32756][SQL] Fix CaseInsensitiveMap usage for Scala 2.13
### What changes were proposed in this pull request?

This is a follow-up of #29160. This allows Spark SQL project to compile for Scala 2.13.

### Why are the changes needed?

It's needed for #28545

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

I compiled with Scala 2.13. It fails in `Spark REPL` project, which will be fixed by #28545

Closes #29584 from karolchmist/SPARK-32364-scala-2.13.

Authored-by: Karol Chmist <info+github@chmist.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-09-02 08:27:00 -05:00
angerszhu 55ce49ed28 [SPARK-32400][SQL][TEST][FOLLOWUP][TEST-MAVEN] Fix resource loading error in HiveScripTransformationSuite
### What changes were proposed in this pull request?
#29401 moved `test_script.py` from the sql/hive module to the sql/core module, which caused a resource loading issue in `HiveScriptTransformationSuite`.

### Why are the changes needed?
This issue causes Jenkins test failures in the Maven builds:

spark-master-test-maven-hadoop-2.7-hive-2.3-jdk-11: https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7-hive-2.3-jdk-11/
spark-master-test-maven-hadoop-3.2-hive-2.3-jdk-11:
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-3.2-hive-2.3-jdk-11/
spark-master-test-maven-hadoop-3.2-hive-2.3:
https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-3.2-hive-2.3/
![image](https://user-images.githubusercontent.com/46485123/91681585-71285a80-eb81-11ea-8519-99fc9783d6b9.png)

![image](https://user-images.githubusercontent.com/46485123/91681010-aaf86180-eb7f-11ea-8dbb-61365a3b0ab4.png)

Error as below:
```
 Exception thrown while executing Spark plan:
 HiveScriptTransformation [a#349299, b#349300, c#349301, d#349302, e#349303], python /home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-hive-2.3-jdk-11/sql/hive/file:/home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-hive-2.3-jdk-11/sql/core/target/spark-sql_2.12-3.1.0-SNAPSHOT-tests.jar!/test_script.py, [a#349309, b#349310, c#349311, d#349312, e#349313], ScriptTransformationIOSchema(List(),List(),Some(org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe),Some(org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe),List((field.delim, )),List((field.delim, )),Some(org.apache.hadoop.hive.ql.exec.TextRecordReader),Some(org.apache.hadoop.hive.ql.exec.TextRecordWriter),false)
+- Project [_1#349288 AS a#349299, _2#349289 AS b#349300, _3#349290 AS c#349301, _4#349291 AS d#349302, _5#349292 AS e#349303]
   +- LocalTableScan [_1#349288, _2#349289, _3#349290, _4#349291, _5#349292]

 == Exception ==
 org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 18021.0 failed 1 times, most recent failure: Lost task 0.0 in stage 18021.0 (TID 37324) (192.168.10.31 executor driver): org.apache.spark.SparkException: Subprocess exited with status 2. Error: python: can't open file '/home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-hive-2.3-jdk-11/sql/hive/file:/home/jenkins/workspace/spark-master-test-maven-hadoop-2.7-hive-2.3-jdk-11/sql/core/target/spark-sql_2.12-3.1.0-SNAPSHOT-tests.jar!/test_script.py': [Errno 2] No such file or directory

 at org.apache.spark.sql.execution.BaseScriptTransformationExec.checkFailureAndPropagate(BaseScriptTransformationExec.scala:180)
 at org.apache.spark.sql.execution.BaseScriptTransformationExec.checkFailureAndPropagate$(BaseScriptTransformationExec.scala:157)
 at org.apache.spark.sql.hive.execution.HiveScriptTransformationExec.checkFailureAndPropagate(HiveScriptTransformationExec.scala:49)
 at org.apache.spark.sql.hive.execution.HiveScriptTransformationExec$$anon$1.hasNext(HiveScriptTransformationExec.scala:110)
 at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
 at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:340)
 at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
 at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
 at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
 at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
 at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
 at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
 at org.apache.spark.scheduler.Task.run(Task.scala:127)
 at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:480)
 at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1426)
 at o
```
### Does this PR introduce _any_ user-facing change?
NO

### How was this patch tested?
Existed UT

Closes #29588 from AngersZhuuuu/SPARK-32400-FOLLOWUP.

Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-09-02 18:27:29 +09:00
liwensun f0851e95c6 [SPARK-32776][SS] Limit in streaming should not be optimized away by PropagateEmptyRelation
### What changes were proposed in this pull request?

PropagateEmptyRelation will not be applied to LIMIT operators in streaming queries.

### Why are the changes needed?

Right now, the limit operator in a streaming query may get optimized away when the relation is empty. This can be problematic for stateful streaming, as this empty batch will not write any state store files, and the next batch will fail when trying to read these state store files and throw a file not found error.

We should not let PropagateEmptyRelation optimize away the Limit operator for streaming queries.

This PR is intended as a small and safe fix for PropagateEmptyRelation. A fundamental fix that can prevent this from happening again in the future and in other optimizer rules is more desirable, but that's a much larger task.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
unit tests.

Closes #29623 from liwensun/spark-32776.

Authored-by: liwensun <liwen.sun@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-09-02 18:05:06 +09:00
Yuming Wang 54348dbd21 [SPARK-32767][SQL] Bucket join should work if spark.sql.shuffle.partitions larger than bucket number
### What changes were proposed in this pull request?

Bucket join should work if `spark.sql.shuffle.partitions` is larger than the bucket number, such as:
```scala
spark.range(1000).write.bucketBy(432, "id").saveAsTable("t1")
spark.range(1000).write.bucketBy(34, "id").saveAsTable("t2")
sql("set spark.sql.shuffle.partitions=600")
sql("set spark.sql.autoBroadcastJoinThreshold=-1")
sql("select * from t1 join t2 on t1.id = t2.id").explain()
```

Before this pr:
```
== Physical Plan ==
*(5) SortMergeJoin [id#26L], [id#27L], Inner
:- *(2) Sort [id#26L ASC NULLS FIRST], false, 0
:  +- Exchange hashpartitioning(id#26L, 600), true
:     +- *(1) Filter isnotnull(id#26L)
:        +- *(1) ColumnarToRow
:           +- FileScan parquet default.t1[id#26L] Batched: true, DataFilters: [isnotnull(id#26L)], Format: Parquet, PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: struct<id:bigint>, SelectedBucketsCount: 432 out of 432
+- *(4) Sort [id#27L ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(id#27L, 600), true
      +- *(3) Filter isnotnull(id#27L)
         +- *(3) ColumnarToRow
            +- FileScan parquet default.t2[id#27L] Batched: true, DataFilters: [isnotnull(id#27L)], Format: Parquet, PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: struct<id:bigint>, SelectedBucketsCount: 34 out of 34
```

After this pr:
```
== Physical Plan ==
*(4) SortMergeJoin [id#26L], [id#27L], Inner
:- *(1) Sort [id#26L ASC NULLS FIRST], false, 0
:  +- *(1) Filter isnotnull(id#26L)
:     +- *(1) ColumnarToRow
:        +- FileScan parquet default.t1[id#26L] Batched: true, DataFilters: [isnotnull(id#26L)], Format: Parquet, PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: struct<id:bigint>, SelectedBucketsCount: 432 out of 432
+- *(3) Sort [id#27L ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(id#27L, 432), true
      +- *(2) Filter isnotnull(id#27L)
         +- *(2) ColumnarToRow
            +- FileScan parquet default.t2[id#27L] Batched: true, DataFilters: [isnotnull(id#27L)], Format: Parquet, PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: struct<id:bigint>, SelectedBucketsCount: 34 out of 34
```

### Why are the changes needed?

Spark 2.4 supports this.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test.

Closes #29612 from wangyum/SPARK-32767.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-09-02 04:16:20 +00:00
Kousuke Saruta 812d0918a8 [SPARK-32771][DOCS] The example of expressions.Aggregator in Javadoc / Scaladoc is wrong
### What changes were proposed in this pull request?

This PR modifies an example for `expressions.Aggregator` in Javadoc and Scaladoc.
The definitions of `bufferEncoder` and `outputEncoder` are added.

### Why are the changes needed?

To correct the example.
The current example is wrong and doesn't work because `bufferEncoder` and `outputEncoder` are not defined.
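
For reference, a minimal sketch of what a complete `Aggregator` looks like once the two encoder members are defined; the sum-of-longs aggregator below is illustrative, not the exact example from the docs:
```scala
import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator

object SumLong extends Aggregator[Long, Long, Long] {
  def zero: Long = 0L
  def reduce(buffer: Long, value: Long): Long = buffer + value
  def merge(b1: Long, b2: Long): Long = b1 + b2
  def finish(reduction: Long): Long = reduction
  // These two members were missing from the old doc example.
  def bufferEncoder: Encoder[Long] = Encoders.scalaLong
  def outputEncoder: Encoder[Long] = Encoders.scalaLong
}

// usage sketch: spark.range(10).as[Long].select(SumLong.toColumn).show()
```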

### Does this PR introduce _any_ user-facing change?

Yes.
Before this change, the Scaladoc and Javadoc look as follows.
![wrong-example-java](https://user-images.githubusercontent.com/4736016/91897528-5ebf3580-ecd5-11ea-8d7b-e846b776ebbb.png)
![wrong-example](https://user-images.githubusercontent.com/4736016/91897509-58c95480-ecd5-11ea-81a3-98774083b689.png)

After this change, the docs look as follows.
![fixed-example-java](https://user-images.githubusercontent.com/4736016/91897592-78607d00-ecd5-11ea-9e55-03fd9c9c6b54.png)
![fixed-example](https://user-images.githubusercontent.com/4736016/91897609-7c8c9a80-ecd5-11ea-837e-9dbcada6cd53.png)

### How was this patch tested?

Build with `build/sbt unidoc` and confirmed the generated javadoc/scaladoc and got the screenshots above.

Closes #29617 from sarutak/fix-aggregator-doc.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-09-02 10:03:07 +09:00
Linhong Liu a410658c9b [SPARK-32761][SQL] Allow aggregating multiple foldable distinct expressions
### What changes were proposed in this pull request?
For queries with multiple foldable distinct columns, since they will be eliminated during
execution, it's not mandatory to let `RewriteDistinctAggregates` handle this case. And
in the current code, `RewriteDistinctAggregates` *does* miss some "aggregating with
multiple foldable distinct expressions" cases.
For example: `select count(distinct 2), count(distinct 2, 3)` will be missed.

But in the planner, this will trigger an error that "multiple distinct expressions" are not allowed.
As the foldable distinct columns will eventually be eliminated, we can allow this case in the aggregation planner check.
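
For reference, a hedged sketch of the previously failing query shape (the FROM clause is illustrative, assuming an existing `spark` session):
```scala
// Both distinct aggregates use only foldable (constant) expressions, so they
// are eliminated before execution and should not trip the planner check.
spark.sql("SELECT count(DISTINCT 2), count(DISTINCT 2, 3) FROM range(10)").show()
```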

### Why are the changes needed?
bug fix

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
added test case

Closes #29607 from linhongliu-db/SPARK-32761.

Authored-by: Linhong Liu <linhong.liu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-09-01 13:04:24 +00:00
Wenchen Fan fea9360ae7 [SPARK-32757][SQL][FOLLOW-UP] Use child's output for canonicalization in SubqueryBroadcastExec
### What changes were proposed in this pull request?

This is a followup of https://github.com/apache/spark/pull/29601 , to fix a small mistake in `SubqueryBroadcastExec`. `SubqueryBroadcastExec.doCanonicalize` should canonicalize the build keys with the query output, not the `SubqueryBroadcastExec.output`.

### Why are the changes needed?

fix mistake

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

existing test

Closes #29610 from cloud-fan/follow.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-09-01 12:54:40 +00:00
Huaxin Gao e1dbc85c72 [SPARK-32579][SQL] Implement JDBCScan/ScanBuilder/WriteBuilder
### What changes were proposed in this pull request?
Add JDBCScan, JDBCScanBuilder, JDBCWriteBuilder in Datasource V2 JDBC

### Why are the changes needed?
Complete Datasource V2 JDBC implementation
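
A hedged usage sketch of how reads would go through the new scan builder via the v2 JDBC catalog; the catalog name "h2", the URL, and the fully qualified catalog class name are assumptions for illustration only:
```scala
spark.conf.set(
  "spark.sql.catalog.h2",
  "org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCTableCatalog")  // assumed class name
spark.conf.set("spark.sql.catalog.h2.url", "jdbc:h2:mem:testdb")          // illustrative URL
spark.conf.set("spark.sql.catalog.h2.driver", "org.h2.Driver")

// Resolved through the v2 JDBC catalog, exercising JDBCScanBuilder/JDBCScan.
spark.sql("SELECT * FROM h2.test.people").show()
```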

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
new tests

Closes #29396 from huaxingao/v2jdbc.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-09-01 07:23:20 +00:00
Wenchen Fan d2a5dad97c [SPARK-32757][SQL] Physical InSubqueryExec should be consistent with logical InSubquery
### What changes were proposed in this pull request?

`InSubquery` can be either single-column mode, or multi-column mode, depending on the output length of the subquery. For multi-column mode, the length of input `values` must match the subquery output length.
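
For context, a hedged sketch of the two query shapes involved (tables and columns illustrative, assuming an existing `spark` session):
```scala
// Single-column IN subquery: the subquery returns one column.
spark.sql("SELECT * FROM range(10) WHERE id IN (SELECT id FROM range(5))").show()

// Multi-column IN subquery: the value tuple length must match the subquery output length.
spark.sql(
  "SELECT * FROM range(10) WHERE (id, id + 1) IN (SELECT id, id + 1 FROM range(5))").show()
```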

However, `InSubqueryExec` doesn't follow this and is always executed in single-column mode. That's OK because it's only used by DPP, which looks up one key per `InSubqueryExec`, so multi-column mode is not needed. Still, it's better to make the physical and logical nodes consistent.

This PR updates `InSubqueryExec` to support multi-column mode, and also fix `SubqueryBroadcastExec` to report output correctly.

### Why are the changes needed?

Fix a potential bug.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

existing tests

Closes #29601 from cloud-fan/follow.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-09-01 07:19:43 +00:00
Yuming Wang a701bc79e3 [SPARK-32659][SQL][FOLLOWUP] Improve test for pruning DPP on non-atomic type
### What changes were proposed in this pull request?

Improve test for pruning DPP on non-atomic type:
- Avoid creating new partition tables, which may take about 30 seconds.
- Add a test for the `array` type.

### Why are the changes needed?

Improve test.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

N/A

Closes #29595 from wangyum/SPARK-32659-test.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-09-01 05:51:04 +00:00
Huaxin Gao 806140de40 [SPARK-32592][SQL] Make DataFrameReader.table take the specified options
### What changes were proposed in this pull request?
Pass the options specified in `DataFrameReader.table` to `JDBCTableCatalog.loadTable`.

### Why are the changes needed?
Currently, `DataFrameReader.table` ignores the specified options. The options specified like the following are lost.
```
    val df = spark.read
      .option("partitionColumn", "id")
      .option("lowerBound", "0")
      .option("upperBound", "3")
      .option("numPartitions", "2")
      .table("h2.test.people")
```
We need to make `DataFrameReader.table` take the specified options.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Manually test for now. Will add a test after V2 JDBC read is implemented.

Closes #29535 from huaxingao/table_options.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-08-31 13:21:15 +00:00
Cheng Su ce473b223a [SPARK-32740][SQL] Refactor common partitioning/distribution logic to BaseAggregateExec
### What changes were proposed in this pull request?

All three aggregate physical operators, `HashAggregateExec`, `ObjectHashAggregateExec` and `SortAggregateExec`, have the same `outputPartitioning` and `requiredChildDistribution` logic. Refactor this shared logic into their superclass `BaseAggregateExec` to avoid code duplication and future bugs (similar to `HashJoin` and `ShuffledJoin`).

### Why are the changes needed?

Reduce duplicated code across classes and prevent future bugs if we only update one class but forget another. We already did similar refactoring for join (`HashJoin` and `ShuffledJoin`).

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing unit tests as this is pure refactoring and no new logic added.

Closes #29583 from c21/aggregate-refactor.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-08-31 15:43:13 +09:00
Udbhav30 065f17386d [SPARK-32481][CORE][SQL] Support truncate table to move data to trash
### What changes were proposed in this pull request?
Instead of deleting the data, we can move the data to trash.
Based on the configuration provided by the user it will be deleted permanently from the trash.

### Why are the changes needed?
Instead of directly deleting the data, we can provide flexibility to move data to the trash and then delete it permanently.

### Does this PR introduce _any_ user-facing change?
Yes. After truncating a table, the data is no longer permanently deleted right away.
It is first moved to the trash and then, after the configured time, deleted permanently.

### How was this patch tested?
new UTs added

Closes #29552 from Udbhav30/truncate.

Authored-by: Udbhav30 <u.agrawal30@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-08-30 10:25:32 -07:00
Cheng Su cfe012a431 [SPARK-32629][SQL] Track metrics of BitSet/OpenHashSet in full outer SHJ
### What changes were proposed in this pull request?

This is a follow-up of https://github.com/apache/spark/pull/29342, which does two things:
* Per https://github.com/apache/spark/pull/29342#discussion_r470153323, change from java `HashSet` to spark in-house `OpenHashSet` to track matched rows for non-unique join keys. I checked `OpenHashSet` implementation which is built from a key index (`OpenHashSet._bitset` as `BitSet`) and key array (`OpenHashSet._data` as `Array`). Java `HashSet` is built from `HashMap`, which stores value in `Node` linked list and by theory should have taken more memory than `OpenHashSet`. Reran the same benchmark query used in https://github.com/apache/spark/pull/29342, and verified the query has similar performance here between `HashSet` and `OpenHashSet`.
* Track metrics of the extra data structure `BitSet`/`OpenHashSet` for full outer SHJ. This depends on above thing, because there seems no easy way to get java `HashSet` memory size.

### Why are the changes needed?

To better surface the memory usage for full outer SHJ more accurately.
This can help users/developers to debug/improve full outer SHJ.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added a unit test in `SQLMetricsSuite.scala`.

Closes #29566 from c21/add-metrics.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-08-30 07:01:33 +09:00
Wenchen Fan ccc0250a08 [SPARK-32718][SQL] Remove unnecessary keywords for interval units
### What changes were proposed in this pull request?

Remove the YEAR, MONTH, DAY, HOUR, MINUTE, SECOND keywords. They are not useful in the parser, as we need to support plural like YEARS, so the parser has to accept the general identifier as interval unit anyway.

### Why are the changes needed?

These keywords are reserved in ANSI. If Spark has these keywords, then they become reserved under ANSI mode. This makes Spark not able to run TPCDS queries as they use YEAR as alias name.
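
A hedged illustration of the effect (assuming an existing `spark` session): with ANSI mode enabled, `YEAR` can still be used as an alias because it is no longer a keyword in the parser.
```scala
spark.conf.set("spark.sql.ansi.enabled", "true")
// Would have failed if YEAR were a reserved keyword under ANSI mode.
spark.sql("SELECT 2020 AS year").show()
```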

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Added `TPCDSQueryANSISuite`, to make sure Spark with ANSI mode can run TPCDS queries.

Closes #29560 from cloud-fan/keyword.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-08-29 14:06:01 -07:00
Chen Zhang 58f87b3178 [SPARK-32639][SQL] Support GroupType parquet mapkey field
### What changes were proposed in this pull request?
Remove the assertion in ParquetSchemaConverter that the parquet mapKey field must be PrimitiveType.

### Why are the changes needed?
There is a parquet file in the attachment of [SPARK-32639](https://issues.apache.org/jira/browse/SPARK-32639), and the MessageType recorded in the file is:
```
message parquet_schema {
  optional group value (MAP) {
    repeated group key_value {
      required group key {
        optional binary first (UTF8);
        optional binary middle (UTF8);
        optional binary last (UTF8);
      }
      optional binary value (UTF8);
    }
  }
}
```

Use `spark.read.parquet("000.snappy.parquet")` to read the file. Spark will throw an exception when converting Parquet MessageType to Spark SQL StructType:

> AssertionError(Map key type is expected to be a primitive type, but found...)

Using `spark.read.schema("value MAP<STRUCT<first:STRING, middle:STRING, last:STRING>, STRING>").parquet("000.snappy.parquet")` to read the file, Spark returns the correct result.

According to the parquet project document (https://github.com/apache/parquet-format/blob/master/LogicalTypes.md#maps), the mapKey in the parquet format does not need to be a primitive type.

Note: This parquet file is not written by spark, because spark will write additional sparkSchema string information in the parquet file. When Spark reads, it will directly use the additional sparkSchema information in the file instead of converting Parquet MessageType to Spark SQL StructType.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added a unit test case

Closes #29451 from izchen/SPARK-32639.

Authored-by: Chen Zhang <izchen@126.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-08-28 16:51:00 +00:00
Takeshi Yamamuro 0cb91b8c18 [SPARK-32704][SQL] Logging plan changes for execution
### What changes were proposed in this pull request?

Since we only log plan changes for analyzer/optimizer now, this PR intends to add code to log plan changes in the preparation phase in `QueryExecution` for execution.
```
scala> spark.sql("SET spark.sql.optimizer.planChangeLog.level=WARN")
scala> spark.range(10).groupBy("id").count().queryExecution.executedPlan
...
20/08/26 09:32:36 WARN PlanChangeLogger:
=== Applying Rule org.apache.spark.sql.execution.CollapseCodegenStages ===
!HashAggregate(keys=[id#19L], functions=[count(1)], output=[id#19L, count#23L])              *(1) HashAggregate(keys=[id#19L], functions=[count(1)], output=[id#19L, count#23L])
!+- HashAggregate(keys=[id#19L], functions=[partial_count(1)], output=[id#19L, count#27L])   +- *(1) HashAggregate(keys=[id#19L], functions=[partial_count(1)], output=[id#19L, count#27L])
!   +- Range (0, 10, step=1, splits=4)                                                          +- *(1) Range (0, 10, step=1, splits=4)

20/08/26 09:32:36 WARN PlanChangeLogger:
=== Result of Batch Preparations ===
!HashAggregate(keys=[id#19L], functions=[count(1)], output=[id#19L, count#23L])              *(1) HashAggregate(keys=[id#19L], functions=[count(1)], output=[id#19L, count#23L])
!+- HashAggregate(keys=[id#19L], functions=[partial_count(1)], output=[id#19L, count#27L])   +- *(1) HashAggregate(keys=[id#19L], functions=[partial_count(1)], output=[id#19L, count#27L])
!   +- Range (0, 10, step=1, splits=4)                                                          +- *(1) Range (0, 10, step=1, splits=4)
```

### Why are the changes needed?

Easy debugging for executed plans

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added unit tests.

Closes #29544 from maropu/PlanLoggingInPreparations.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-08-28 16:35:47 +00:00
yi.wu c3b9404253 [SPARK-32717][SQL] Add a AQEOptimizer for AdaptiveSparkPlanExec
### What changes were proposed in this pull request?

This PR proposes to add a specific `AQEOptimizer` for the `AdaptiveSparkPlanExec` instead of implementing an anonymous `RuleExecutor`. At the same time, this PR also adds the configuration `spark.sql.adaptive.optimizer.excludedRules`, which follows the same pattern of `Optimizer`, to make the `AQEOptimizer` more flexible for users and developers.
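
A hedged usage sketch of the new configuration (assuming an existing `spark` session); the excluded rule name below is only an illustrative assumption of one of the AQE logical optimizer rules:
```scala
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set(
  "spark.sql.adaptive.optimizer.excludedRules",
  "org.apache.spark.sql.execution.adaptive.DemoteBroadcastHashJoin")  // assumed rule name
```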

### Why are the changes needed?

Currently, `AdaptiveSparkPlanExec` has implemented an anonymous `RuleExecutor` to apply the AQE optimize rules on the plan. However, the anonymous class usually could be inconvenient to maintain and extend for the long term.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

It's a pure refactor so pass existing tests should be ok.

Closes #29559 from Ngone51/impro-aqe-optimizer.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-08-28 21:23:53 +09:00
Jungtaek Lim (HeartSaVioR) 73bfed3633 [SPARK-28612][SQL][FOLLOWUP] Correct method doc of DataFrameWriterV2.replace()
### What changes were proposed in this pull request?

This patch corrects the method doc of DataFrameWriterV2.replace(), in which the explanation of the exception was described in reverse.

### Why are the changes needed?

The method doc is incorrect.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Only doc change.

Closes #29568 from HeartSaVioR/SPARK-28612-FOLLOWUP-fix-doc-nit.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-08-28 15:14:57 +09:00
Liang-Chi Hsieh d6c095c92c [SPARK-32693][SQL] Compare two dataframes with same schema except nullable property
### What changes were proposed in this pull request?

This PR changes key data types check in `HashJoin` to use `sameType`.

### Why are the changes needed?

Looking at the resolving condition of `SetOperation`, it only requires each left data type to be `sameType` as the corresponding right one. Logically, the `EqualTo` expression in an equi-join also only requires the left data type to be `sameType` as the right data type. Yet `HashJoin` requires the left key data types to be exactly the same as the right key data types, which does not look reasonable.

This leads to inconsistent results when doing `except` between two dataframes.

If two dataframes don't have nested fields, `HashJoin` passes the key type check even when their field nullable properties differ, because it checks fields individually and the nullable property is ignored.

If two dataframes have nested fields like structs, `HashJoin` fails the key type check because it now compares two struct types and the nullable property affects the comparison.
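
Below is a hedged reconstruction of the scenario (schemas and data are illustrative, assuming an existing `spark` session): two DataFrames whose schemas differ only in the nullability of a nested struct field.
```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.types._

val nullableSchema = StructType(Seq(StructField("s",
  StructType(Seq(StructField("id", LongType, nullable = true))))))
val strictSchema = StructType(Seq(StructField("s",
  StructType(Seq(StructField("id", LongType, nullable = false))))))

val left = spark.createDataFrame(
  spark.sparkContext.parallelize(Seq(Row(Row(1L)), Row(Row(2L)))), nullableSchema)
val right = spark.createDataFrame(
  spark.sparkContext.parallelize(Seq(Row(Row(1L)))), strictSchema)

// With this change, the nested nullability difference is tolerated by the join key check.
left.except(right).show()
```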

### Does this PR introduce _any_ user-facing change?

Yes. Making consistent `except` operation between dataframes.

### How was this patch tested?

Unit test.

Closes #29555 from viirya/SPARK-32693.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-08-28 10:32:23 +09:00
xuewei.linxuewei eb379766f4 [SPARK-32705][SQL] Fix serialization issue for EmptyHashedRelation
### What changes were proposed in this pull request?
Currently, EmptyHashedRelation and HashedRelationWithAllNullKeys are plain objects, which causes a Java deserialization exception like the following:

```
20/08/26 11:13:30 WARN [task-result-getter-2] TaskSetManager: Lost task 34.0 in stage 57.0 (TID 18076, emr-worker-5.cluster-183257, executor 18): java.io.InvalidClassException: org.apache.spark.sql.execution.joins.EmptyHashedRelation$; no valid constructor
        at java.io.ObjectStreamClass$ExceptionInfo.newInvalidClassException(ObjectStreamClass.java:169)
        at java.io.ObjectStreamClass.checkDeserialize(ObjectStreamClass.java:874)
        at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2042)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1572)
        at java.io.ObjectInputStream.readObject(ObjectInputStream.java:430)
        at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:76)
        at org.apache.spark.broadcast.TorrentBroadcast$.$anonfun$unBlockifyObject$4(TorrentBroadcast.scala:328)
```

This PR includes

* Using case object instead to fix serialization issue.
* Also change EmptyHashedRelation not to extend NullAwareHashedRelation since it's already being used in other non-NAAJ joins.

### Why are the changes needed?
Without the fix, BHJ fails when the build side is empty, and BHJ (NAAJ) fails when the build side contains null partition keys.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
* Existing UT.
* Run entire TPCDS for E2E coverage.

Closes #29547 from leanken/leanken-SPARK-32705.

Authored-by: xuewei.linxuewei <xuewei.linxuewei@alibaba-inc.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-08-27 06:24:42 +00:00
Terry Kim baaa756dee [SPARK-32516][SQL][FOLLOWUP] 'path' option cannot coexist with path parameter for DataFrameWriter.save(), DataStreamReader.load() and DataStreamWriter.start()
### What changes were proposed in this pull request?

This is a follow up PR to #29328 to apply the same constraint where `path` option cannot coexist with path parameter to `DataFrameWriter.save()`, `DataStreamReader.load()` and `DataStreamWriter.start()`.

### Why are the changes needed?

The current behavior silently overwrites the `path` option if path parameter is passed to `DataFrameWriter.save()`, `DataStreamReader.load()` and `DataStreamWriter.start()`.

For example,
```
Seq(1).toDF.write.option("path", "/tmp/path1").parquet("/tmp/path2")
```
will write the result to `/tmp/path2`.

### Does this PR introduce _any_ user-facing change?

Yes, if `path` option coexists with path parameter to any of the above methods, it will throw `AnalysisException`:
```
scala> Seq(1).toDF.write.option("path", "/tmp/path1").parquet("/tmp/path2")
org.apache.spark.sql.AnalysisException: There is a 'path' option set and save() is called with a  path parameter. Either remove the path option, or call save() without the parameter. To ignore this check, set 'spark.sql.legacy.pathOptionBehavior.enabled' to 'true'.;
```

The user can restore the previous behavior by setting `spark.sql.legacy.pathOptionBehavior.enabled` to `true`.

### How was this patch tested?

Added new tests.

Closes #29543 from imback82/path_option.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-08-27 06:21:04 +00:00
Dongjoon Hyun 2dee4352a0 Revert "[SPARK-32481][CORE][SQL] Support truncate table to move data to trash"
This reverts commit 5c077f0580.
2020-08-26 11:24:35 -07:00
Yuming Wang a8b568800e [SPARK-32659][SQL] Fix the data issue when pruning DPP on non-atomic type
### What changes were proposed in this pull request?

Use `InSet` expression to fix data issue when pruning DPP on non-atomic type. for example:
   ```scala
    spark.range(1000)
    .select(col("id"), col("id").as("k"))
    .write
    .partitionBy("k")
    .format("parquet")
    .mode("overwrite")
    .saveAsTable("df1");

   spark.range(100)
   .select(col("id"), col("id").as("k"))
   .write
   .partitionBy("k")
   .format("parquet")
   .mode("overwrite")
   .saveAsTable("df2")

   spark.sql("set spark.sql.optimizer.dynamicPartitionPruning.fallbackFilterRatio=2")
   spark.sql("set spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcastOnly=false")
   spark.sql("SELECT df1.id, df2.k FROM df1 JOIN df2 ON struct(df1.k) = struct(df2.k) AND df2.id < 2").show
   ```
   It should return two records, but it returns empty.

### Why are the changes needed?

Fix data issue

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Add new unit test.

Closes #29475 from wangyum/SPARK-32659.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-08-26 06:57:43 +00:00
Udbhav30 5c077f0580 [SPARK-32481][CORE][SQL] Support truncate table to move data to trash
### What changes were proposed in this pull request?
Instead of deleting the data, we can move the data to trash.
Based on the configuration provided by the user it will be deleted permanently from the trash.

### Why are the changes needed?
Instead of directly deleting the data, we can provide flexibility to move data to the trash and then delete it permanently.

### Does this PR introduce _any_ user-facing change?
Yes. After truncating a table, the data is no longer permanently deleted right away.
It is first moved to the trash and then, after the configured time, deleted permanently.

### How was this patch tested?
new UTs added

Closes #29387 from Udbhav30/tuncateTrash.

Authored-by: Udbhav30 <u.agrawal30@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-08-25 23:38:43 -07:00
yi.wu f510d21e93 [SPARK-32466][FOLLOW-UP][TEST][SQL] Regenerate the golden explain file for PlanStabilitySuite
### What changes were proposed in this pull request?

This PR regenerates the golden explain file based on the fix: https://github.com/apache/spark/pull/29537

### Why are the changes needed?

Eliminates the personal related information (e.g., local directories) in the explain plan.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Checked manually.

Closes #29546 from Ngone51/follow-up-gen-golden-file.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-08-26 14:46:14 +09:00
Yuming Wang 1354cf0842 [SPARK-32620][SQL] Reset the numPartitions metric when DPP is enabled
### What changes were proposed in this pull request?

This PR resets the `numPartitions` metric when DPP is enabled. Otherwise, it is always a [static value](18cac6a9f0/sql/core/src/main/scala/org/apache/spark/sql/execution/DataSourceScanExec.scala (L215)).

### Why are the changes needed?

Fix metric issue.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test and manual test

For [this test case](18cac6a9f0/sql/core/src/test/scala/org/apache/spark/sql/DynamicPartitionPruningSuite.scala (L252-L280)).

Before this pr:
![image](https://user-images.githubusercontent.com/5399861/90301798-9310b480-ded4-11ea-9294-49bcaba46f83.png)

After this pr:
![image](https://user-images.githubusercontent.com/5399861/90301709-0fef5e80-ded4-11ea-942d-4d45d1dd15bc.png)

Closes #29436 from wangyum/SPARK-32620.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Yuming Wang <wgyumg@gmail.com>
2020-08-25 18:46:10 -07:00
Sean Owen a9d4e60a90 [SPARK-32614][SQL] Don't apply comment processing if 'comment' unset for CSV
### What changes were proposed in this pull request?

Spark's CSV source can optionally ignore lines starting with a comment char. Some code paths check to see if it's set before applying comment logic (i.e. not set to default of `\0`), but many do not, including the one that passes the option to Univocity. This means that rows beginning with a null char were being treated as comments even when 'disabled'.

### Why are the changes needed?

To avoid dropping rows that start with a null char when this is not requested or intended. See JIRA for an example.
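
A hedged reproduction sketch (the path is illustrative, assuming an existing `spark` session and `spark.implicits._`): with no 'comment' option set, a line beginning with the NUL character must not be dropped as a comment.
```scala
import spark.implicits._

val path = "/tmp/csv-comment-demo"   // illustrative path
Seq("\u0000x,1", "y,2").toDF("value").coalesce(1).write.mode("overwrite").text(path)

// With the fix, the row starting with the NUL char is kept rather than skipped.
spark.read.csv(path).show()
```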

### Does this PR introduce _any_ user-facing change?

Nothing beyond the effect of the bug fix.

### How was this patch tested?

Existing tests plus new test case.

Closes #29516 from srowen/SPARK-32614.

Authored-by: Sean Owen <srowen@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-08-26 00:25:58 +09:00
yi.wu b78b776c9e [SPARK-32466][SQL][FOLLOW-UP] Normalize Location info in explain plan
### What changes were proposed in this pull request?

1. Extract `SQLQueryTestSuite.replaceNotIncludedMsg` to `PlanTest`.

2. Reuse `replaceNotIncludedMsg` to normalize the explain plan that generated in `PlanStabilitySuite`.

### Why are the changes needed?

This is a follow-up of https://github.com/apache/spark/pull/29270.
Eliminates the personal related information (e.g., local directories) in the explain plan.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Updated test.

Closes #29537 from Ngone51/follow-up-plan-stablity.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-08-25 21:03:44 +09:00
fqaiser94@gmail.com 3f1e56d4ca [SPARK-32641][SQL] withField + getField should return null if original struct was null
### What changes were proposed in this pull request?

There is a bug in the way the optimizer rule in `SimplifyExtractValueOps` is currently written in master branch which yields incorrect results in scenarios like the following:
```
sql("SELECT CAST(NULL AS struct<a:int,b:int>) struct_col")
.select($"struct_col".withField("d", lit(4)).getField("d").as("d"))

// currently returns this:
+---+
|d  |
+---+
|4  |
+---+

// when in fact it should return this:
+----+
|d   |
+----+
|null|
+----+
```
The changes in this PR will fix this bug.

### Why are the changes needed?

To fix the aforementioned bug. Optimizer rules should improve the performance of the query but yield exactly the same results.

### Does this PR introduce _any_ user-facing change?

Yes, this bug will no longer occur.
That said, this isn't something to be concerned about as this bug was introduced in Spark 3.1 and Spark 3.1 has yet to be released.

### How was this patch tested?

Unit tests were added. Jenkins must pass them.

Closes #29522 from fqaiser94/SPARK-32641.

Authored-by: fqaiser94@gmail.com <fqaiser94@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-08-25 04:59:37 +00:00
Liang-Chi Hsieh cee48a9661 [SPARK-32646][SQL][TEST-HADOOP2.7][TEST-HIVE1.2] ORC predicate pushdown should work with case-insensitive analysis
### What changes were proposed in this pull request?

This PR proposes to fix ORC predicate pushdown under case-insensitive analysis case. The field names in pushed down predicates don't need to match in exact letter case with physical field names in ORC files, if we enable case-insensitive analysis.

This is re-submitted for #29457.  Because #29457 has a hive-1.2 error and there were some tests failed with hive-1.2 profile at the same time, #29457 was reverted to unblock others.

### Why are the changes needed?

Currently ORC predicate pushdown doesn't work with case-insensitive analysis. A predicate "a < 0" cannot pushdown to ORC file with field name "A" under case-insensitive analysis.

But Parquet predicate pushdown works with this case. We should make ORC predicate pushdown work with case-insensitive analysis too.
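
A hedged sketch of the scenario (the path is illustrative, assuming an existing `spark` session): the ORC file stores an upper-case column name while the filter uses a lower-case name.
```scala
val path = "/tmp/orc-case-demo"   // illustrative path
spark.range(10).toDF("A").write.mode("overwrite").orc(path)

// Under the default case-insensitive analysis, "a < 5" can now be pushed down
// to the ORC reader even though the physical field is named "A".
spark.read.orc(path).where("a < 5").show()
```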

### Does this PR introduce _any_ user-facing change?

Yes, after this PR, under case-insensitive analysis, ORC predicate pushdown will work.

### How was this patch tested?

Unit tests.

Closes #29530 from viirya/fix-orc-pushdown.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-08-25 13:47:52 +09:00
Nicholas Chammas f540031419 [SPARK-31000][PYTHON][SQL] Add ability to set table description via Catalog.createTable()
### What changes were proposed in this pull request?

This PR enhances `Catalog.createTable()` to allow users to set the table's description. This corresponds to the following SQL syntax:

```sql
CREATE TABLE ...
COMMENT 'this is a fancy table';
```

### Why are the changes needed?

This brings the Scala/Python catalog APIs a bit closer to what's already possible via SQL.

### Does this PR introduce any user-facing change?

Yes, it adds a new parameter to `Catalog.createTable()`.

### How was this patch tested?

Existing unit tests:

```sh
./python/run-tests \
  --python-executables python3.7 \
  --testnames 'pyspark.sql.tests.test_catalog,pyspark.sql.tests.test_context'
```

```
$ ./build/sbt
testOnly org.apache.spark.sql.internal.CatalogSuite org.apache.spark.sql.CachedTableSuite org.apache.spark.sql.hive.MetastoreDataSourcesSuite org.apache.spark.sql.hive.execution.HiveDDLSuite
```

Closes #27908 from nchammas/SPARK-31000-table-description.

Authored-by: Nicholas Chammas <nicholas.chammas@liveramp.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-08-25 13:42:31 +09:00
Terry Kim e3a88a9767 [SPARK-32516][SQL] 'path' option cannot coexist with load()'s path parameters
### What changes were proposed in this pull request?

This PR proposes to make the behavior consistent for the `path` option when loading dataframes with a single path (e.g., `option("path", path).format("parquet").load(path)` vs. `option("path", path).parquet(path)`) by disallowing the `path` option to coexist with `load`'s path parameters.

### Why are the changes needed?

The current behavior is inconsistent:
```scala
scala> Seq(1).toDF.write.mode("overwrite").parquet("/tmp/test")

scala> spark.read.option("path", "/tmp/test").format("parquet").load("/tmp/test").show
+-----+
|value|
+-----+
|    1|
+-----+

scala> spark.read.option("path", "/tmp/test").parquet("/tmp/test").show
+-----+
|value|
+-----+
|    1|
|    1|
+-----+
```

### Does this PR introduce _any_ user-facing change?

Yes, now if the `path` option is specified along with `load`'s path parameters, it would fail:
```scala
scala> Seq(1).toDF.write.mode("overwrite").parquet("/tmp/test")

scala> spark.read.option("path", "/tmp/test").format("parquet").load("/tmp/test").show
org.apache.spark.sql.AnalysisException: There is a path option set and load() is called with path parameters. Either remove the path option or move it into the load() parameters.;
  at org.apache.spark.sql.DataFrameReader.verifyPathOptionDoesNotExist(DataFrameReader.scala:310)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:232)
  ... 47 elided

scala> spark.read.option("path", "/tmp/test").parquet("/tmp/test").show
org.apache.spark.sql.AnalysisException: There is a path option set and load() is called with path parameters. Either remove the path option or move it into the load() parameters.;
  at org.apache.spark.sql.DataFrameReader.verifyPathOptionDoesNotExist(DataFrameReader.scala:310)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:250)
  at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:778)
  at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:756)
  ... 47 elided
```

The user can restore the previous behavior by setting `spark.sql.legacy.pathOptionBehavior.enabled` to `true`.

### How was this patch tested?

Added a test

Closes #29328 from imback82/dfw_option.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-08-24 16:30:30 +00:00
Cheng Su 08b951b1cb [SPARK-32649][SQL] Optimize BHJ/SHJ inner/semi join with empty hashed relation
### What changes were proposed in this pull request?

For broadcast hash join and shuffled hash join, whenever the build side hashed relation turns out to be empty, we don't need to execute the stream side plan at all and can return an empty iterator (for inner join and left semi join), because we know for sure that none of the stream side rows can be output as there's no match.

### Why are the changes needed?

A very minor optimization for a rare use case, but when the build side turns out to be empty, we can leverage it to short-circuit the stream side and save CPU and IO.

Example broadcast hash join query similar to `JoinBenchmark` with empty hashed relation:

```
  def broadcastHashJoinLongKey(): Unit = {
    val N = 20 << 20
    val M = 1 << 16

    val dim = broadcast(spark.range(0).selectExpr("id as k", "cast(id as string) as v"))
    codegenBenchmark("Join w long", N) {
      val df = spark.range(N).join(dim, (col("id") % M) === col("k"))
      assert(df.queryExecution.sparkPlan.find(_.isInstanceOf[BroadcastHashJoinExec]).isDefined)
      df.noop()
    }
  }
```

Comparing wall clock time for enabling and disabling this PR (for non-codegen code path). Seeing like 8x improvement.

```
Java HotSpot(TM) 64-Bit Server VM 1.8.0_181-b13 on Mac OS X 10.15.4
Intel(R) Core(TM) i9-9980HK CPU  2.40GHz
Join w long:                              Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Join PR disabled                                    637            646          12         32.9          30.4       1.0X
Join PR enabled                                      77             78           2        271.8           3.7       8.3X
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added unit test in `JoinSuite`.

Closes #29484 from c21/empty-relation.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-08-24 08:53:26 +00:00
Liang-Chi Hsieh 11c6a23c13 [SPARK-32352][SQL][FOLLOW-UP][TEST-HADOOP2.7][TEST-HIVE1.2] Exclude partition columns from data columns
### What changes were proposed in this pull request?

This PR fixes a bug in #29406. #29406 partially pushes down data filters even if they are mixed in with partition filters. But in some cases partition columns might also be in the data columns, so it could push down a predicate with a partition column to the datasource.

### Why are the changes needed?

The test "org.apache.spark.sql.hive.orc.HiveOrcHadoopFsRelationSuite.save()/load() - partitioned table - simple queries - partition columns in data" is currently failed with hive-1.2 profile in master branch.

```
[info] - save()/load() - partitioned table - simple queries - partition columns in data *** FAILED *** (1 second, 457 milliseconds)
[info]   java.util.NoSuchElementException: key not found: p1
[info]   at scala.collection.immutable.Map$Map2.apply(Map.scala:138)
[info]   at org.apache.spark.sql.hive.orc.OrcFilters$.buildLeafSearchArgument(OrcFilters.scala:250)
[info]   at org.apache.spark.sql.hive.orc.OrcFilters$.convertibleFiltersHelper$1(OrcFilters.scala:143)
[info]   at org.apache.spark.sql.hive.orc.OrcFilters$.$anonfun$convertibleFilters$4(OrcFilters.scala:146)
[info]   at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:245)
[info]   at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
[info]   at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
[info]   at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:38)
[info]   at scala.collection.TraversableLike.flatMap(TraversableLike.scala:245)
[info]   at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:242)
[info]   at scala.collection.AbstractTraversable.flatMap(Traversable.scala:108)
[info]   at org.apache.spark.sql.hive.orc.OrcFilters$.convertibleFilters(OrcFilters.scala:145)
[info]   at org.apache.spark.sql.hive.orc.OrcFilters$.createFilter(OrcFilters.scala:83)
[info]   at org.apache.spark.sql.hive.orc.OrcFileFormat.buildReader(OrcFileFormat.scala:142)
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test.

Closes #29526 from viirya/SPARK-32352-followup.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-08-24 14:06:08 +09:00
mingjial b9585cde31 [SPARK-32609][TEST] Add Tests for Incorrect exchange reuse with DataSourceV2
### What changes were proposed in this pull request?
Copy  to master branch the unit test added for branch-2.4(https://github.com/apache/spark/pull/29430).

### Why are the changes needed?
The unit test passes on the master branch, indicating that the issue reported in https://issues.apache.org/jira/browse/SPARK-32609 is already fixed there. The test is added to catch possible regressions in the future.

### Does this PR introduce _any_ user-facing change?
no.

### How was this patch tested?
sbt test run

Closes #29435 from mingjialiu/master.

Authored-by: mingjial <mingjial@google.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-08-23 17:40:59 -07:00
majsiel 772c70615a [SPARK-32648][SS] Remove unused DELETE_ACTION in FileStreamSinkLog
### What changes were proposed in this pull request?
Removing unused DELETE_ACTION in FileStreamSinkLog.

### Why are the changes needed?
DELETE_ACTION is not used anywhere in the code.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Tests were not added because code was only removed.

Closes #29505 from michal-wieleba/SPARK-32648.

Authored-by: majsiel <majsiel@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-08-23 17:17:39 -07:00
Max Gekk 1c798f973f [SPARK-32594][SQL][FOLLOWUP][TEST-HADOOP2.7][TEST-HIVE1.2] Override get() and use Julian days in DaysWritable
### What changes were proposed in this pull request?
Override `def get: Date` in `DaysWritable` to use `daysToMillis(int d)` from the parent class `DateWritable` instead of `long daysToMillis(int d, boolean doesTimeMatter)`.

### Why are the changes needed?
It fixes failures of `HiveSerDeReadWriteSuite` with the `hive-1.2` profile. In that case, the parent class `DateWritable` has a different implementation that predates Hive commit da3ed68eda. In particular, `get()` calls `new Date(daysToMillis(daysSinceEpoch))` instead of the overridden `def get(doesTimeMatter: Boolean): Date` in the child class. The `get()` method returns the wrong result `1970-01-01` because it uses the non-updated `daysSinceEpoch`.

### Does this PR introduce _any_ user-facing change?
Yes.

### How was this patch tested?
By running the test suite `HiveSerDeReadWriteSuite`:
```
$ build/sbt -Phive-1.2 -Phadoop-2.7 "test:testOnly org.apache.spark.sql.hive.execution.HiveSerDeReadWriteSuite"
```
and
```
$ build/sbt -Phive-2.3 -Phadoop-2.7 "test:testOnly org.apache.spark.sql.hive.execution.HiveSerDeReadWriteSuite"
```

Closes #29523 from MaxGekk/insert-date-into-hive-table-1.2.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2020-08-23 12:43:30 -07:00
angerszhu 9808c15eec [SPARK-32608][SQL][FOLLOW-UP][TEST-HADOOP2.7][TEST-HIVE1.2] Script Transform ROW FORMAT DELIMIT value should format value
### What changes were proposed in this pull request?
As mentioned in https://github.com/apache/spark/pull/29428#issuecomment-678735163 by viirya,
fix a bug in the UT: in script transformation no-serde mode, the output of decimal values is the same in both hive-1.2 and hive-2.3.

### Why are the changes needed?
Fix the UT.

### Does this PR introduce _any_ user-facing change?
NO

### How was this patch tested?
Existing UTs.

Closes #29520 from AngersZhuuuu/SPARK-32608-FOLLOW.

Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2020-08-23 08:08:55 -07:00
Liang-Chi Hsieh aa0b0b87e2 Revert "[SPARK-32646][SQL] ORC predicate pushdown should work with case-insensitive analysis"
### What changes were proposed in this pull request?

This reverts commit e277ef1a83.

### Why are the changes needed?

Master and branch-3.0 both have a few tests failing under the hive-1.2 profile, and PR #29457 missed a change in the hive-1.2 code that causes a compilation error, which makes debugging the failed tests harder. I'd like to revert #29457 first to unblock them.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit test

Closes #29519 from viirya/revert-SPARK-32646.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2020-08-23 00:40:53 -07:00
xuewei.linxuewei f258718535 [SPARK-32678][SQL] Rename EmptyHashedRelationWithAllNullKeys and simplify NAAJ generated code
### What changes were proposed in this pull request?
Some code refinement.

1. rename EmptyHashedRelationWithAllNullKeys to HashedRelationWithAllNullKeys.
2. simplify generated code for BHJ NAAJ.

### Why are the changes needed?
Refine the code and naming to avoid confusion.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing test.

Closes #29503 from leanken/leanken-SPARK-32678.

Authored-by: xuewei.linxuewei <xuewei.linxuewei@alibaba-inc.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2020-08-22 22:32:39 -07:00
Yuanjian Li 8b26c69ce7 [SPARK-31792][SS][DOC][FOLLOW-UP] Rephrase the description for some operations
### What changes were proposed in this pull request?
Rephrase the description for some operations to make it clearer.

### Why are the changes needed?
Add more detail in the document.

### Does this PR introduce _any_ user-facing change?
No, document only.

### How was this patch tested?
Document only.

Closes #29269 from xuanyuanking/SPARK-31792-follow.

Authored-by: Yuanjian Li <yuanjian.li@databricks.com>
Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
2020-08-22 21:32:23 +09:00
Robert (Bobby) Evans 12f4331b9e [SPARK-32672][SQL] Fix data corruption in boolean bit set compression
### What changes were proposed in this pull request?

This fixes SPARK-32672, a data corruption issue. Essentially, the BooleanBitSet CompressionScheme would miss nulls at the end of a CompressedBatch; the values would then default to false.
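
A hedged reproduction sketch of the affected data shape (assuming an existing `spark` session and `spark.implicits._`): a cached boolean column whose batch ends with nulls. The tiny size here is only for illustration; whether the BooleanBitSet scheme is chosen depends on the actual batch.
```scala
import spark.implicits._

val df = Seq[Option[Boolean]](Some(true), Some(false), None, None).toDF("flag")
df.cache()
df.count()   // materialize the compressed in-memory batch
df.show()    // with the fix, the trailing rows stay null instead of becoming false
```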

### Why are the changes needed?
It fixes data corruption

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
I manually tested it against the original issue that was producing errors for me. I also added a unit test.

Closes #29506 from revans2/SPARK-32672.

Authored-by: Robert (Bobby) Evans <bobby@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-08-22 11:07:14 +09:00
Brandon Jiang 1450b5e095 [MINOR][DOCS] fix typo for docs,log message and comments
### What changes were proposed in this pull request?
Fix typos in docs, log messages and comments.

### Why are the changes needed?
Typo fixes to increase readability.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Manual testing has been performed to verify the updated text.

Closes #29443 from brandonJY/spell-fix-doc.

Authored-by: Brandon Jiang <Brandon.jiang.a@outlook.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-08-22 06:45:35 +09:00
Liang-Chi Hsieh e277ef1a83 [SPARK-32646][SQL] ORC predicate pushdown should work with case-insensitive analysis
### What changes were proposed in this pull request?

This PR proposes to fix ORC predicate pushdown under case-insensitive analysis case. The field names in pushed down predicates don't need to match in exact letter case with physical field names in ORC files, if we enable case-insensitive analysis.

### Why are the changes needed?

Currently ORC predicate pushdown doesn't work with case-insensitive analysis. A predicate "a < 0" cannot pushdown to ORC file with field name "A" under case-insensitive analysis.

But Parquet predicate pushdown works with this case. We should make ORC predicate pushdown work with case-insensitive analysis too.

### Does this PR introduce _any_ user-facing change?

Yes, after this PR, under case-insensitive analysis, ORC predicate pushdown will work.

### How was this patch tested?

Unit tests.

Closes #29457 from viirya/fix-orc-pushdown.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-08-21 07:57:24 +00:00
angerszhu c75a82794f [SPARK-32667][SQL] Script transform 'default-serde' mode should pad null value to filling column
### What changes were proposed in this pull request?
In Hive no-serde mode, when the script outputs fewer columns than the specified output columns, Hive pads the missing columns with null values; Spark should do this as well.
```
hive> SELECT TRANSFORM(a, b)
    >   ROW FORMAT DELIMITED
    >   FIELDS TERMINATED BY '|'
    >   LINES TERMINATED BY '\n'
    >   NULL DEFINED AS 'NULL'
    > USING 'cat' as (a string, b string, c string, d string)
    >   ROW FORMAT DELIMITED
    >   FIELDS TERMINATED BY '|'
    >   LINES TERMINATED BY '\n'
    >   NULL DEFINED AS 'NULL'
    > FROM (
    > select 1 as a, 2 as b
    > ) tmp ;
OK
1	2	NULL	NULL
Time taken: 24.626 seconds, Fetched: 1 row(s)
```

### Why are the changes needed?
Keep the same behavior as Hive.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added UT

Closes #29500 from AngersZhuuuu/SPARK-32667.

Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-08-21 07:37:11 +00:00
Wenchen Fan 8b119f1663 [SPARK-32640][SQL] Downgrade Janino to fix a correctness bug
### What changes were proposed in this pull request?

This PR reverts https://github.com/apache/spark/pull/27860 to downgrade Janino, as the new version has a bug.

### Why are the changes needed?

The symptom is about NaN comparison. For the code below:
```
if (double_value <= 0.0) {
  ...
} else {
  ...
}
```

If `double_value` is NaN, `NaN <= 0.0` is false and we should go to the else branch. However, current Spark goes to the if branch and causes correctness issues like SPARK-32640.

One way to fix it is:
```
boolean cond = double_value <= 0.0;
if (cond) {
  ...
} else {
  ...
}
```

I'm not familiar with Janino so I don't know what's going on there.
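
For reference, a minimal Scala illustration of the NaN semantics at the heart of the bug: both `NaN <= 0.0` and `NaN > 0.0` are false, so a compiler must not invert the branch condition.
```scala
val nan = Double.NaN
assert(!(nan <= 0.0))   // false: we should take the else branch
assert(!(nan > 0.0))    // also false: the negated condition is not equivalent
```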

### Does this PR introduce _any_ user-facing change?

Yes, fix correctness bugs.

### How was this patch tested?

a new test

Closes #29495 from cloud-fan/revert.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-08-20 13:26:39 -07:00
Wenchen Fan d378dc5f6d [SPARK-28863][SQL][FOLLOWUP] Do not reuse the physical plan
### What changes were proposed in this pull request?

This is a followup of https://github.com/apache/spark/pull/29469

Instead of passing the physical plan to the fallbacked v1 source directly and skipping analysis, optimization, planning altogether, this PR proposes to pass the optimized plan.

### Why are the changes needed?

It's a bit risky to pass the physical plan directly. When the fallbacked v1 source applies more operations to the input DataFrame, it will re-apply the post-planning physical rules like `CollapseCodegenStages`, `InsertAdaptiveSparkPlan`, etc., which is very tricky.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

existing test suite with some new tests

Closes #29489 from cloud-fan/follow.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-08-20 15:23:25 +00:00
Takeshi Yamamuro d80d0ced9a [SPARK-32665][SQL][TEST] Deletes orphan directories under a warehouse dir in SQLQueryTestSuite
### What changes were proposed in this pull request?

If the previous `SQLQueryTestSuite` test run was killed, the next run will fail for the following reason:
```
[info] org.apache.spark.sql.SQLQueryTestSuite *** ABORTED *** (17 seconds, 483 milliseconds)
[info]   org.apache.spark.sql.AnalysisException: Can not create the managed table('`testdata`'). The associated location('file:/Users/maropu/Repositories/spark/spark-master/sql/core/spark-warehouse/org.apache.spark.sql.SQLQueryTestSuite/testdata') already exists.;
[info]   at org.apache.spark.sql.catalyst.catalog.SessionCatalog.validateTableLocation(SessionCatalog.scala:355)
[info]   at org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.run(createDataSourceTables.scala:170)
[info]   at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:108)
```
This PR adds code to delete orphan directories under the warehouse dir in `SQLQueryTestSuite` before creating test tables.

### Why are the changes needed?

To improve test convenience

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Manually checked

Closes #29488 from maropu/DeleteDirs.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-08-20 06:12:05 -07:00
angerszhu 6dae11d034 [SPARK-32607][SQL] Script Transformation ROW FORMAT DELIMITED TOK_TABLEROWFORMATLINES only support '\n'
### What changes were proposed in this pull request?
Script Transform no-serde (`ROW FORMAT DELIMITED`) mode `LINES TERMINATED BY` only supports `\n`.

Tested in hive :
Hive 1.1
![image](https://user-images.githubusercontent.com/46485123/90309510-ce82a180-df1b-11ea-96ab-56e2b3229489.png)

Hive 2.3.7
![image](https://user-images.githubusercontent.com/46485123/90309504-c88cc080-df1b-11ea-853e-8f65e9ed2375.png)

### Why are the changes needed?
Strictly limit the allowed usage to ensure data accuracy.

### Does this PR introduce _any_ user-facing change?
Using Script Transform no-serde (`ROW FORMAT DELIMITED`) mode with `LINES TERMINATED BY` set to anything other than `'\n'` will now throw an error.

### How was this patch tested?
Added UT

Closes #29438 from AngersZhuuuu/SPARK-32607.

Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-08-20 12:03:02 +00:00
Jungtaek Lim (HeartSaVioR) e6795cd341 [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue
### What changes were proposed in this pull request?

Many operations on CompactibleFileStreamLog read a metadata log file and materialize all entries into memory. By the nature of the compact operation, CompactibleFileStreamLog may have a huge compact log file with a bunch of entries included, and for now they're just monotonically increasing, which means the amount of memory needed to materialize them also grows incrementally. This puts pressure on GC.

This patch proposes to streamline the logic on the file stream source and sink wherever possible to avoid the memory issue. To make this possible we have to break the existing behavior of excluding entries - now the `compactLogs` method is called with all entries, which forces us to materialize all entries into memory. This hopefully has no effect on end users, because only the file stream sink has a condition to exclude entries, and that condition has never been true. (DELETE_ACTION has never been set.)

Based on the observation, this patch also changes the existing UT a bit which simulates the situation where "A" file is added, and another batch marks the "A" file as deleted. This situation simply doesn't work with the change, but as I mentioned earlier it hasn't been used. (I'm not sure the UT is from the actual run. I guess not.)

### Why are the changes needed?

The memory issue (OOME) is reported by both JIRA issue and user mailing list.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

* Existing UTs
* Manual test done

The manual test leverages the simple apps which continuously writes the file stream sink metadata log.

bea7680e4c

The test is configured to have a batch metadata log file at 1.9M (10,000 entries) whereas other Spark configuration is set to the default. (compact interval = 10) The app runs as driver, and the heap memory on driver is set to 3g.

> before the patch

<img width="1094" alt="Screen Shot 2020-06-23 at 3 37 44 PM" src="https://user-images.githubusercontent.com/1317309/85375841-d94f3480-b571-11ea-817b-c6b48b34888a.png">

It only ran for 40 mins, with the latest compact batch file size as 1.3G. The process struggled with GC, and after some struggling, it threw OOME.

> after the patch

<img width="1094" alt="Screen Shot 2020-06-23 at 3 53 29 PM" src="https://user-images.githubusercontent.com/1317309/85375901-eff58b80-b571-11ea-837e-30d107f677f9.png">

It sustained 2 hours run (manually stopped as it's expected to run more), with the latest compact batch file size as 2.2G. The actual memory usage didn't even go up to 1.2G, and be cleaned up soon without outstanding GC activity.

Closes #28904 from HeartSaVioR/SPARK-30462.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-08-20 02:26:38 -07:00
ulysses 7048fff230 [SPARK-31999][SQL][FOLLOWUP] Adds negative test cases with typos
### What changes were proposed in this pull request?

Address the [#comment](https://github.com/apache/spark/pull/28840#discussion_r471172006).

### Why are the changes needed?

Make the code more robust.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

ut.

Closes #29453 from ulysses-you/SPARK-31999-FOLLOWUP.

Authored-by: ulysses <youxiduo@weidian.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-08-20 05:33:12 +00:00
Burak Yavuz 278d0dd25b [SPARK-28863][SQL] Introduce AlreadyPlanned to prevent reanalysis of V1FallbackWriters
### What changes were proposed in this pull request?

This PR introduces a LogicalNode AlreadyPlanned, and related physical plan and preparation rule.

With the DataSourceV2 write operations, we have a way to fallback to the V1 writer APIs using InsertableRelation. The gross part is that we're in physical land, but the InsertableRelation takes a logical plan, so we have to pass the logical plans to these physical nodes, and then potentially go through re-planning. This re-planning can cause issues for an already optimized plan.

A useful primitive could be specifying that a plan is ready for execution through a logical node AlreadyPlanned. This would wrap a physical plan, and then we can go straight to execution.
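
A rough sketch of the shape such a node could take, assuming Catalyst's `LeafNode` and `SparkPlan` (the node added in this PR may carry more than this):

```scala
import org.apache.spark.sql.catalyst.expressions.Attribute
import org.apache.spark.sql.catalyst.plans.logical.LeafNode
import org.apache.spark.sql.execution.SparkPlan

// A logical leaf that simply carries an already-built physical plan, so planning
// can return the wrapped plan as-is instead of re-analyzing and re-optimizing it.
case class AlreadyPlanned(physicalPlan: SparkPlan) extends LeafNode {
  override def output: Seq[Attribute] = physicalPlan.output
}
```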

### Why are the changes needed?

To avoid having a physical plan that is disconnected from the physical plan actually being executed in the V1WriteFallback path. When a physical plan node executes a logical plan, the inner query is not connected to the running physical plan: the plan that actually runs is not visible in the Spark UI, its metrics are not exposed, and in some cases the EXPLAIN output doesn't show it.

### Does this PR introduce _any_ user-facing change?

Nope

### How was this patch tested?

V1FallbackWriterSuite tests that writes still work

Closes #29469 from brkyvz/alreadyAnalyzed2.

Lead-authored-by: Burak Yavuz <brkyvz@gmail.com>
Co-authored-by: Burak Yavuz <burak@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-08-19 16:25:35 +00:00
Terry Kim 3d1dce75d9 [SPARK-32621][SQL] 'path' option can cause issues while inferring schema in CSV/JSON datasources
### What changes were proposed in this pull request?

When the CSV/JSON datasources infer schema (e.g., `def inferSchema(files: Seq[FileStatus])`), they use the `files` along with the original options. `files` in `inferSchema` could already have been deduced from the "path" option if the option was present, so keeping the option can cause issues (e.g., reading more data, listing the path again) since the "path" option is effectively **added** to the `files` again.

### Why are the changes needed?

The current behavior can cause the following issue:
```scala
class TestFileFilter extends PathFilter {
  override def accept(path: Path): Boolean = path.getParent.getName != "p=2"
}

val path = "/tmp"
val df = spark.range(2)
df.write.json(path + "/p=1")
df.write.json(path + "/p=2")

val extraOptions = Map(
  "mapred.input.pathFilter.class" -> classOf[TestFileFilter].getName,
  "mapreduce.input.pathFilter.class" -> classOf[TestFileFilter].getName
)

// This works fine.
assert(spark.read.options(extraOptions).json(path).count == 2)

// The following with "path" option fails with the following:
// assertion failed: Conflicting directory structures detected. Suspicious paths
//	file:/tmp
//	file:/tmp/p=1
assert(spark.read.options(extraOptions).format("json").option("path", path).load.count() === 2)
```

### Does this PR introduce _any_ user-facing change?

Yes, the above failure no longer happens, and you get a consistent experience whether you use `spark.read.csv(path)` or `spark.read.format("csv").option("path", path).load`.

### How was this patch tested?

Updated existing tests.

Closes #29437 from imback82/path_bug.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-08-19 16:23:22 +00:00
angerszhu 03e2de99ab [SPARK-32608][SQL] Script Transform ROW FORMAT DELIMIT value should format value
### What changes were proposed in this pull request?
For SQL
```
SELECT TRANSFORM(a, b, c)
  ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  LINES TERMINATED BY '\n'
  NULL DEFINED AS 'null'
  USING 'cat' AS (a, b, c)
  ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  LINES TERMINATED BY '\n'
  NULL DEFINED AS 'NULL'
FROM testData
```
The correct values are:

TOK_TABLEROWFORMATFIELD should be `,` but is actually `','`

TOK_TABLEROWFORMATLINES should be `\n` but is actually `'\n'`

### Why are the changes needed?
Fix the string value formatting.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added UT

Closes #29428 from AngersZhuuuu/SPARK-32608.

Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-08-19 08:31:58 +00:00
Wenchen Fan f33b64a656 [SPARK-32652][SQL] ObjectSerializerPruning fails for RowEncoder
### What changes were proposed in this pull request?

Update `ObjectSerializerPruning.alignNullTypeInIf` to consider the isNull check generated in `RowEncoder`, which is `Invoke(inputObject, "isNullAt", BooleanType, Literal(index) :: Nil)`.
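
For illustration, a simplified version of the kind of null check the rule now has to recognize (a sketch only; the real `alignNullTypeInIf` handles more cases):

```scala
import org.apache.spark.sql.catalyst.expressions.{Expression, IsNull}
import org.apache.spark.sql.catalyst.expressions.objects.Invoke
import org.apache.spark.sql.types.BooleanType

// Treat both a plain IsNull and RowEncoder's isNullAt invocation as null checks.
def isNullCheck(cond: Expression): Boolean = cond match {
  case IsNull(_) => true
  case i: Invoke if i.functionName == "isNullAt" && i.dataType == BooleanType => true
  case _ => false
}
```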

### Why are the changes needed?

Queries fail without this fix, due to a type mismatch in `If`.

### Does this PR introduce _any_ user-facing change?

Yes, the failed query can run after this fix.

### How was this patch tested?

new tests

Closes #29467 from cloud-fan/bug.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-08-19 13:50:29 +09:00
Liang-Chi Hsieh b33066f42b [SPARK-32622][SQL][TEST] Add case-sensitivity test for ORC predicate pushdown
### What changes were proposed in this pull request?

During working on SPARK-25557, we found that ORC predicate pushdown doesn't have case-sensitivity test. This PR proposes to add case-sensitivity test for ORC predicate pushdown.

### Why are the changes needed?

Increasing test coverage for ORC predicate pushdown.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Pass Jenkins tests.

Closes #29427 from viirya/SPARK-25557-followup3.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-08-17 13:19:49 -07:00
xuewei.linxuewei 108b1dc723 [SPARK-32615][SQL] Fix AQE aggregateMetrics java.util.NoSuchElementException
### What changes were proposed in this pull request?
Found a java.util.NoSuchElementException in the UT log of AdaptiveQueryExecSuite. During AQE, when a sub-plan changes, LiveExecutionData uses the new sub-plan's SQLMetrics to override the old ones. But in the final aggregateMetrics, since the plan was updated, the old metrics throw NoSuchElementException when they try to match against the new metricTypes. To sum up, we need to filter out those outdated metrics to avoid throwing java.util.NoSuchElementException, which causes the SparkUI SQL Tab to render abnormally.
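
A tiny sketch of the filtering idea, with made-up maps in place of the real listener state:

```scala
// metricTypes: accumulator id -> metric type, taken from the *current* plan.
// collected:   accumulator id -> value, possibly including ids from replaced sub-plans.
val metricTypes: Map[Long, String] = Map(1L -> "sum", 2L -> "size")
val collected: Map[Long, Long] = Map(1L -> 10L, 3L -> 99L) // id 3 comes from an outdated sub-plan

// Keep only metrics whose id still exists, so the metricTypes(id) lookup below
// can no longer throw java.util.NoSuchElementException for stale ids.
val aggregated: Map[Long, (String, Long)] = collected
  .filter { case (id, _) => metricTypes.contains(id) }
  .map { case (id, value) => id -> (metricTypes(id) -> value) }
```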

### Why are the changes needed?
SQL metrics are not correct for some AQE cases, and they break the SparkUI SQL Tab when a NAAJ is rewritten to a LocalRelation.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
* Added case in SQLAppStatusListenerSuite.
* Run AdaptiveQueryExecSuite with no "java.util.NoSuchElementException".
* Validation on Spark Web UI

Closes #29431 from leanken/leanken-SPARK-32615.

Authored-by: xuewei.linxuewei <xuewei.linxuewei@alibaba-inc.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-08-17 14:28:10 +00:00
yi.wu 9f2893cf2c [SPARK-32466][TEST][SQL] Add PlanStabilitySuite to detect SparkPlan regression
### What changes were proposed in this pull request?

This PR proposes to detect possible regressions inside `SparkPlan`. To achieve this goal, this PR adds a base test suite called `PlanStabilitySuite`. The basic workflow of this test suite is similar to `SQLQueryTestSuite`. It also uses `SPARK_GENERATE_GOLDEN_FILES` to decide whether to regenerate the golden files or compare against the golden result for each input query. The difference is that `PlanStabilitySuite` uses the serialized explain result (.txt format) of the `SparkPlan` as the output of a query, instead of the data result.

And since `SparkPlan` is non-deterministic for various reasons (e.g., expression IDs change, expression order changes), we reduce the plan to a simplified version that only contains node names and references, and we only call out the important nodes, e.g., `Exchange` and `SubqueryExec`, in the simplified plan.
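
As a toy illustration of the simplification (not the actual suite code), the simplified plan keeps only node names plus a marker for the important nodes, dropping expression IDs and other non-deterministic details:

```scala
// Hypothetical mini plan tree, for illustration only.
case class Node(name: String, important: Boolean = false, children: Seq[Node] = Nil)

def simplify(node: Node, depth: Int = 0): String = {
  val marker = if (node.important) "* " else ""
  val line = ("  " * depth) + marker + node.name
  (line +: node.children.map(simplify(_, depth + 1))).mkString("\n")
}

val plan = Node("HashAggregate", children = Seq(
  Node("Exchange", important = true, children = Seq(Node("FileScan parquet")))))
println(simplify(plan))
// HashAggregate
//   * Exchange
//     FileScan parquet
```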

And we reuse the TPC-DS queries (v1.4, v2.7, modified) to test plan stability. Currently, each TPC-DS query can only have one corresponding simplified golden plan.

This PR also does a bit of refactoring, extracting `TPCDSBase` from `TPCDSQuerySuite` so that `PlanStabilitySuite` can use the TPC-DS queries as well.

### Why are the changes needed?

Nowadays, Spark is getting more and more complex, and any change might unintentionally cause a regression. Spark already has benchmarks to catch performance regressions, but it doesn't yet have a way to detect regressions inside `SparkPlan`. It would be good if we could detect possible regressions early, during the planning/compile phase, before runtime.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added `PlanStabilitySuite` and its subclasses.

Closes #29270 from Ngone51/plan-stable.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-08-17 14:22:12 +00:00
Wenchen Fan b94c67b502 Revert "[SPARK-32511][SQL] Add dropFields method to Column class"
This reverts commit 0c850c71e7.
2020-08-17 13:18:46 +08:00
Cheng Su 8f0fef1843 [SPARK-32399][SQL] Full outer shuffled hash join
### What changes were proposed in this pull request?

Add support for full outer join inside shuffled hash join. Currently if the query is a full outer join, we only use sort merge join as the physical operator. However this can be CPU- and IO-intensive when the input table is large. Shuffled hash join, on the other hand, saves the sort CPU and IO compared to sort merge join, especially when the table is large.

This PR implements the full outer join as follows:
* Process rows from stream side by looking up hash relation, and mark the matched rows from build side by:
  * for joining with a unique key, a `BitSet` is used to record matched rows from the build side (a `key index` represents each row)
  * for joining with a non-unique key, a `HashSet[Long]` is used to record matched rows from the build side (`key index` + `value index` represent each row); see the sketch after this list
`key index` is defined as the index into key addressing array `longArray` in `BytesToBytesMap`.
`value index` is defined as the iterator index of values for same key.

* Process rows from the build side by iterating the hash relation, and filter out the build-side rows that have already been matched (done in `ShuffledHashJoinExec.fullOuterJoin`)
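
A minimal sketch of the matched-row bookkeeping mentioned above, with illustrative names only (not the actual `ShuffledHashJoinExec` code):

```scala
import java.util.{BitSet => JBitSet, HashSet => JHashSet}

// Unique-key builds only need to remember which key index matched; non-unique
// builds pack (key index, value index) into a single Long and store it in a set.
val matchedUniqueKeys = new JBitSet()
val matchedKeyValuePairs = new JHashSet[java.lang.Long]()

def markMatched(keyIndex: Int, valueIndex: Int, uniqueKey: Boolean): Unit = {
  if (uniqueKey) matchedUniqueKeys.set(keyIndex)
  else matchedKeyValuePairs.add((keyIndex.toLong << 32) | (valueIndex & 0xFFFFFFFFL))
}

def wasMatched(keyIndex: Int, valueIndex: Int, uniqueKey: Boolean): Boolean =
  if (uniqueKey) matchedUniqueKeys.get(keyIndex)
  else matchedKeyValuePairs.contains((keyIndex.toLong << 32) | (valueIndex & 0xFFFFFFFFL))
```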

For context, this PR was originally implemented as follows (up to commit e3322766d4):
1. Construct the hash relation from the build side, with an extra boolean value at the end of each row to track lookup information (done in `ShuffledHashJoinExec.buildHashedRelation` and `UnsafeHashedRelation.apply`).
2. Process rows from the stream side by looking up the hash relation, and mark the matched rows from the build side as looked up (done in `ShuffledHashJoinExec.fullOuterJoin`).
3. Process rows from the build side by iterating the hash relation, and filter out the build-side rows that have already been looked up (done in `ShuffledHashJoinExec.fullOuterJoin`).

See discussion of pros and cons between these two approaches [here](https://github.com/apache/spark/pull/29342#issuecomment-672275450), [here](https://github.com/apache/spark/pull/29342#issuecomment-672288194) and [here](https://github.com/apache/spark/pull/29342#issuecomment-672640531).

TODO: codegen for full outer shuffled hash join can be implemented in another followup PR.

### Why are the changes needed?

As implemented in this PR, full outer shuffled hash join has the overhead of iterating the build side twice (once to build the hash map, and again to output non-matching rows) and iterating the stream side once. However, full outer sort merge join needs to iterate both sides twice, and sorting a large table can be more CPU- and IO-intensive. So full outer shuffled hash join can be more efficient than sort merge join when the stream side is much larger than the build side.

For example query below, full outer SHJ saved 30% wall clock time compared to full outer SMJ.

```
def shuffleHashJoin(): Unit = {
    val N: Long = 4 << 22
    withSQLConf(
      SQLConf.SHUFFLE_PARTITIONS.key -> "2",
      SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "20000000") {
      codegenBenchmark("shuffle hash join", N) {
        val df1 = spark.range(N).selectExpr(s"cast(id as string) as k1")
        val df2 = spark.range(N / 10).selectExpr(s"cast(id * 10 as string) as k2")
        val df = df1.join(df2, col("k1") === col("k2"), "full_outer")
        df.noop()
    }
  }
}
```

```
Running benchmark: shuffle hash join
  Running case: shuffle hash join off
  Stopped after 2 iterations, 16602 ms
  Running case: shuffle hash join on
  Stopped after 5 iterations, 31911 ms

Java HotSpot(TM) 64-Bit Server VM 1.8.0_181-b13 on Mac OS X 10.15.4
Intel(R) Core(TM) i9-9980HK CPU  2.40GHz
shuffle hash join:                        Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
shuffle hash join off                              7900           8301         567          2.1         470.9       1.0X
shuffle hash join on                               6250           6382          95          2.7         372.5       1.3X
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added unit test in `JoinSuite.scala`, `AbstractBytesToBytesMapSuite.java` and `HashedRelationSuite.scala`.

Closes #29342 from c21/full-outer-shj.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-08-17 08:06:19 +09:00
alexander-daskalov 10edeafc69 [MINOR][SQL] Fixed approx_count_distinct rsd param description
### What changes were proposed in this pull request?

In the docs for `approx_count_distinct`, I have changed the description of the rsd parameter from **_maximum estimation error allowed_** to _**maximum relative standard deviation allowed**_.

### Why are the changes needed?

"Maximum estimation error allowed" can be misleading. You can set the target relative standard deviation, which affects the estimation error, but on any given run the estimation error can still be above the rsd parameter.
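
For reference, a spark-shell usage example with an explicit rsd (the default is 0.05). Even with rsd set, the observed error on any given run is not capped:

```scala
import org.apache.spark.sql.functions.{approx_count_distinct, col}

// Target a 1% relative standard deviation; the estimate is still probabilistic.
val approx = spark.range(1000000)
  .select(approx_count_distinct(col("id"), 0.01).as("approx_distinct_ids"))
approx.show()
```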

### Does this PR introduce _any_ user-facing change?

This PR should make it easier for users reading the docs to understand that the rsd parameter in approx_count_distinct doesn't cap the estimation error, but just sets the target deviation instead.

### How was this patch tested?

No tests, as no code changes were made.

Closes #29424 from Comonut/fix-approx_count_distinct-rsd-param-description.

Authored-by: alexander-daskalov <alexander.daskalov@adevinta.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-08-14 22:10:41 +09:00
Huaxin Gao 14003d4c30 [SPARK-32590][SQL] Remove fullOutput from RowDataSourceScanExec
### What changes were proposed in this pull request?
Remove `fullOutput` from `RowDataSourceScanExec`

### Why are the changes needed?
`RowDataSourceScanExec` requires the full output instead of the scan output after column pruning. However, in the v2 code path, we don't have the full output anymore, so we just pass the pruned output. `RowDataSourceScanExec.fullOutput` is therefore meaningless and we should remove it.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
existing tests

Closes #29415 from huaxingao/rm_full_output.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-08-14 08:08:16 +00:00
ulysses 339eec5f32 [SPARK-20680][SQL][FOLLOW-UP] Add HiveVoidType in HiveClientImpl
### What changes were proposed in this pull request?

Discussion with [comment](https://github.com/apache/spark/pull/29244#issuecomment-671746329).

Add a `HiveVoidType` class in `HiveClientImpl` so that we can replace `NullType` with `HiveVoidType` before calling the Hive client.
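
A hedged sketch of the mapping idea (illustrative only; the real code lives in `HiveClientImpl` and may look different):

```scala
import org.apache.spark.sql.types.{DataType, NullType}

// Hive spells the null type "void", so translate Catalyst's NullType before
// handing the schema to the Hive client.
def toHiveTypeName(dt: DataType): String = dt match {
  case NullType => "void"
  case other    => other.catalogString // assumption: good enough for this sketch
}
```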

### Why are the changes needed?

Better compatibility with Hive.

More details in [#29244](https://github.com/apache/spark/pull/29244).

### Does this PR introduce _any_ user-facing change?

Yes, users can create a view with a null type in Hive.

### How was this patch tested?

New test.

Closes #29423 from ulysses-you/add-HiveVoidType.

Authored-by: ulysses <youxiduo@weidian.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-08-14 06:59:15 +00:00
fqaiser94@gmail.com 0c850c71e7 [SPARK-32511][SQL] Add dropFields method to Column class
### What changes were proposed in this pull request?

Added a new `dropFields` method to the `Column` class.
This method should allow users to drop a `StructField` in a `StructType` column (with similar semantics to the `drop` method on `Dataset`).

### Why are the changes needed?

Often Spark users have to work with deeply nested data e.g. to fix a data quality issue with an existing `StructField`. To do this with the existing Spark APIs, users have to rebuild the entire struct column.

For example, let's say you have the following deeply nested data structure which has a data quality issue (`5` is missing):
```
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val data = spark.createDataFrame(sc.parallelize(
      Seq(Row(Row(Row(1, 2, 3), Row(Row(4, null, 6), Row(7, 8, 9), Row(10, 11, 12)), Row(13, 14, 15))))),
      StructType(Seq(
        StructField("a", StructType(Seq(
          StructField("a", StructType(Seq(
            StructField("a", IntegerType),
            StructField("b", IntegerType),
            StructField("c", IntegerType)))),
          StructField("b", StructType(Seq(
            StructField("a", StructType(Seq(
              StructField("a", IntegerType),
              StructField("b", IntegerType),
              StructField("c", IntegerType)))),
            StructField("b", StructType(Seq(
              StructField("a", IntegerType),
              StructField("b", IntegerType),
              StructField("c", IntegerType)))),
            StructField("c", StructType(Seq(
              StructField("a", IntegerType),
              StructField("b", IntegerType),
              StructField("c", IntegerType))))
          ))),
          StructField("c", StructType(Seq(
            StructField("a", IntegerType),
            StructField("b", IntegerType),
            StructField("c", IntegerType))))
        )))))).cache

data.show(false)
+---------------------------------+
|a                                |
+---------------------------------+
|[[1, 2, 3], [[4,, 6], [7, 8, 9]]]|
+---------------------------------+
```
Currently, to drop the missing value users would have to do something like this:
```
val result = data.withColumn("a",
  struct(
    $"a.a",
    struct(
      struct(
        $"a.b.a.a",
        $"a.b.a.c"
      ).as("a"),
      $"a.b.b",
      $"a.b.c"
    ).as("b"),
    $"a.c"
  ))

result.show(false)
+---------------------------------------------------------------+
|a                                                              |
+---------------------------------------------------------------+
|[[1, 2, 3], [[4, 6], [7, 8, 9], [10, 11, 12]], [13, 14, 15]]|
+---------------------------------------------------------------+
```
As you can see above, with the existing methods users must call the `struct` function and list all fields, including fields they don't want to change. This is not ideal as:
>this leads to complex, fragile code that cannot survive schema evolution.
[SPARK-16483](https://issues.apache.org/jira/browse/SPARK-16483)

In contrast, with the method added in this PR, a user could simply do something like this to get the same result:
```
val result = data.withColumn("a", 'a.dropFields("b.a.b"))
result.show(false)
+---------------------------------------------------------------+
|a                                                              |
+---------------------------------------------------------------+
|[[1, 2, 3], [[4, 6], [7, 8, 9], [10, 11, 12]], [13, 14, 15]]|
+---------------------------------------------------------------+

```

This is the second of maybe 3 methods that could be added to the `Column` class to make it easier to manipulate nested data.
Other methods under discussion in [SPARK-22231](https://issues.apache.org/jira/browse/SPARK-22231) include `withFieldRenamed`.
However, this should be added in a separate PR.

### Does this PR introduce _any_ user-facing change?

Only one minor change. If the user submits the following query:
```
df.withColumn("a", $"a".withField(null, null))
```
instead of throwing:
```
java.lang.IllegalArgumentException: requirement failed: fieldName cannot be null
```
it will now throw:
```
java.lang.IllegalArgumentException: requirement failed: col cannot be null
```
I don't believe it should be an issue to change this because:
- neither message is incorrect
- Spark 3.1.0 has yet to be released

but please feel free to correct me if I am wrong.

### How was this patch tested?

New unit tests were added. Jenkins must pass them.

### Related JIRAs:
More discussion on this topic can be found here:
- https://issues.apache.org/jira/browse/SPARK-22231
- https://issues.apache.org/jira/browse/SPARK-16483

Closes #29322 from fqaiser94/SPARK-32511.

Lead-authored-by: fqaiser94@gmail.com <fqaiser94@gmail.com>
Co-authored-by: fqaiser94 <fqaiser94@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-08-13 03:28:25 +00:00
Chen Zhang 08d86ebc05 [MINOR] Update URL of the parquet project in code comment
### What changes were proposed in this pull request?
Update URL of the parquet project in code comment.

### Why are the changes needed?
The original url is not available.

### Does this PR introduce _any_ user-facing change?
NO

### How was this patch tested?
No test needed.

Closes #29416 from izchen/Update-Parquet-URL.

Authored-by: Chen Zhang <izchen@126.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-08-13 11:54:43 +09:00
angerszhu 643cd876e4 [SPARK-32352][SQL] Partially push down support data filter if it mixed in partition filters
### What changes were proposed in this pull request?
We have supported partially pushing down partition filters since SPARK-28169. We can also partially push down data filters when they are mixed with partition filters. For example:
```
spark.sql(
  s"""
     |CREATE TABLE t(i INT, p STRING)
     |USING parquet
     |PARTITIONED BY (p)""".stripMargin)

spark.range(0, 1000).selectExpr("id as col").createOrReplaceTempView("temp")
for (part <- Seq(1, 2, 3, 4)) {
  sql(s"""
         |INSERT OVERWRITE TABLE t PARTITION (p='$part')
         |SELECT col FROM temp""".stripMargin)
}

spark.sql("SELECT * FROM t WHERE  WHERE (p = '1' AND i = 1) OR (p = '2' and i = 2)").explain()
```

With this change we can also push down the data filter `i = 1 OR i = 2`.

### Why are the changes needed?
Push more data filters down to `FileSourceScanExec`.

### Does this PR introduce _any_ user-facing change?
NO

### How was this patch tested?
Added UT

Closes #29406 from AngersZhuuuu/SPARK-32352.

Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-08-12 12:18:33 +00:00
Tin Hang To a418548dad [SPARK-31703][SQL] Parquet RLE float/double are read incorrectly on big endian platforms
### What changes were proposed in this pull request?
This PR fixes the issue introduced during SPARK-26985.

SPARK-26985 changes the `putDoubles()` and `putFloats()` methods to respect the platform's endian-ness.  However, that causes the RLE paths in VectorizedRleValuesReader.java to read the RLE entries in parquet as BIG_ENDIAN on big endian platforms (i.e., as is), even though parquet data is always in little endian format.

The comments in `WriteableColumnVector.java` say those methods are used for "ieee formatted doubles in platform native endian" (or floats), but since the data in parquet is always in little endian format, use of those methods appears to be inappropriate.

To demonstrate the problem with spark-shell:

```scala
import org.apache.spark._
import org.apache.spark.sql._
import org.apache.spark.sql.types._

var data = Seq(
  (1.0, 0.1),
  (2.0, 0.2),
  (0.3, 3.0),
  (4.0, 4.0),
  (5.0, 5.0))

var df = spark.createDataFrame(data).write.mode(SaveMode.Overwrite).parquet("/tmp/data.parquet2")
var df2 = spark.read.parquet("/tmp/data.parquet2")
df2.show()
```

result:

```scala
+--------------------+--------------------+
|                  _1|                  _2|
+--------------------+--------------------+
|           3.16E-322|-1.54234871366845...|
|         2.0553E-320|         2.0553E-320|
|          2.561E-320|          2.561E-320|
|4.66726145843124E-62|         1.0435E-320|
|        3.03865E-319|-1.54234871366757...|
+--------------------+--------------------+
```

Also tests in ParquetIOSuite that involve float/double data would fail, e.g.,

- basic data types (without binary)
- read raw Parquet file

/examples/src/main/python/mllib/isotonic_regression_example.py would fail as well.

The proposed code change is to add `putDoublesLittleEndian()` and `putFloatsLittleEndian()` methods for Parquet to invoke, just like the existing `putIntsLittleEndian()` and `putLongsLittleEndian()`. On little endian platforms they call `putDoubles()` and `putFloats()`; on big endian platforms they read the entries as little endian, like pre-SPARK-26985.
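
The core of the fix can be illustrated with a small sketch (not the actual `WritableColumnVector` / `VectorizedRleValuesReader` code): decode the Parquet bytes through an explicitly little-endian view, regardless of the platform's native byte order.

```scala
import java.nio.{ByteBuffer, ByteOrder}

// Parquet stores double data in little endian on disk, so decode it with an
// explicit LITTLE_ENDIAN buffer instead of relying on the platform's native order.
def readDoublesLittleEndian(bytes: Array[Byte], offset: Int, count: Int): Array[Double] = {
  val buf = ByteBuffer.wrap(bytes, offset, count * 8).order(ByteOrder.LITTLE_ENDIAN)
  Array.fill(count)(buf.getDouble())
}
```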

No new unit-test is introduced as the existing ones are actually sufficient.

### Why are the changes needed?
RLE float/double data in parquet files will not be read back correctly on big endian platforms.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
All unit tests (mvn test) were run and passed.

Closes #29383 from tinhto-000/SPARK-31703.

Authored-by: Tin Hang To <tinto@us.ibm.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-08-12 06:39:10 +00:00
angerszhu 4cf8c1d07d [SPARK-32400][SQL] Improve test coverage of HiveScriptTransformationExec
### What changes were proposed in this pull request?

1. Extract common test case (no serde) to BasicScriptTransformationExecSuite
2. Add more test case for no serde mode about supported data type and behavior in `BasicScriptTransformationExecSuite`
3. Add more test case for hive serde mode about supported type and behavior in `HiveScriptTransformationExecSuite`

### Why are the changes needed?
Improve test coverage of Script Transformation

### Does this PR introduce _any_ user-facing change?
NO

### How was this patch tested?
Added UT

Closes #29401 from AngersZhuuuu/SPARK-32400.

Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-08-12 06:02:42 +00:00
Max Gekk 0477d23467 [SPARK-32594][SQL] Fix serialization of dates inserted to Hive tables
### What changes were proposed in this pull request?
Fix `DaysWritable` by overriding the parent method `def get(doesTimeMatter: Boolean): Date` from `DateWritable` instead of `Date get()`, because the write path uses the former. The bug occurs because `HiveOutputWriter.write()` transitively calls `def get(doesTimeMatter: Boolean): Date` with the default implementation from the parent class `DateWritable`, which doesn't respect date rebasing and uses an uninitialized `daysSinceEpoch` (0, which is `1970-01-01`).

### Why are the changes needed?
The changes fix the bug:
```sql
spark-sql> CREATE TABLE table1 (d date);
spark-sql> INSERT INTO table1 VALUES (date '2020-08-11');
spark-sql> SELECT * FROM table1;
1970-01-01
```
The expected result of the last SQL statement must be **2020-08-11** but got **1970-01-01**.

### Does this PR introduce _any_ user-facing change?
Yes. After the fix, `INSERT` work correctly:
```sql
spark-sql> SELECT * FROM table1;
2020-08-11
```

### How was this patch tested?
Add new test to `HiveSerDeReadWriteSuite`

Closes #29409 from MaxGekk/insert-date-into-hive-table.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-08-12 13:32:16 +09:00
xuewei.linxuewei c37357a092 [SPARK-32573][SQL] Anti Join Improvement with EmptyHashedRelation and EmptyHashedRelationWithAllNullKeys
### What changes were proposed in this pull request?
In [SPARK-32290](https://issues.apache.org/jira/browse/SPARK-32290), we introduced several new types of HashedRelation.

* EmptyHashedRelation
* EmptyHashedRelationWithAllNullKeys

They were limited to the NAAJ scenario only. These new HashedRelations could be applied to other scenarios for performance improvements.

* EmptyHashedRelation could also be used in a normal anti join for fast stop
* While AQE is on and the build side is EmptyHashedRelationWithAllNullKeys, we can convert the NAAJ to an empty LocalRelation to skip meaningless data iteration, since in single-key NAAJ, if a null key exists on the build side, all records on the streamed side are dropped.

This patch includes two changes.

* use EmptyHashedRelation to fast-stop the common anti join as well
* in AQE, eliminate BroadcastHashJoin (NAAJ) if the build side is an EmptyHashedRelationWithAllNullKeys (both fast paths are sketched below)
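
A conceptual sketch of the two fast paths with simplified stand-in types (not the actual join/AQE code):

```scala
sealed trait BuildSide
case object EmptyRelation extends BuildSide          // build side produced no rows
case object EmptyWithAllNullKeys extends BuildSide   // build side had only null join keys
final case class NonEmpty(rows: Seq[Int]) extends BuildSide

// Returns Some(result) when a fast path applies, None to fall back to the regular join.
def antiJoinFastPath(streamed: Seq[Int], build: BuildSide, nullAware: Boolean): Option[Seq[Int]] =
  build match {
    case EmptyRelation                     => Some(streamed)  // nothing can match: keep every streamed row
    case EmptyWithAllNullKeys if nullAware => Some(Seq.empty) // NAAJ: a null build key drops all streamed rows
    case _                                 => None
  }
```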

### Why are the changes needed?
LeftAntiJoin can apply `fast stop` when the build side is EmptyHashedRelation, while within AQE, with EmptyHashedRelationWithAllNullKeys, we can eliminate the NAAJ entirely. This should be a performance improvement for anti joins.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?

* added case in AdaptiveQueryExecSuite.
* added case in HashedRelationSuite.
* Make sure SubquerySuite JoinSuite SQLQueryTestSuite passed.

Closes #29389 from leanken/leanken-SPARK-32573.

Authored-by: xuewei.linxuewei <xuewei.linxuewei@alibaba-inc.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-08-11 06:23:51 +00:00
allisonwang-db 1b7443bd9a [SPARK-32216][SQL] Remove redundant ProjectExec
### What changes were proposed in this pull request?
This PR added a physical rule to remove redundant project nodes. A `ProjectExec` is redundant when
1. It has the same output attributes and order as its child's output when ordering of these attributes is required.
2. It has the same output attributes as its child's output when attribute output ordering is not required.

For example:
After Filter:
```
== Physical Plan ==
*(1) Project [a#14L, b#15L, c#16, key#17]
+- *(1) Filter (isnotnull(a#14L) AND (a#14L > 5))
   +- *(1) ColumnarToRow
      +- FileScan parquet [a#14L,b#15L,c#16,key#17]
```
The `Project a#14L, b#15L, c#16, key#17` is redundant because its output is exactly the same as the filter's output.

Before Aggregate:
```
== Physical Plan ==
*(2) HashAggregate(keys=[key#17], functions=[sum(a#14L), last(b#15L, false)], output=[sum_a#39L, key#17, last_b#41L])
+- Exchange hashpartitioning(key#17, 5), true, [id=#77]
   +- *(1) HashAggregate(keys=[key#17], functions=[partial_sum(a#14L), partial_last(b#15L, false)], output=[key#17, sum#49L, last#50L, valueSet#51])
      +- *(1) Project [key#17, a#14L, b#15L]
         +- *(1) Filter (isnotnull(a#14L) AND (a#14L > 100))
            +- *(1) ColumnarToRow
               +- FileScan parquet [a#14L,b#15L,key#17]
```
The `Project key#17, a#14L, b#15L` is redundant because hash aggregate doesn't require the child plan's output to be in a specific order.
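
Putting the two conditions together, a minimal sketch of the redundancy check (attribute names stand in for Catalyst attributes; the actual rule compares `Attribute`s and expression IDs):

```scala
// orderingMatters corresponds to case 1 above: the parent requires a specific attribute order.
def isRedundantProject(
    projectOutput: Seq[String],
    childOutput: Seq[String],
    orderingMatters: Boolean): Boolean = {
  if (orderingMatters) projectOutput == childOutput
  else projectOutput.length == childOutput.length && projectOutput.toSet == childOutput.toSet
}
```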

### Why are the changes needed?

It removes unnecessary query nodes and makes the query plan cleaner.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit tests

Closes #29031 from allisonwang-db/remove-project.

Lead-authored-by: allisonwang-db <66282705+allisonwang-db@users.noreply.github.com>
Co-authored-by: allisonwang-db <allison.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-08-11 03:14:15 +00:00