Commit graph

5079 commits

Author SHA1 Message Date
Ali Afroozeh 0a6043f683 [SPARK-32755][SQL] Maintain the order of expressions in AttributeSet and ExpressionSet
### What changes were proposed in this pull request?
This PR changes `AttributeSet` and `ExpressionSet` to maintain the insertion order of the elements. More specifically, we:
- change the underlying data structure of `AttributeSet` from `HashSet` to `LinkedHashSet` to maintain the insertion order.
- `ExpressionSet` already uses a list to keep track of the expressions; however, since it extends Scala's `immutable.Set` class, operations such as `map` and `flatMap` are delegated to `immutable.Set` itself. This means that the result of these operations is no longer an instance of `ExpressionSet`; rather, it is an implementation picked by the parent class. We also remove this inheritance from `immutable.Set` and implement the needed methods directly. `ExpressionSet` has very specific semantics and it does not make sense to extend `immutable.Set` anyway.
- change the `PlanStabilitySuite` to not sort the attributes, to be able to catch changes in the order of expressions in different runs.
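
As a minimal sketch of why the first change matters (plain Scala collections, not the Spark classes themselves): `LinkedHashSet` iterates in insertion order, while `HashSet` iteration order depends on hashing.

```scala
import scala.collection.mutable

// Illustration only, not Spark code: the same three elements inserted in the
// same order iterate deterministically for LinkedHashSet but not for HashSet.
val linked = mutable.LinkedHashSet("c", "a", "b")
val hashed = mutable.HashSet("c", "a", "b")
println(linked.mkString(", "))  // always "c, a, b" (insertion order)
println(hashed.mkString(", "))  // order depends on hash codes
```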

### Why are the changes needed?
Expression identity is based on the `ExprId`, which is an auto-incremented number. This means that the same query can yield a query plan with different expression ids in different runs. `AttributeSet` and `ExpressionSet` internally use a `HashSet` as the underlying data structure, and therefore cannot guarantee a fixed ordering in different runs. This can be problematic in cases where we would like to check for plan changes across runs.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Passes `PlanStabilitySuite` after regenerating the golden files.

Closes #29598 from dbaliafroozeh/FixOrderOfExpressions.

Authored-by: Ali Afroozeh <ali.afroozeh@databricks.com>
Signed-off-by: herman <herman@databricks.com>
2020-09-03 13:56:03 +02:00
Yuanjian Li 95f1e9549b [SPARK-32782][SS] Refactor StreamingRelationV2 and move it to catalyst
### What changes were proposed in this pull request?
Move StreamingRelationV2 to the catalyst module and bind with the Table interface.

### Why are the changes needed?
Currently, `StreamingRelationV2` is bound to `TableProvider`. Since the V2 relation is not bound to `DataSource`, to make it more flexible and extensible it should be moved to the catalyst module and bound to the `Table` interface. We did a similar thing for `DataSourceV2Relation`.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing UT.

Closes #29633 from xuanyuanking/SPARK-32782.

Authored-by: Yuanjian Li <yuanjian.li@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-09-03 16:04:36 +09:00
Kent Yao 1fba286407 [SPARK-32781][SQL] Non-ASCII characters are mistakenly omitted in the middle of intervals
### What changes were proposed in this pull request?

This PR fails the interval values parsing when they contain non-ASCII characters which are silently omitted right now.

e.g. the case below should be invalid

```
select interval 'interval中文 1 day'
```

### Why are the changes needed?

Bug fix: interval parsing should fail when the value contains invalid characters.

### Does this PR introduce _any_ user-facing change?

yes,

#### before

`select interval 'interval中文 1 day'` previously returned `1 day`; now it fails with:

```
org.apache.spark.sql.catalyst.parser.ParseException

Cannot parse the INTERVAL value: interval中文 1 day
```

### How was this patch tested?

new tests

Closes #29632 from yaooqinn/SPARK-32781.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-09-03 04:56:40 +00:00
angerszhu 5e6173ebef [SPARK-31670][SQL] Trim unnecessary Struct field alias in Aggregate/GroupingSets
### What changes were proposed in this pull request?
A struct field that appears both in GROUP BY and in an aggregate expression with CUBE/ROLLUP/GROUPING SETS fails during analysis.

```
test("SPARK-31670") {
  withTable("t1") {
      sql(
        """
          |CREATE TEMPORARY VIEW t(a, b, c) AS
          |SELECT * FROM VALUES
          |('A', 1, NAMED_STRUCT('row_id', 1, 'json_string', '{"i": 1}')),
          |('A', 2, NAMED_STRUCT('row_id', 2, 'json_string', '{"i": 1}')),
          |('A', 2, NAMED_STRUCT('row_id', 2, 'json_string', '{"i": 2}')),
          |('B', 1, NAMED_STRUCT('row_id', 3, 'json_string', '{"i": 1}')),
          |('C', 3, NAMED_STRUCT('row_id', 4, 'json_string', '{"i": 1}'))
        """.stripMargin)

      checkAnswer(
        sql(
          """
            |SELECT a, c.json_string, SUM(b)
            |FROM t
            |GROUP BY a, c.json_string
            |WITH CUBE
            |""".stripMargin),
        Row("A", "{\"i\": 1}", 3) :: Row("A", "{\"i\": 2}", 2) :: Row("A", null, 5) ::
          Row("B", "{\"i\": 1}", 1) :: Row("B", null, 1) ::
          Row("C", "{\"i\": 1}", 3) :: Row("C", null, 3) ::
          Row(null, "{\"i\": 1}", 7) :: Row(null, "{\"i\": 2}", 2) :: Row(null, null, 9) :: Nil)

  }
}
```
Error:
```
[info] - SPARK-31670 *** FAILED *** (2 seconds, 857 milliseconds)
[info]   Failed to analyze query: org.apache.spark.sql.AnalysisException: expression 't.`c`' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;;
[info]   Aggregate [a#247, json_string#248, spark_grouping_id#246L], [a#247, c#223.json_string AS json_string#241, sum(cast(b#222 as bigint)) AS sum(b)#243L]
[info]   +- Expand [List(a#221, b#222, c#223, a#244, json_string#245, 0), List(a#221, b#222, c#223, a#244, null, 1), List(a#221, b#222, c#223, null, json_string#245, 2), List(a#221, b#222, c#223, null, null, 3)], [a#221, b#222, c#223, a#247, json_string#248, spark_grouping_id#246L]
[info]      +- Project [a#221, b#222, c#223, a#221 AS a#244, c#223.json_string AS json_string#245]
[info]         +- SubqueryAlias t
[info]            +- Project [col1#218 AS a#221, col2#219 AS b#222, col3#220 AS c#223]
[info]               +- Project [col1#218, col2#219, col3#220]
[info]                  +- LocalRelation [col1#218, col2#219, col3#220]
[info]
```
When a struct field is resolved, it is wrapped in an `Alias`. When the struct field appears in GROUP BY with CUBE/ROLLUP etc., the field in the grouping expressions and in the aggregate expressions is resolved with different expression ids, as shown below:
```
'Aggregate [cube(a#221, c#223.json_string AS json_string#240)], [a#221, c#223.json_string AS json_string#241, sum(cast(b#222 as bigint)) AS sum(b)#243L]
+- SubqueryAlias t
   +- Project [col1#218 AS a#221, col2#219 AS b#222, col3#220 AS c#223]
      +- Project [col1#218, col2#219, col3#220]
         +- LocalRelation [col1#218, col2#219, col3#220]
```
This makes `ResolveGroupingAnalytics.constructAggregateExprs()` fail to replace the aggregate expression with the Expand output attribute of the grouping expression, since their expression ids differ, and the error above is raised.

### Why are the changes needed?
Fix an analysis bug.

### Does this PR introduce _any_ user-facing change?
NO

### How was this patch tested?
Added UT

Closes #28490 from AngersZhuuuu/SPARK-31670.

Lead-authored-by: angerszhu <angers.zhu@gmail.com>
Co-authored-by: AngersZhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-09-02 13:49:09 +00:00
Karol Chmist 7511e43c50 [SPARK-32756][SQL] Fix CaseInsensitiveMap usage for Scala 2.13
### What changes were proposed in this pull request?

This is a follow-up of #29160. This allows Spark SQL project to compile for Scala 2.13.

### Why are the changes needed?

It's needed for #28545

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

I compiled with Scala 2.13. It fails in `Spark REPL` project, which will be fixed by #28545

Closes #29584 from karolchmist/SPARK-32364-scala-2.13.

Authored-by: Karol Chmist <info+github@chmist.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-09-02 08:27:00 -05:00
liwensun f0851e95c6 [SPARK-32776][SS] Limit in streaming should not be optimized away by PropagateEmptyRelation
### What changes were proposed in this pull request?

PropagateEmptyRelation will not be applied to LIMIT operators in streaming queries.

### Why are the changes needed?

Right now, the limit operator in a streaming query may get optimized away when the relation is empty. This can be problematic for stateful streaming, as this empty batch will not write any state store files, and the next batch will fail when trying to read these state store files and throw a file not found error.

We should not let PropagateEmptyRelation optimize away the Limit operator for streaming queries.

This PR is intended as a small and safe fix for PropagateEmptyRelation. A fundamental fix that can prevent this from happening again in the future and in other optimizer rules is more desirable, but that's a much larger task.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
unit tests.

Closes #29623 from liwensun/spark-32776.

Authored-by: liwensun <liwen.sun@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-09-02 18:05:06 +09:00
Zhenhua Wang 2a88a20271 [SPARK-32754][SQL][TEST] Unify to assertEqualJoinPlans for join reorder suites
### What changes were proposed in this pull request?

Now the three join reorder suites (`JoinReorderSuite`, `StarJoinReorderSuite`, `StarJoinCostBasedReorderSuite`) all contain an `assertEqualPlans` method and the logic is almost the same. We can extract the method to a single place for code simplicity.

### Why are the changes needed?

To reduce code redundancy.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Covered by existing tests.

Closes #29594 from wzhfy/unify_assertEqualPlans_joinReorder.

Authored-by: Zhenhua Wang <wzh_zju@163.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-09-01 09:08:35 -07:00
Kris Mok 6e5bc39e17 [SPARK-32624][SQL][FOLLOWUP] Fix regression in CodegenContext.addReferenceObj on nested Scala types
### What changes were proposed in this pull request?

Use `CodeGenerator.typeName()` instead of `Class.getCanonicalName()` in `CodegenContext.addReferenceObj()` for getting the runtime class name for an object.

### Why are the changes needed?

https://github.com/apache/spark/pull/29439 fixed a bug in `CodegenContext.addReferenceObj()` for `Array[Byte]` (i.e. Spark SQL's `BinaryType`) objects, but unfortunately it introduced a regression for some nested Scala types.

For example, for `implicitly[Ordering[UTF8String]]`, after that PR `CodegenContext.addReferenceObj()` would return `((null) references[0] /* ... */)`. The actual type for `implicitly[Ordering[UTF8String]]` is `scala.math.LowPriorityOrderingImplicits$$anon$3` in Scala 2.12.10, and `Class.getCanonicalName()` returns `null` for that class.

On the other hand, `Class.getName()` is safe to use for all non-array types, and Janino will happily accept the type name returned from `Class.getName()` for nested types. `CodeGenerator.typeName()` happens to do the right thing by correctly handling arrays and otherwise use `Class.getName()`. So it's a better alternative than `Class.getCanonicalName()`.

Side note: rule of thumb for using Java reflection in Spark: it may be tempting to use `Class.getCanonicalName()`, but for functions that may need to handle Scala types, please avoid it due to potential issues with nested Scala types.
Instead, use `Class.getName()` or utility functions in `org.apache.spark.util.Utils` (e.g. `Utils.getSimpleName()` or `Utils.getFormattedClassName()` etc).
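
A small example of the failure mode described above (plain Scala/Java reflection, not Spark code): for an anonymous class, `getCanonicalName` returns `null`, whereas `getName` always returns a loadable binary name.

```scala
// Anonymous class instance, similar in spirit to implicitly[Ordering[UTF8String]].
val ord = new Ordering[String] {
  def compare(a: String, b: String): Int = a.compareTo(b)
}
println(ord.getClass.getName)           // e.g. "Example$$anon$1" -- usable in codegen
println(ord.getClass.getCanonicalName)  // null -- the source of the regression
```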

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added new unit test case for the regression case in `CodeGenerationSuite`.

Closes #29602 from rednaxelafx/spark-32624-followup.

Authored-by: Kris Mok <kris.mok@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-09-01 15:15:11 +09:00
Chao Sun 94d313b061 [SPARK-32721][SQL][FOLLOWUP] Simplify if clauses with null and boolean
### What changes were proposed in this pull request?

This is a follow-up on SPARK-32721 and PR #29567. In the previous PR we missed two more cases that can be optimized:
```
if(p, false, null) ==> and(not(p), null)
if(p, true, null) ==> or(p, null)
```

### Why are the changes needed?

By transforming if to boolean conjunctions or disjunctions, we can enable more filter pushdown to datasources.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Added unit tests.

Closes #29603 from sunchao/SPARK-32721-2.

Authored-by: Chao Sun <sunchao@apache.org>
Signed-off-by: DB Tsai <d_tsai@apple.com>
2020-09-01 06:06:25 +00:00
Chao Sun 1453a09a63 [SPARK-32721][SQL] Simplify if clauses with null and boolean
### What changes were proposed in this pull request?

The following if clause:
```sql
if(p, null, false)
```
can be simplified to:
```sql
and(p, null)
```
Similarly, the clause:
```sql
if(p, null, true)
```
can be simplified to
```sql
or(not(p), null)
```
iff the predicate `p` is non-nullable, i.e., can be evaluated to either true or false, but not null.
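
As a sanity check of the rewrite, here is a toy sketch that models SQL three-valued logic with `Option[Boolean]` (an assumption for illustration, not Spark code) and verifies that `if(p, null, false)` and `and(p, null)` agree for any non-null `p`:

```scala
// NULL is modeled as None; sqlAnd follows SQL three-valued AND semantics.
def sqlAnd(a: Option[Boolean], b: Option[Boolean]): Option[Boolean] = (a, b) match {
  case (Some(false), _) | (_, Some(false)) => Some(false)
  case (Some(true), Some(true))            => Some(true)
  case _                                   => None
}
def ifExpr(p: Boolean, t: Option[Boolean], f: Option[Boolean]): Option[Boolean] =
  if (p) t else f

for (p <- Seq(true, false)) {
  // if(p, null, false) == and(p, null) when p is non-nullable
  assert(ifExpr(p, None, Some(false)) == sqlAnd(Some(p), None))
}
```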

### Why are the changes needed?

Converting if to or/and clauses can better push filters down.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit tests.

Closes #29567 from sunchao/SPARK-32721.

Authored-by: Chao Sun <sunchao@apache.org>
Signed-off-by: DB Tsai <d_tsai@apple.com>
2020-08-31 20:59:54 +00:00
Huaxin Gao 806140de40 [SPARK-32592][SQL] Make DataFrameReader.table take the specified options
### What changes were proposed in this pull request?
pass specified options in DataFrameReader.table to JDBCTableCatalog.loadTable

### Why are the changes needed?
Currently, `DataFrameReader.table` ignores the specified options. Options specified like the following are lost:
```
    val df = spark.read
      .option("partitionColumn", "id")
      .option("lowerBound", "0")
      .option("upperBound", "3")
      .option("numPartitions", "2")
      .table("h2.test.people")
```
We need to make `DataFrameReader.table` take the specified options.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Manually tested for now. A test will be added after V2 JDBC read is implemented.

Closes #29535 from huaxingao/table_options.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-08-31 13:21:15 +00:00
Kent Yao 6dacba7fa0 [SPARK-32733][SQL] Add extended information - arguments/examples/since/notes of expressions to the remarks field of GetFunctionsOperation
### What changes were proposed in this pull request?

This PR adds the extended information of a function, including arguments, examples, notes and the since field, to the SparkGetFunctionOperation.

### Why are the changes needed?

Better user experience: it will help JDBC users better understand our built-in functions.
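
A hedged usage sketch of how a JDBC client would see the enriched information, using the standard `DatabaseMetaData.getFunctions` API (the JDBC URL below is hypothetical):

```scala
import java.sql.DriverManager

val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default")
val rs = conn.getMetaData.getFunctions(null, null, "date_part")
while (rs.next()) {
  println(rs.getString("FUNCTION_NAME"))
  println(rs.getString("REMARKS"))  // with this change: usage, arguments, examples, note, since
}
conn.close()
```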

### Does this PR introduce _any_ user-facing change?

Yes, BI tools and JDBC users will get full information on a spark function instead of only fragmentary usage info.

e.g. date_part

#### before

```
date_part(field, source) - Extracts a part of the date/timestamp or interval source.
```
#### after

```
    Usage:
      date_part(field, source) - Extracts a part of the date/timestamp or interval source.

    Arguments:
      * field - selects which part of the source should be extracted, and supported string values are as same as the fields of the equivalent function `EXTRACT`.
      * source - a date/timestamp or interval column from where `field` should be extracted

    Examples:
      > SELECT date_part('YEAR', TIMESTAMP '2019-08-12 01:00:00.123456');
       2019
      > SELECT date_part('week', timestamp'2019-08-12 01:00:00.123456');
       33
      > SELECT date_part('doy', DATE'2019-08-12');
       224
      > SELECT date_part('SECONDS', timestamp'2019-10-01 00:00:01.000001');
       1.000001
      > SELECT date_part('days', interval 1 year 10 months 5 days);
       5
      > SELECT date_part('seconds', interval 5 hours 30 seconds 1 milliseconds 1 microseconds);
       30.001001

    Note:
      The date_part function is equivalent to the SQL-standard function `EXTRACT(field FROM source)`

    Since: 3.0.0

```

### How was this patch tested?

New tests

Closes #29577 from yaooqinn/SPARK-32733.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-08-31 11:03:01 +09:00
Udbhav30 065f17386d [SPARK-32481][CORE][SQL] Support truncate table to move data to trash
### What changes were proposed in this pull request?
Instead of deleting the data, we can move the data to trash.
Based on the configuration provided by the user it will be deleted permanently from the trash.

### Why are the changes needed?
Instead of directly deleting the data, we can provide flexibility to move data to the trash and then delete it permanently.

### Does this PR introduce _any_ user-facing change?
Yes. After truncating a table, the data is no longer deleted permanently right away;
it is first moved to the trash and then deleted permanently after the given time.

### How was this patch tested?
new UTs added

Closes #29552 from Udbhav30/truncate.

Authored-by: Udbhav30 <u.agrawal30@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-08-30 10:25:32 -07:00
Wenchen Fan ccc0250a08 [SPARK-32718][SQL] Remove unnecessary keywords for interval units
### What changes were proposed in this pull request?

Remove the YEAR, MONTH, DAY, HOUR, MINUTE, SECOND keywords. They are not useful in the parser: since we need to support plurals like YEARS, the parser has to accept a general identifier as the interval unit anyway.

### Why are the changes needed?

These keywords are reserved in ANSI. If Spark has these keywords, they become reserved under ANSI mode, which prevents Spark from running TPCDS queries because they use YEAR as an alias name.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Added `TPCDSQueryANSISuite`, to make sure Spark with ANSI mode can run TPCDS queries.

Closes #29560 from cloud-fan/keyword.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-08-29 14:06:01 -07:00
Takeshi Yamamuro 0cb91b8c18 [SPARK-32704][SQL] Logging plan changes for execution
### What changes were proposed in this pull request?

Since we only log plan changes for analyzer/optimizer now, this PR intends to add code to log plan changes in the preparation phase in `QueryExecution` for execution.
```
scala> spark.sql("SET spark.sql.optimizer.planChangeLog.level=WARN")
scala> spark.range(10).groupBy("id").count().queryExecution.executedPlan
...
20/08/26 09:32:36 WARN PlanChangeLogger:
=== Applying Rule org.apache.spark.sql.execution.CollapseCodegenStages ===
!HashAggregate(keys=[id#19L], functions=[count(1)], output=[id#19L, count#23L])              *(1) HashAggregate(keys=[id#19L], functions=[count(1)], output=[id#19L, count#23L])
!+- HashAggregate(keys=[id#19L], functions=[partial_count(1)], output=[id#19L, count#27L])   +- *(1) HashAggregate(keys=[id#19L], functions=[partial_count(1)], output=[id#19L, count#27L])
!   +- Range (0, 10, step=1, splits=4)                                                          +- *(1) Range (0, 10, step=1, splits=4)

20/08/26 09:32:36 WARN PlanChangeLogger:
=== Result of Batch Preparations ===
!HashAggregate(keys=[id#19L], functions=[count(1)], output=[id#19L, count#23L])              *(1) HashAggregate(keys=[id#19L], functions=[count(1)], output=[id#19L, count#23L])
!+- HashAggregate(keys=[id#19L], functions=[partial_count(1)], output=[id#19L, count#27L])   +- *(1) HashAggregate(keys=[id#19L], functions=[partial_count(1)], output=[id#19L, count#27L])
!   +- Range (0, 10, step=1, splits=4)                                                          +- *(1) Range (0, 10, step=1, splits=4)
```

### Why are the changes needed?

Easy debugging for executed plans

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added unit tests.

Closes #29544 from maropu/PlanLoggingInPreparations.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-08-28 16:35:47 +00:00
Kent Yao 0626901bcb [SPARK-32729][SQL][DOCS] Add missing since version for math functions
### What changes were proposed in this pull request?

Add missing since version for math functions, including
SPARK-8223 shiftright/shiftleft
SPARK-8215 pi
SPARK-8212 e
SPARK-6829 sin/asin/sinh/cos/acos/cosh/tan/atan/tanh/ceil/floor/rint/cbrt/signum/isignum/Fsignum/Lsignum/degrees/radians/log/log10/log1p/exp/expm1/pow/hypot/atan2
SPARK-8209 conv
SPARK-8213 factorial
SPARK-20751 cot
SPARK-2813 sqrt
SPARK-8227 unhex
SPARK-8218 log(a,b)
SPARK-8207 bin
SPARK-8214 hex
SPARK-8206 round
SPARK-14614 bround

### Why are the changes needed?

fix SQL docs
### Does this PR introduce _any_ user-facing change?

yes, doc updated

### How was this patch tested?

passing doc generation.

Closes #29571 from yaooqinn/minor.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-08-29 00:30:31 +09:00
yi.wu c3b9404253 [SPARK-32717][SQL] Add a AQEOptimizer for AdaptiveSparkPlanExec
### What changes were proposed in this pull request?

This PR proposes to add a specific `AQEOptimizer` for the `AdaptiveSparkPlanExec` instead of implementing an anonymous `RuleExecutor`. At the same time, this PR also adds the configuration `spark.sql.adaptive.optimizer.excludedRules`, which follows the same pattern of `Optimizer`, to make the `AQEOptimizer` more flexible for users and developers.
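
A hedged usage sketch of the new configuration, assuming a `SparkSession` named `spark` (e.g. in `spark-shell`); the rule name below is purely illustrative, any fully qualified AQE optimizer rule name would go there:

```scala
// Exclude a specific AQE optimizer rule by its fully qualified class name.
spark.conf.set(
  "spark.sql.adaptive.optimizer.excludedRules",
  "org.apache.spark.sql.execution.adaptive.SomeAqeRule")  // illustrative rule name
```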

### Why are the changes needed?

Currently, `AdaptiveSparkPlanExec` has implemented an anonymous `RuleExecutor` to apply the AQE optimize rules on the plan. However, the anonymous class usually could be inconvenient to maintain and extend for the long term.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

It's a pure refactor, so passing the existing tests should be enough.

Closes #29559 from Ngone51/impro-aqe-optimizer.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-08-28 21:23:53 +09:00
HyukjinKwon c154629171 [SPARK-32183][DOCS][PYTHON] User Guide - PySpark Usage Guide for Pandas with Apache Arrow
### What changes were proposed in this pull request?

This PR proposes to move Arrow usage guide from Spark documentation site to PySpark documentation site (at "User Guide").

Here is the demo for reviewing quicker: https://hyukjin-spark.readthedocs.io/en/stable/user_guide/arrow_pandas.html

### Why are the changes needed?

To have a single place for PySpark users, and better documentation.

### Does this PR introduce _any_ user-facing change?

Yes, it will move https://spark.apache.org/docs/latest/sql-pyspark-pandas-with-arrow.html to our PySpark documentation.

### How was this patch tested?

```bash
cd docs
SKIP_SCALADOC=1 SKIP_RDOC=1 SKIP_SQLDOC=1 jekyll serve --watch
```

and

```bash
cd python/docs
make clean html
```

Closes #29548 from HyukjinKwon/SPARK-32183.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-08-28 15:09:06 +09:00
Terry Kim baaa756dee [SPARK-32516][SQL][FOLLOWUP] 'path' option cannot coexist with path parameter for DataFrameWriter.save(), DataStreamReader.load() and DataStreamWriter.start()
### What changes were proposed in this pull request?

This is a follow up PR to #29328 to apply the same constraint where `path` option cannot coexist with path parameter to `DataFrameWriter.save()`, `DataStreamReader.load()` and `DataStreamWriter.start()`.

### Why are the changes needed?

The current behavior silently overwrites the `path` option if a path parameter is passed to `DataFrameWriter.save()`, `DataStreamReader.load()` or `DataStreamWriter.start()`.

For example,
```
Seq(1).toDF.write.option("path", "/tmp/path1").parquet("/tmp/path2")
```
will write the result to `/tmp/path2`.

### Does this PR introduce _any_ user-facing change?

Yes, if `path` option coexists with path parameter to any of the above methods, it will throw `AnalysisException`:
```
scala> Seq(1).toDF.write.option("path", "/tmp/path1").parquet("/tmp/path2")
org.apache.spark.sql.AnalysisException: There is a 'path' option set and save() is called with a  path parameter. Either remove the path option, or call save() without the parameter. To ignore this check, set 'spark.sql.legacy.pathOptionBehavior.enabled' to 'true'.;
```

The user can restore the previous behavior by setting `spark.sql.legacy.pathOptionBehavior.enabled` to `true`.

### How was this patch tested?

Added new tests.

Closes #29543 from imback82/path_option.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-08-27 06:21:04 +00:00
Dongjoon Hyun 2dee4352a0 Revert "[SPARK-32481][CORE][SQL] Support truncate table to move data to trash"
This reverts commit 5c077f0580.
2020-08-26 11:24:35 -07:00
Udbhav30 5c077f0580 [SPARK-32481][CORE][SQL] Support truncate table to move data to trash
### What changes were proposed in this pull request?
Instead of deleting the data, we can move the data to trash.
Based on the configuration provided by the user it will be deleted permanently from the trash.

### Why are the changes needed?
Instead of directly deleting the data, we can provide flexibility to move data to the trash and then delete it permanently.

### Does this PR introduce _any_ user-facing change?
Yes. After truncating a table, the data is no longer deleted permanently right away;
it is first moved to the trash and then deleted permanently after the given time.

### How was this patch tested?
new UTs added

Closes #29387 from Udbhav30/tuncateTrash.

Authored-by: Udbhav30 <u.agrawal30@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-08-25 23:38:43 -07:00
Sean Owen a9d4e60a90 [SPARK-32614][SQL] Don't apply comment processing if 'comment' unset for CSV
### What changes were proposed in this pull request?

Spark's CSV source can optionally ignore lines starting with a comment char. Some code paths check to see if it's set before applying comment logic (i.e. not set to default of `\0`), but many do not, including the one that passes the option to Univocity. This means that rows beginning with a null char were being treated as comments even when 'disabled'.
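
A hedged example of the option in question, assuming a `SparkSession` named `spark` and a hypothetical file path; with this fix, comment processing only applies when `comment` is explicitly set:

```scala
// Comment handling is opt-in: only lines starting with '#' are skipped here.
val withComments = spark.read
  .option("comment", "#")
  .csv("/tmp/example.csv")

// Without the option, no line is treated as a comment -- including lines that
// happen to start with the '\u0000' default placeholder.
val noComments = spark.read.csv("/tmp/example.csv")
```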

### Why are the changes needed?

To avoid dropping rows that start with a null char when this is not requested or intended. See JIRA for an example.

### Does this PR introduce _any_ user-facing change?

Nothing beyond the effect of the bug fix.

### How was this patch tested?

Existing tests plus new test case.

Closes #29516 from srowen/SPARK-32614.

Authored-by: Sean Owen <srowen@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-08-26 00:25:58 +09:00
fqaiser94@gmail.com 3f1e56d4ca [SPARK-32641][SQL] withField + getField should return null if original struct was null
### What changes were proposed in this pull request?

There is a bug in the way the optimizer rule in `SimplifyExtractValueOps` is currently written in master branch which yields incorrect results in scenarios like the following:
```
sql("SELECT CAST(NULL AS struct<a:int,b:int>) struct_col")
.select($"struct_col".withField("d", lit(4)).getField("d").as("d"))

// currently returns this:
+---+
|d  |
+---+
|4  |
+---+

// when in fact it should return this:
+----+
|d   |
+----+
|null|
+----+
```
The changes in this PR will fix this bug.

### Why are the changes needed?

To fix the aforementioned bug. Optimizer rules should improve the performance of the  query but yield exactly the same results.

### Does this PR introduce _any_ user-facing change?

Yes, this bug will no longer occur.
That said, this isn't something to be concerned about as this bug was introduced in Spark 3.1 and Spark 3.1 has yet to be released.

### How was this patch tested?

Unit tests were added. Jenkins must pass them.

Closes #29522 from fqaiser94/SPARK-32641.

Authored-by: fqaiser94@gmail.com <fqaiser94@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-08-25 04:59:37 +00:00
Yesheng Ma 3eee915b47 [MINOR][SQL] Add missing documentation for LongType mapping
### What changes were proposed in this pull request?

Added Java docs for Long data types in the Row class.

### Why are the changes needed?

The Long datatype is somehow missing in Row.scala's `apply` and `get` methods.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing UTs.

Closes #29534 from yeshengm/docs-fix.

Authored-by: Yesheng Ma <kimi.ysma@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-08-25 11:20:01 +09:00
yangjie01 a30bb0cfda [SPARK-32550][SQL][FOLLOWUP] Eliminate negative impact on HyperLogLogSuite
### What changes were proposed in this pull request?
Change to use `dataTypes.foreach` instead of accessing elements by index in the `def this(dataTypes: Seq[DataType])` constructor of `SpecificInternalRow`, because random access performance is unsatisfactory if the input argument is not an `IndexedSeq`.

This PR follows srowen's advice.
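
A minimal sketch of the performance issue behind the change (plain Scala, not the `SpecificInternalRow` code itself): for a non-`IndexedSeq` such as `List`, `apply(i)` is O(i), so an index-based loop is quadratic overall, while a single `foreach` traversal is linear.

```scala
val dataTypes: Seq[Int] = List.tabulate(100000)(identity)  // a linear Seq, not an IndexedSeq

// Index-based loop: each dataTypes(i) walks the list from the head -> O(n^2) total.
var i = 0
var slowSum = 0L
while (i < dataTypes.length) { slowSum += dataTypes(i); i += 1 }

// Single traversal, as in the fix: O(n) total.
var fastSum = 0L
dataTypes.foreach(fastSum += _)
```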

### Why are the changes needed?
I found that SPARK-32550 had a negative impact on performance. The typical case is "deterministic cardinality estimation" in `HyperLogLogPlusPlusSuite` when rsd is 0.001; the code that is significantly slower is line 41 of `HyperLogLogPlusPlusSuite`: `new SpecificInternalRow(hll.aggBufferAttributes.map(_.dataType))`

08b951b1cb/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/aggregate/HyperLogLogPlusPlusSuite.scala (L40-L44)

The size of "hll.aggBufferAttributes" in this case is 209716, the results of comparison before and after spark-32550 merged are as follows, The unit is ns:

Case | After SPARK-32550 createBuffer | After SPARK-32550 end to end | Before SPARK-32550 createBuffer | Before SPARK-32550 end to end
-- | -- | -- | -- | --
rsd 0.001, n   1000 | 52715513243 | 53004810687 | 195807999 | 773977677
rsd 0.001, n   5000 | 51881246165 | 52519358215 | 13689949 | 249974855
rsd 0.001, n   10000 | 52234282788 | 52374639172 | 14199071 | 183452846
rsd 0.001, n   50000 | 55503517122 | 55664035449 | 15219394 | 584477125
rsd 0.001, n   100000 | 51862662845 | 52116774177 | 19662834 | 166483678
rsd 0.001, n   500000 | 51619226715 | 52183189526 | 178048012 | 16681330
rsd 0.001, n   1000000 | 54861366981 | 54976399142 | 226178708 | 18826340
rsd 0.001, n   5000000 | 52023602143 | 52354615149 | 388173579 | 15446409
rsd 0.001, n   10000000 | 53008591660 | 53601392304 | 533454460 | 16033032

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
`mvn test -pl sql/catalyst -DwildcardSuites=org.apache.spark.sql.catalyst.expressions.aggregate.HyperLogLogPlusPlusSuite -Dtest=none`

**Before**:

```
Run completed in 8 minutes, 18 seconds.
Total number of tests run: 5
Suites: completed 2, aborted 0
Tests: succeeded 5, failed 0, canceled 0, ignored 0, pending 0
```

**After**
```
Run completed in 7 seconds, 65 milliseconds.
Total number of tests run: 5
Suites: completed 2, aborted 0
Tests: succeeded 5, failed 0, canceled 0, ignored 0, pending 0
```

Closes #29529 from LuciferYang/revert-spark-32550.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-08-25 11:13:01 +09:00
Terry Kim e3a88a9767 [SPARK-32516][SQL] 'path' option cannot coexist with load()'s path parameters
### What changes were proposed in this pull request?

This PR proposes to make the behavior consistent for the `path` option when loading dataframes with a single path (e.g, `option("path", path).format("parquet").load(path)` vs. `option("path", path).parquet(path)`) by disallowing `path` option to coexist with `load`'s path parameters.

### Why are the changes needed?

The current behavior is inconsistent:
```scala
scala> Seq(1).toDF.write.mode("overwrite").parquet("/tmp/test")

scala> spark.read.option("path", "/tmp/test").format("parquet").load("/tmp/test").show
+-----+
|value|
+-----+
|    1|
+-----+

scala> spark.read.option("path", "/tmp/test").parquet("/tmp/test").show
+-----+
|value|
+-----+
|    1|
|    1|
+-----+
```

### Does this PR introduce _any_ user-facing change?

Yes, now if the `path` option is specified along with `load`'s path parameters, it would fail:
```scala
scala> Seq(1).toDF.write.mode("overwrite").parquet("/tmp/test")

scala> spark.read.option("path", "/tmp/test").format("parquet").load("/tmp/test").show
org.apache.spark.sql.AnalysisException: There is a path option set and load() is called with path parameters. Either remove the path option or move it into the load() parameters.;
  at org.apache.spark.sql.DataFrameReader.verifyPathOptionDoesNotExist(DataFrameReader.scala:310)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:232)
  ... 47 elided

scala> spark.read.option("path", "/tmp/test").parquet("/tmp/test").show
org.apache.spark.sql.AnalysisException: There is a path option set and load() is called with path parameters. Either remove the path option or move it into the load() parameters.;
  at org.apache.spark.sql.DataFrameReader.verifyPathOptionDoesNotExist(DataFrameReader.scala:310)
  at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:250)
  at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:778)
  at org.apache.spark.sql.DataFrameReader.parquet(DataFrameReader.scala:756)
  ... 47 elided
```

The user can restore the previous behavior by setting `spark.sql.legacy.pathOptionBehavior.enabled` to `true`.

### How was this patch tested?

Added a test

Closes #29328 from imback82/dfw_option.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-08-24 16:30:30 +00:00
yangjie01 25c7d0fe6a [SPARK-32526][SQL] Pass all test of sql/catalyst module in Scala 2.13
### What changes were proposed in this pull request?
The purpose of this PR is to resolve [SPARK-32526](https://issues.apache.org/jira/browse/SPARK-32526); all remaining failed cases are fixed.

The main changes of this PR are as follows:

- Change `ExecutorAllocationManager.scala` so that the core module compiles in Scala 2.13; it was a blocking problem

- Change `Seq[_]` to `scala.collection.Seq[_]` based on the failed cases

- Added a different expected plan for `Test 4: Star with several branches` of `StarJoinCostBasedReorderSuite` for Scala 2.13, because the candidate plans:

```
Join Inner, (d1_pk#5 = f1_fk1#0)
:- Join Inner, (f1_fk2#1 = d2_pk#8)
:  :- Join Inner, (f1_fk3#2 = d3_pk#11)
```
and

```
Join Inner, (f1_fk2#1 = d2_pk#8)
:- Join Inner, (d1_pk#5 = f1_fk1#0)
:  :- Join Inner, (f1_fk3#2 = d3_pk#11)
```

have the same cost `Cost(200,9200)`, but `HashMap` is rewritten in Scala 2.13 and the order of iteration leads to different results.

This PR fixes the following test cases:

- `LiteralExpressionSuite` (1 FAILED -> PASS)
- `StarJoinCostBasedReorderSuite` (1 FAILED -> PASS)
- `ObjectExpressionsSuite` (2 FAILED -> PASS)
- `ScalaReflectionSuite` (1 FAILED -> PASS)
- `RowEncoderSuite` (10 FAILED -> PASS)
- `ExpressionEncoderSuite` (ABORTED -> PASS)

### Why are the changes needed?
We need to support a Scala 2.13 build.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
- Scala 2.12: Pass the Jenkins or GitHub Action

- Scala 2.13: Do the following:

```
dev/change-scala-version.sh 2.13
mvn clean install -DskipTests  -pl sql/catalyst -Pscala-2.13 -am
mvn test -pl sql/catalyst -Pscala-2.13
```

**Before**
```
Tests: succeeded 4035, failed 17, canceled 0, ignored 6, pending 0
*** 1 SUITE ABORTED ***
*** 15 TESTS FAILED ***
```

**After**

```
Tests: succeeded 4338, failed 0, canceled 0, ignored 6, pending 0
All tests passed.
```

Closes #29434 from LuciferYang/sql-catalyst-tests.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-08-22 09:24:16 -05:00
Brandon Jiang 1450b5e095 [MINOR][DOCS] fix typo for docs,log message and comments
### What changes were proposed in this pull request?
Fix typo for docs, log messages and comments

### Why are the changes needed?
typo fix to increase readability

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Manual testing has been performed to verify the updates.

Closes #29443 from brandonJY/spell-fix-doc.

Authored-by: Brandon Jiang <Brandon.jiang.a@outlook.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-08-22 06:45:35 +09:00
Wenchen Fan 3dca81e4f5 [SPARK-32669][SQL][TEST] Expression unit tests should explore all cases that can lead to null result
### What changes were proposed in this pull request?

Add document to `ExpressionEvalHelper`, and ask people to explore all the cases that can lead to null results (including null in struct fields, array elements and map values).

This PR also fixes `ComplexTypeSuite.GetArrayStructFields` to explore all the null cases.

### Why are the changes needed?

It happened several times that we hit correctness bugs caused by wrong expression nullability. When writing unit tests, we usually don't test the nullability flag directly, and it's too late to add such tests for all expressions.

In https://github.com/apache/spark/pull/22375, we extended the expression test framework, which checks the nullability flag when the expected result/field/element is null.

This requires the test cases to explore all the cases that can lead to null results.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

I reverted 5d296ed39e locally, and `ComplexTypeSuite` can catch the bug.

Closes #29493 from cloud-fan/small.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-08-22 06:23:46 +09:00
Wenchen Fan 8b119f1663 [SPARK-32640][SQL] Downgrade Janino to fix a correctness bug
### What changes were proposed in this pull request?

This PR reverts https://github.com/apache/spark/pull/27860 to downgrade Janino, as the new version has a bug.

### Why are the changes needed?

The symptom is about NaN comparison. For the code below:
```
if (double_value <= 0.0) {
  ...
} else {
  ...
}
```

If `double_value` is NaN, `NaN <= 0.0` is false and we should go to the else branch. However, current Spark goes to the if branch and causes correctness issues like SPARK-32640.

One way to fix it is:
```
boolean cond = double_value <= 0.0;
if (cond) {
  ...
} else {
  ...
}
```

I'm not familiar with Janino so I don't know what's going on there.

### Does this PR introduce _any_ user-facing change?

Yes, fix correctness bugs.

### How was this patch tested?

a new test

Closes #29495 from cloud-fan/revert.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-08-20 13:26:39 -07:00
Yuming Wang 409fea30cc [SPARK-32624][SQL] Use getCanonicalName to fix byte[] compile issue
### What changes were proposed in this pull request?
```scala
scala> Array[Byte](1, 2).getClass.getName
res13: String = [B

scala> Array[Byte](1, 2).getClass.getCanonicalName
res14: String = byte[]
```

This PR replaces `getClass.getName` with `getClass.getCanonicalName` in `CodegenContext.addReferenceObj` to fix the `byte[]` compile issue:
```
...
/* 030 */       value_1 = org.apache.spark.sql.catalyst.util.TypeUtils.compareBinary(value_2, (([B) references[0] /* min */)) >= 0 && org.apache.spark.sql.catalyst.util.TypeUtils.compareBinary(value_2, (([B) references[1] /* max */)) <= 0;
/* 031 */     }
/* 032 */     return !isNull_1 && value_1;
/* 033 */   }
/* 034 */
/* 035 */
/* 036 */ }

20:49:54.886 WARN org.apache.spark.sql.catalyst.expressions.Predicate: Expr codegen error and falling back to interpreter mode
java.util.concurrent.ExecutionException: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 30, Column 81: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 30, Column 81: Unexpected token "[" in primary
...
```

### Why are the changes needed?

Fix compile issue when compiling generated code.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test.

Closes #29439 from wangyum/SPARK-32624.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Yuming Wang <wgyumg@gmail.com>
2020-08-19 05:20:26 -07:00
angerszhu 03e2de99ab [SPARK-32608][SQL] Script Transform ROW FORMAT DELIMIT value should format value
### What changes were proposed in this pull request?
For the following SQL:
```
SELECT TRANSFORM(a, b, c)
  ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  LINES TERMINATED BY '\n'
  NULL DEFINED AS 'null'
  USING 'cat' AS (a, b, c)
  ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  LINES TERMINATED BY '\n'
  NULL DEFINED AS 'NULL'
FROM testData
```
The correct TOK_TABLEROWFORMATFIELD should be `,` but is actually `','`.

The correct TOK_TABLEROWFORMATLINES should be `\n` but is actually `'\n'`.

### Why are the changes needed?
Fix string value format

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added UT

Closes #29428 from AngersZhuuuu/SPARK-32608.

Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-08-19 08:31:58 +00:00
Samir Khan e15ae60a53 [SPARK-32550][SQL] Make SpecificInternalRow constructors faster by using while loops instead of maps
### What changes were proposed in this pull request?
Change maps in two constructors of SpecificInternalRow to while loops.

### Why are the changes needed?
This was originally noticed with https://github.com/apache/spark/pull/29353 and https://github.com/apache/spark/pull/29354 and will have impacts on performance of reading ORC and Avro files. Ran AvroReadBenchmarks with the new cases of nested and array'd structs in https://github.com/apache/spark/pull/29352. Haven't run benchmarks for ORC but can do that if needed.

**Before:**
```
Nested Struct Scan:                       Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
Nested Struct                                     74674          75319         912          0.0      142429.1       1.0X

Array of Struct Scan:                     Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
Array of Structs                                  34193          34339         206          0.0       65217.9       1.0X
```
**After:**
```
Nested Struct Scan:                       Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
Nested Struct                                     48451          48619         237          0.0       92413.2       1.0X

Array of Struct Scan:                     Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
Array of Structs                                  18518          18683         234          0.0       35319.6       1.0X
```

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Ran AvroReadBenchmarks with the new cases of nested and array'd structs in https://github.com/apache/spark/pull/29352.

Closes #29366 from msamirkhan/spark-32550.

Lead-authored-by: Samir Khan <muhammad.samir.khan@gmail.com>
Co-authored-by: skhan04 <samirkhan@verizonmedia.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-08-19 14:57:34 +09:00
Wenchen Fan f33b64a656 [SPARK-32652][SQL] ObjectSerializerPruning fails for RowEncoder
### What changes were proposed in this pull request?

Update `ObjectSerializerPruning.alignNullTypeInIf`, to consider the isNull check generated in `RowEncoder`, which is `Invoke(inputObject, "isNullAt", BooleanType, Literal(index) :: Nil)`.

### Why are the changes needed?

Query fails if we don't fix this bug, due to type mismatch in `If`.

### Does this PR introduce _any_ user-facing change?

Yes, the failed query can run after this fix.

### How was this patch tested?

new tests

Closes #29467 from cloud-fan/bug.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-08-19 13:50:29 +09:00
Wenchen Fan b94c67b502 Revert "[SPARK-32511][SQL] Add dropFields method to Column class"
This reverts commit 0c850c71e7.
2020-08-17 13:18:46 +08:00
Cheng Su 8f0fef1843 [SPARK-32399][SQL] Full outer shuffled hash join
### What changes were proposed in this pull request?

Add support for full outer join inside shuffled hash join. Currently, if the query is a full outer join, we only use sort merge join as the physical operator. However, sort merge join can be CPU- and IO-intensive when the input table is large. Shuffled hash join, on the other hand, saves the sort CPU and IO compared to sort merge join, especially when the table is large.

This PR implements the full outer join as followed:
* Process rows from stream side by looking up hash relation, and mark the matched rows from build side by:
  * for joining with unique key, a `BitSet` is used to record matched rows from build side (`key index` to represent each row)
  * for joining with non-unique key, a `HashSet[Long]` is  used to record matched rows from build side (`key index` + `value index` to represent each row).
`key index` is defined as the index into key addressing array `longArray` in `BytesToBytesMap`.
`value index` is defined as the iterator index of values for same key.

* Process rows from build side by iterating hash relation, and filter out rows from build side being looked up already (done in `ShuffledHashJoinExec.fullOuterJoin`)
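
The following simplified sketch illustrates the bookkeeping for the unique-key case with toy Scala collections (an assumption for illustration; the real implementation uses `BytesToBytesMap` and key/value indexes): stream rows probe the build side and mark matches in a `BitSet`, and a second pass over the build side emits the never-matched rows.

```scala
import java.util.BitSet

val buildKeys = Array(1, 2, 3, 4)                       // build side (unique keys)
val buildMap  = buildKeys.zipWithIndex.toMap            // key -> key index
val matched   = new BitSet(buildKeys.length)

// Phase 1: probe with the stream side, marking matched build rows by key index.
val streamKeys = Seq(2, 4, 5)
streamKeys.foreach { k =>
  buildMap.get(k) match {
    case Some(idx) => matched.set(idx)                  // matched pair would be emitted here
    case None      => ()                                // stream row padded with nulls
  }
}

// Phase 2: iterate the build side and emit rows never marked as matched.
val unmatchedBuild = buildKeys.indices.filterNot(i => matched.get(i)).map(i => buildKeys(i))
println(unmatchedBuild)                                 // Vector(1, 3)
```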

For context, this PR was originally implemented as followed (up to commit e3322766d4):
1. Construct hash relation from build side, with extra boolean value at the end of row to track look up information (done in `ShuffledHashJoinExec.buildHashedRelation` and `UnsafeHashedRelation.apply`).
2. Process rows from stream side by looking up hash relation, and mark the matched rows from build side be looked up (done in `ShuffledHashJoinExec.fullOuterJoin`).
3. Process rows from build side by iterating hash relation, and filter out rows from build side being looked up already (done in `ShuffledHashJoinExec.fullOuterJoin`).

See discussion of pros and cons between these two approaches [here](https://github.com/apache/spark/pull/29342#issuecomment-672275450), [here](https://github.com/apache/spark/pull/29342#issuecomment-672288194) and [here](https://github.com/apache/spark/pull/29342#issuecomment-672640531).

TODO: codegen for full outer shuffled hash join can be implemented in another followup PR.

### Why are the changes needed?

As implemented in this PR, full outer shuffled hash join has the overhead of iterating the build side twice (once for building the hash map, and again for outputting non-matching rows) and the stream side once. However, full outer sort merge join needs to iterate both sides twice, and sorting a large table can be more CPU and IO intensive. So full outer shuffled hash join can be more efficient than sort merge join when the stream side is much larger than the build side.

For the example query below, full outer SHJ saved 30% of wall clock time compared to full outer SMJ.

```
def shuffleHashJoin(): Unit = {
    val N: Long = 4 << 22
    withSQLConf(
      SQLConf.SHUFFLE_PARTITIONS.key -> "2",
      SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "20000000") {
      codegenBenchmark("shuffle hash join", N) {
        val df1 = spark.range(N).selectExpr(s"cast(id as string) as k1")
        val df2 = spark.range(N / 10).selectExpr(s"cast(id * 10 as string) as k2")
        val df = df1.join(df2, col("k1") === col("k2"), "full_outer")
        df.noop()
    }
  }
}
```

```
Running benchmark: shuffle hash join
  Running case: shuffle hash join off
  Stopped after 2 iterations, 16602 ms
  Running case: shuffle hash join on
  Stopped after 5 iterations, 31911 ms

Java HotSpot(TM) 64-Bit Server VM 1.8.0_181-b13 on Mac OS X 10.15.4
Intel(R) Core(TM) i9-9980HK CPU  2.40GHz
shuffle hash join:                        Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
shuffle hash join off                              7900           8301         567          2.1         470.9       1.0X
shuffle hash join on                               6250           6382          95          2.7         372.5       1.3X
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added unit test in `JoinSuite.scala`, `AbstractBytesToBytesMapSuite.java` and `HashedRelationSuite.scala`.

Closes #29342 from c21/full-outer-shj.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-08-17 08:06:19 +09:00
Yuming Wang c280c7f529 [SPARK-32625][SQL] Log error message when falling back to interpreter mode
### What changes were proposed in this pull request?

This pr log the error message when falling back to interpreter mode.

### Why are the changes needed?

Not all error messages are in `CodeGenerator`, such as:
```
21:48:44.612 WARN org.apache.spark.sql.catalyst.expressions.Predicate: Expr codegen error and falling back to interpreter mode
java.lang.IllegalArgumentException: Can not interpolate org.apache.spark.sql.types.Decimal into code block.
	at org.apache.spark.sql.catalyst.expressions.codegen.Block$BlockHelper$.$anonfun$code$1(javaCode.scala:240)
	at org.apache.spark.sql.catalyst.expressions.codegen.Block$BlockHelper$.$anonfun$code$1$adapted(javaCode.scala:236)
	at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
	at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual test.

Closes #29440 from wangyum/SPARK-32625.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-08-15 12:31:32 -07:00
yi.wu c6be2074cc [SPARK-32616][SQL] Window operators should be added determinedly
### What changes were proposed in this pull request?

Use the `LinkedHashMap` instead of `immutable.Map` to hold the `Window` expressions in `ExtractWindowExpressions.addWindow`.

### Why are the changes needed?

This is a bug fix for https://github.com/apache/spark/pull/29270. In that PR, the generated plan (especially for queries q47, q49, q57) on Jenkins could never match the golden plan generated on my laptop.

It happens because `ExtractWindowExpressions.addWindow` now uses `immutable.Map` to hold the `Window` expressions by the key `(spec.partitionSpec, spec.orderSpec, WindowFunctionType.functionType(expr))` and converts the map to `Seq` at the end. Then, the `Seq` is used to add Window operators on top of the child plan. However, for the same query, the order of Window expressions inside the `Seq` could be undetermined when the expression id changes (which can affect the key). As a result, the same query could have different plans because of the undetermined order of Window operators.

Therefore, we use `LinkedHashMap`, which records the insertion order of entries, to make the adding order determined.
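
As a small illustration (plain Scala collections, not the analyzer code), a `LinkedHashMap` iterates its entries in insertion order regardless of how the keys hash, which is what makes the order of the added Window operators deterministic:

```scala
import scala.collection.mutable

val windows = mutable.LinkedHashMap.empty[(Int, Int), String]
windows((3, 1)) = "w1"
windows((1, 2)) = "w2"
windows((2, 9)) = "w3"
println(windows.keys.mkString(", "))  // always "(3,1), (1,2), (2,9)" -- insertion order
```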

### Does this PR introduce _any_ user-facing change?

Maybe yes, users now always see the same plan for the same queries with multiple Window operators.

### How was this patch tested?

It's really hard to make a reproduce demo. I just tested manually with https://github.com/apache/spark/pull/29270 and it looks good.

Closes #29432 from Ngone51/fix-addWindow.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-08-14 13:29:48 +00:00
alexander-daskalov 10edeafc69 [MINOR][SQL] Fixed approx_count_distinct rsd param description
### What changes were proposed in this pull request?

In the docs for `approx_count_distinct`, I have changed the description of the rsd parameter from **_maximum estimation error allowed_** to _**maximum relative standard deviation allowed**_.

### Why are the changes needed?

Maximum estimation error allowed can be misleading. You can set the target relative standard deviation, which affects the estimation error, but on given runs the estimation error can still be above the rsd parameter.

### Does this PR introduce _any_ user-facing change?

This PR should make it easier for users reading the docs to understand that the rsd parameter in approx_count_distinct doesn't cap the estimation error, but just sets the target deviation instead.

### How was this patch tested?

No tests, as no code changes were made.

Closes #29424 from Comonut/fix-approx_count_distinct-rsd-param-description.

Authored-by: alexander-daskalov <alexander.daskalov@adevinta.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-08-14 22:10:41 +09:00
yangjie01 6ae2cb2db3 [SPARK-32526][SQL] Fix some test cases of sql/catalyst module in scala 2.13
### What changes were proposed in this pull request?
The purpose of this PR is to partially resolve [SPARK-32526](https://issues.apache.org/jira/browse/SPARK-32526); a total of 88 failed and 2 aborted test cases were fixed. The related suites are as follows:

- `DataSourceV2AnalysisBaseSuite` related test cases (71 FAILED -> Pass)
- `TreeNodeSuite` (1 FAILED -> Pass)
- `MetadataSuite` (1 FAILED -> Pass)
- `InferFiltersFromConstraintsSuite` (3 FAILED -> Pass)
- `StringExpressionsSuite` (1 FAILED -> Pass)
- `JacksonParserSuite` (1 FAILED -> Pass)
- `HigherOrderFunctionsSuite` (1 FAILED -> Pass)
- `ExpressionParserSuite` (1 FAILED -> Pass)
- `CollectionExpressionsSuite` (6 FAILED -> Pass)
- `SchemaUtilsSuite` (2 FAILED -> Pass)
- `ExpressionSetSuite` (ABORTED -> Pass)
- `ArrayDataIndexedSeqSuite` (ABORTED -> Pass)

The main changes of this PR are as follows:

- `Optimizer` and `Analyzer` are changed to compile; `ArrayBuffer` is not a `Seq` in Scala 2.13, so the `toSeq` method is called manually for compatibility with Scala 2.12

- The `m.mapValues().view.force` pattern returns a `Map` in Scala 2.12 but an `IndexedSeq` in Scala 2.13, so the `toMap` method is called manually for compatibility with Scala 2.12 (see the sketch after this list). `TreeNode` is changed to pass the `DataSourceV2AnalysisBaseSuite` related test cases and the `TreeNodeSuite` failed case.

- Call `toMap` in the `case map` branch of the `Metadata#hash` method, because `map.mapValues` returns a `Map` in Scala 2.12 and a `MapView` in Scala 2.13.

- The `++` method of the Scala 2.13 version of `ExpressionSet` is implemented by referring to the Scala 2.12 `ExpressionSet`, so that it conforms to `ExpressionSet` semantics

- `GenericArrayData` does not accept `ArrayBuffer` input; call `toSeq` when constructing a `GenericArrayData` from an `ArrayBuffer`, for Scala version compatibility

- Call `toSeq` in the `RandomDataGenerator#randomRow` method to ensure the contents of `fields` are a `Seq`, not an `ArrayBuffer`

- Call `toSeq` so that `JacksonParser#parse` still returns a `Seq`, because the check method of `JacksonParserSuite`'s "skipping rows using pushdown filters" depends on the `Seq` type
- Call `toSeq` in `AstBuilder#visitFunctionCall`; otherwise `ctx.argument.asScala.map(expression)` is a `Buffer` in Scala 2.13

- Add a `LongType` match to `ArraySetLike.nullValueHolder`

- Add a `sorted` to ensure the `duplicateColumns` string in the `SchemaUtils.checkColumnNameDuplication` error message has a deterministic order
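
A small sketch of the `mapValues` difference referenced above (an illustration of the cross-version behavior, not the Spark code itself): in Scala 2.13 `mapValues` returns a lazy `MapView`, so an explicit `toMap` restores the strict `Map` that Scala 2.12 code expected, and the expression behaves the same on both versions.

```scala
val m = Map("a" -> 1, "b" -> 2)

// On 2.12, m.mapValues(...) is already a strict Map; on 2.13 it is a MapView.
// Calling toMap yields a strict Map on both Scala versions.
val doubled = m.mapValues(_ * 2).toMap
println(doubled("b"))  // 4
```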

### Why are the changes needed?
We need to support a Scala 2.13 build.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

- Scala 2.12: Pass the Jenkins or GitHub Action

- Scala 2.13: Do the following:

```
dev/change-scala-version.sh 2.13
mvn clean install -DskipTests  -pl sql/catalyst -Pscala-2.13 -am
mvn test -pl sql/catalyst -Pscala-2.13
```

**Before**
```
Tests: succeeded 3853, failed 103, canceled 0, ignored 6, pending 0
*** 3 SUITES ABORTED ***
*** 103 TESTS FAILED ***
```

**After**

```
Tests: succeeded 4035, failed 17, canceled 0, ignored 6, pending 0
*** 1 SUITE ABORTED ***
*** 15 TESTS FAILED ***
```

Closes #29370 from LuciferYang/fix-DataSourceV2AnalysisBaseSuite.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-08-13 11:46:30 -05:00
fqaiser94@gmail.com 0c850c71e7 [SPARK-32511][SQL] Add dropFields method to Column class
### What changes were proposed in this pull request?

Added a new `dropFields` method to the `Column` class.
This method should allow users to drop a `StructField` in a `StructType` column (with similar semantics to the `drop` method on `Dataset`).

### Why are the changes needed?

Often Spark users have to work with deeply nested data e.g. to fix a data quality issue with an existing `StructField`. To do this with the existing Spark APIs, users have to rebuild the entire struct column.

For example, let's say you have the following deeply nested data structure which has a data quality issue (`5` is missing):
```
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val data = spark.createDataFrame(sc.parallelize(
      Seq(Row(Row(Row(1, 2, 3), Row(Row(4, null, 6), Row(7, 8, 9), Row(10, 11, 12)), Row(13, 14, 15))))),
      StructType(Seq(
        StructField("a", StructType(Seq(
          StructField("a", StructType(Seq(
            StructField("a", IntegerType),
            StructField("b", IntegerType),
            StructField("c", IntegerType)))),
          StructField("b", StructType(Seq(
            StructField("a", StructType(Seq(
              StructField("a", IntegerType),
              StructField("b", IntegerType),
              StructField("c", IntegerType)))),
            StructField("b", StructType(Seq(
              StructField("a", IntegerType),
              StructField("b", IntegerType),
              StructField("c", IntegerType)))),
            StructField("c", StructType(Seq(
              StructField("a", IntegerType),
              StructField("b", IntegerType),
              StructField("c", IntegerType))))
          ))),
          StructField("c", StructType(Seq(
            StructField("a", IntegerType),
            StructField("b", IntegerType),
            StructField("c", IntegerType))))
        )))))).cache

data.show(false)
+-------------------------------------------------------------+
|a                                                            |
+-------------------------------------------------------------+
|[[1, 2, 3], [[4,, 6], [7, 8, 9], [10, 11, 12]], [13, 14, 15]]|
+-------------------------------------------------------------+
```
Currently, to drop the missing value users would have to do something like this:
```
val result = data.withColumn("a",
  struct(
    $"a.a",
    struct(
      struct(
        $"a.b.a.a",
        $"a.b.a.c"
      ).as("a"),
      $"a.b.b",
      $"a.b.c"
    ).as("b"),
    $"a.c"
  ))

result.show(false)
+---------------------------------------------------------------+
|a                                                              |
+---------------------------------------------------------------+
|[[1, 2, 3], [[4, 6], [7, 8, 9], [10, 11, 12]], [13, 14, 15]]|
+---------------------------------------------------------------+
```
As you can see above, with the existing methods users must call the `struct` function and list all fields, including fields they don't want to change. This is not ideal as:
>this leads to complex, fragile code that cannot survive schema evolution.
[SPARK-16483](https://issues.apache.org/jira/browse/SPARK-16483)

In contrast, with the method added in this PR, a user could simply do something like this to get the same result:
```
val result = data.withColumn("a", 'a.dropFields("b.a.b"))
result.show(false)
+---------------------------------------------------------------+
|a                                                              |
+---------------------------------------------------------------+
|[[1, 2, 3], [[4, 6], [7, 8, 9], [10, 11, 12]], [13, 14, 15]]|
+---------------------------------------------------------------+

```

This is the second of maybe 3 methods that could be added to the `Column` class to make it easier to manipulate nested data.
Other methods under discussion in [SPARK-22231](https://issues.apache.org/jira/browse/SPARK-22231) include `withFieldRenamed`.
However, this should be added in a separate PR.

### Does this PR introduce _any_ user-facing change?

Only one minor change. If the user submits the following query:
```
df.withColumn("a", $"a".withField(null, null))
```
instead of throwing:
```
java.lang.IllegalArgumentException: requirement failed: fieldName cannot be null
```
it will now throw:
```
java.lang.IllegalArgumentException: requirement failed: col cannot be null
```
I don't believe it should be an issue to change this because:
- neither message is incorrect
- Spark 3.1.0 has yet to be released

but please feel free to correct me if I am wrong.

### How was this patch tested?

New unit tests were added. Jenkins must pass them.

### Related JIRAs:
More discussion on this topic can be found here:
- https://issues.apache.org/jira/browse/SPARK-22231
- https://issues.apache.org/jira/browse/SPARK-16483

Closes #29322 from fqaiser94/SPARK-32511.

Lead-authored-by: fqaiser94@gmail.com <fqaiser94@gmail.com>
Co-authored-by: fqaiser94 <fqaiser94@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-08-13 03:28:25 +00:00
stczwd 60fa8e304d [SPARK-31694][SQL] Add SupportsPartitions APIs on DataSourceV2
### What changes were proposed in this pull request?
There are no partition commands, such as AlterTableAddPartition, supported in DataSourceV2, even though they are widely used in MySQL, Hive, and other datasources. Thus it is necessary to define a partition API to support these commands.

We defined the partition API as part of the Table API, as it will sometimes change table data. A partition is composed of an identifier and properties, where the identifier is defined with an InternalRow and the properties are defined as a Map.
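
A rough sketch of the shape of such an API; the trait and method names below are illustrative only and are not necessarily the exact interfaces added by this PR:

```scala
import org.apache.spark.sql.catalyst.InternalRow

// Hypothetical trait for illustration: a partition is addressed by an
// InternalRow identifier and carries a String-to-String properties map.
trait SupportsPartitions {
  def createPartition(ident: InternalRow, properties: java.util.Map[String, String]): Unit
  def dropPartition(ident: InternalRow): Boolean
  def listPartitionIdentifiers(): Array[InternalRow]
  def loadPartitionMetadata(ident: InternalRow): java.util.Map[String, String]
}
```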

### Does this PR introduce _any_ user-facing change?
Yes. This PR will enable users to use some partition commands

### How was this patch tested?
Run all tests and add some partition API tests

Closes #28617 from stczwd/SPARK-31694.

Authored-by: stczwd <qcsd2011@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-08-12 17:25:47 +00:00
Yuming Wang 5d130f0360 [SPARK-32586][SQL] Fix NumberFormatException error message when ansi is enabled
### What changes were proposed in this pull request?

This PR fixes the error message of `NumberFormatException` when casting invalid input to FractionalType with **ansi** enabled:
```
spark-sql> set spark.sql.ansi.enabled=true;
spark.sql.ansi.enabled	true
spark-sql> create table SPARK_32586 using parquet as select 's' s;
spark-sql> select * from SPARK_32586 where s > 1.13D;
java.lang.NumberFormatException: invalid input syntax for type numeric: columnartorow_value_0
```

After this pr:
```
spark-sql> select * from SPARK_32586 where s > 1.13D;
java.lang.NumberFormatException: invalid input syntax for type numeric: s
```

### Why are the changes needed?

Improve error message.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test.

Closes #29405 from wangyum/SPARK-32586.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-08-12 13:16:57 +09:00
gengjiaan e7c1204f6c [SPARK-32540][SQL] Eliminate the filter clause in aggregate
### What changes were proposed in this pull request?
Spark SQL supports the FILTER clause in aggregates, for example:
`select sum(distinct id) filter (where sex = 'man') from student;`
But sometimes we can eliminate the filter clause in aggregate.
`SELECT COUNT(DISTINCT 1) FILTER (WHERE true) FROM testData;`
could be transformed to
`SELECT COUNT(DISTINCT 1) FROM testData;`
`SELECT COUNT(DISTINCT 1) FILTER (WHERE false) FROM testData;`
could be transformed to
`SELECT 0 FROM testData;`

### Why are the changes needed?
Optimize the filter clause in aggregation

### Does this PR introduce _any_ user-facing change?
'No'.

### How was this patch tested?
New test.

Closes #29369 from beliefer/eliminate-filter-clause.

Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: beliefer <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-08-11 16:20:19 +00:00
xuewei.linxuewei c37357a092 [SPARK-32573][SQL] Anti Join Improvement with EmptyHashedRelation and EmptyHashedRelationWithAllNullKeys
### What changes were proposed in this pull request?
In [SPARK-32290](https://issues.apache.org/jira/browse/SPARK-32290), we introduced several new types of HashedRelation.

* EmptyHashedRelation
* EmptyHashedRelationWithAllNullKeys

They were limited to use only in the NAAJ scenario. These new HashedRelation types could be applied to other scenarios for performance improvements.

* EmptyHashedRelation could also be used in normal anti joins for fast stop
* While AQE is on and the build side is EmptyHashedRelationWithAllNullKeys, NAAJ can be converted to an empty LocalRelation to skip meaningless data iteration, since in single-key NAAJ a null key on the build side drops all records on the streamed side.

This patch includes two changes:

* use EmptyHashedRelation to do fast stop for common anti joins as well
* in AQE, eliminate BroadcastHashJoin (NAAJ) if the build side is an EmptyHashedRelationWithAllNullKeys

### Why are the changes needed?
LeftAntiJoin can apply `fast stop` when the build side is an EmptyHashedRelation, while within AQE an EmptyHashedRelationWithAllNullKeys build side lets us eliminate the NAAJ entirely. This should be a performance improvement for anti joins.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?

* added case in AdaptiveQueryExecSuite.
* added case in HashedRelationSuite.
* Make sure SubquerySuite JoinSuite SQLQueryTestSuite passed.

Closes #29389 from leanken/leanken-SPARK-32573.

Authored-by: xuewei.linxuewei <xuewei.linxuewei@alibaba-inc.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-08-11 06:23:51 +00:00
allisonwang-db 1b7443bd9a [SPARK-32216][SQL] Remove redundant ProjectExec
### What changes were proposed in this pull request?
This PR added a physical rule to remove redundant project nodes. A `ProjectExec` is redundant when
1. It has the same output attributes and order as its child's output when ordering of these attributes is required.
2. It has the same output attributes as its child's output when attribute output ordering is not required.

For example:
After Filter:
```
== Physical Plan ==
*(1) Project [a#14L, b#15L, c#16, key#17]
+- *(1) Filter (isnotnull(a#14L) AND (a#14L > 5))
   +- *(1) ColumnarToRow
      +- FileScan parquet [a#14L,b#15L,c#16,key#17]
```
The `Project a#14L, b#15L, c#16, key#17` is redundant because its output is exactly the same as filter's output.

Before Aggregate:
```
== Physical Plan ==
*(2) HashAggregate(keys=[key#17], functions=[sum(a#14L), last(b#15L, false)], output=[sum_a#39L, key#17, last_b#41L])
+- Exchange hashpartitioning(key#17, 5), true, [id=#77]
   +- *(1) HashAggregate(keys=[key#17], functions=[partial_sum(a#14L), partial_last(b#15L, false)], output=[key#17, sum#49L, last#50L, valueSet#51])
      +- *(1) Project [key#17, a#14L, b#15L]
         +- *(1) Filter (isnotnull(a#14L) AND (a#14L > 100))
            +- *(1) ColumnarToRow
               +- FileScan parquet [a#14L,b#15L,key#17]
```
The `Project key#17, a#14L, b#15L` is redundant because hash aggregate doesn't require child plan's output to be in a specific order.

### Why are the changes needed?

It removes unnecessary query nodes and makes query plan cleaner.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit tests

Closes #29031 from allisonwang-db/remove-project.

Lead-authored-by: allisonwang-db <66282705+allisonwang-db@users.noreply.github.com>
Co-authored-by: allisonwang-db <allison.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-08-11 03:14:15 +00:00
Yuanjian Li b03761e330 [SPARK-32456][SS] Check the Distinct by assuming it as Aggregate for Structured Streaming
### What changes were proposed in this pull request?
Check Distinct nodes by treating them as Aggregate in `UnsupportedOperationChecker` for streaming.

### Why are the changes needed?
We want to fix 2 things here:

1. Give a better error message for Distinct-related operations in append mode without a watermark

We use union of streams as the example; DISTINCT in SQL has the same issue. Since the UNION clause in SQL requires deduplication, the parser generates `Distinct(Union)` and the optimizer rule `ReplaceDistinctWithAggregate` changes it to `Aggregate(Union)`. This logic applies to both batch and streaming queries. However, in streaming, the aggregation will be wrapped by state store operations, so we need extra checking logic in `UnsupportedOperationChecker`.
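
A minimal repro sketch of that streaming case (the sources and sink are arbitrary choices here): the SQL UNION below is parsed as `Distinct(Union)` and later becomes `Aggregate(Union)`.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("union-distinct").getOrCreate()
import spark.implicits._

// Two unbounded sources; UNION (unlike UNION ALL) implies deduplication.
spark.readStream.format("rate").load().select($"value").createOrReplaceTempView("t1")
spark.readStream.format("rate").load().select($"value").createOrReplaceTempView("t2")

val unioned = spark.sql("SELECT value FROM t1 UNION SELECT value FROM t2")

// Append mode without a watermark: before this change the failure surfaced as
// the confusing None.get shown below; with it, a clearer unsupported-operation
// error is reported by the checker.
unioned.writeStream.outputMode("append").format("console").start()
```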

Before this change, the SS union queries in Append mode will get the following confusing error when the watermark is lacking.
```
java.util.NoSuchElementException: None.get
	at scala.None$.get(Option.scala:529)
	at scala.None$.get(Option.scala:527)
	at org.apache.spark.sql.execution.streaming.StateStoreSaveExec.$anonfun$doExecute$9(statefulOperators.scala:346)
	at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
	at org.apache.spark.util.Utils$.timeTakenMs(Utils.scala:561)
	at org.apache.spark.sql.execution.streaming.StateStoreWriter.timeTakenMs(statefulOperators.scala:112)
...
```

2. Make `Distinct` in complete mode runnable.

Before this fix, the distinct in complete mode will throw the exception:
```
Complete output mode not supported when there are no streaming aggregations on streaming DataFrames/Datasets;
```

### Does this PR introduce _any_ user-facing change?
Yes, return a better error message.

### How was this patch tested?
New UT added.

Closes #29256 from xuanyuanking/SPARK-32456.

Authored-by: Yuanjian Li <yuanjian.li@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-08-10 14:01:31 +09:00
allisonwang-db 924c161544 [SPARK-32337][SQL] Show initial plan in AQE plan tree string
### What changes were proposed in this pull request?
This PR adds initial plan in `AdaptiveSparkPlanExec` and generates tree string for both current plan and initial plan. When the adaptive plan is not final, `Current Plan` will be used to indicate current physical plan, and `Final Plan` will be used when the adaptive plan is final. The difference between `Current Plan` and `Final Plan` here is that current plan indicates an intermediate state. The plan is subject to further transformations, while final plan represents an end state, which means the plan will no longer be changed.

Examples:

Before this change:
```
AdaptiveSparkPlan isFinalPlan=true
+- *(3) BroadcastHashJoin
   :- BroadcastQueryStage 2
       ...
```
`EXPLAIN FORMATTED`
```
== Physical Plan ==
AdaptiveSparkPlan (9)
+- BroadcastHashJoin Inner BuildRight (8)
   :- Project (3)
   :  +- Filter (2)
```

After this change
```
AdaptiveSparkPlan isFinalPlan=true
+- == Final Plan ==
   *(3) BroadcastHashJoin
   :- BroadcastQueryStage 2
   :  +- BroadcastExchange
           ...
+- == Initial Plan ==
   SortMergeJoin
   :- Sort
   :  +- Exchange
           ...
```

`EXPLAIN FORMATTED`
```
== Physical Plan ==
AdaptiveSparkPlan (9)
+- == Current Plan ==
   BroadcastHashJoin Inner BuildRight (8)
   :- Project (3)
   :  +- Filter (2)
+- == Initial Plan ==
   BroadcastHashJoin Inner BuildRight (8)
   :- Project (3)
   :  +- Filter (2)
```

### Why are the changes needed?
It provides better visibility into the plan change introduced by AQE.

### Does this PR introduce _any_ user-facing change?
Yes. It changed the AQE plan output string.

### How was this patch tested?
Unit test

Closes #29137 from allisonwang-db/aqe-plan.

Lead-authored-by: allisonwang-db <allison.wang@databricks.com>
Co-authored-by: allisonwang-db <66282705+allisonwang-db@users.noreply.github.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-08-10 04:49:37 +00:00
Liang-Chi Hsieh 7b6e1d5cec [SPARK-25557][SQL] Nested column predicate pushdown for ORC
### What changes were proposed in this pull request?

We added nested column predicate pushdown for Parquet in #27728. This patch extends the feature support to ORC.
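
A hedged usage sketch (the path, schema, and column names below are made up): with this change, a predicate on a nested field of an ORC source can be pushed to the reader, as it already could for Parquet.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("orc-nested-pushdown").getOrCreate()
import spark.implicits._

// Hypothetical ORC dataset with a struct column `address` containing `city`.
val people = spark.read.orc("/tmp/people_orc")

// The filter on `address.city` should now be eligible for pushdown into the
// ORC reader instead of only being applied after rows are materialized.
people.filter($"address.city" === "Amsterdam").explain(true)
```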

### Why are the changes needed?

Extending the feature to ORC for feature parity. Better performance for handling nested predicate pushdown.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit tests.

Closes #28761 from viirya/SPARK-25557.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-08-07 08:07:41 -07:00
Wenchen Fan d5682c13a2 [SPARK-32528][SQL][TEST] The analyze method should make sure the plan is analyzed
### What changes were proposed in this pull request?

This PR updates the `analyze` method to make sure the plan can be resolved. It also fixes some miswritten optimizer tests.

### Why are the changes needed?

It's error-prone if the `analyze` method can return an unresolved plan.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

test only

Closes #29349 from cloud-fan/test.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-08-07 07:36:08 +00:00
Max Gekk 3a437ed22b [SPARK-32501][SQL] Convert null to "null" in structs, maps and arrays while casting to strings
### What changes were proposed in this pull request?
Convert `NULL` elements of maps, structs and arrays to the `"null"` string while converting map/struct/array values to strings. The SQL config `spark.sql.legacy.omitNestedNullInCast.enabled` controls the behaviour: when it is `true`, `NULL` elements of structs/maps/arrays will be omitted; when it is `false`, `NULL` elements will be converted to `"null"`.

### Why are the changes needed?
1. It is impossible to distinguish empty string and null, for instance:
```scala
scala> Seq(Seq(""), Seq(null)).toDF().show
+-----+
|value|
+-----+
|   []|
|   []|
+-----+
```
2. Inconsistent NULL conversions for top-level values and nested columns, for instance:
```scala
scala> sql("select named_struct('c', null), null").show
+---------------------+----+
|named_struct(c, NULL)|NULL|
+---------------------+----+
|                   []|null|
+---------------------+----+
```
3. `.show()` is different from conversions to Hive strings, and as a consequence its output is different from `spark-sql` (sql tests):
```sql
spark-sql> select named_struct('c', null) as struct;
{"c":null}
```
```scala
scala> sql("select named_struct('c', null) as struct").show
+------+
|struct|
+------+
|    []|
+------+
```

4. It is impossible to distinguish empty struct/array from struct/array with null in the current implementation:
```scala
scala> Seq[Seq[String]](Seq(), Seq(null)).toDF.show()
+-----+
|value|
+-----+
|   []|
|   []|
+-----+
```

### Does this PR introduce _any_ user-facing change?
Yes, before:
```scala
scala> Seq(Seq(""), Seq(null)).toDF().show
+-----+
|value|
+-----+
|   []|
|   []|
+-----+
```

After:
```scala
scala> Seq(Seq(""), Seq(null)).toDF().show
+------+
| value|
+------+
|    []|
|[null]|
+------+
```

### How was this patch tested?
By existing test suite `CastSuite`.

Closes #29311 from MaxGekk/nested-null-to-string.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-08-05 12:03:36 +00:00
Max Gekk 7eb6f45688 [SPARK-32499][SQL] Use {} in conversions maps and structs to strings
### What changes were proposed in this pull request?
Change the casting of map and struct values to strings to use `{}` brackets instead of `[]`. The behavior is controlled by the SQL config `spark.sql.legacy.castComplexTypesToString.enabled`. When it is `true`, `CAST` wraps maps and structs with `[]` when casting to strings. Otherwise, if it is `false`, which is the default, maps and structs are wrapped with `{}`.
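
A small sketch of the new default in a spark-shell session (the exact column headers and padding may differ):

```scala
// With the legacy config at its default (false), structs and maps are wrapped
// in {} when cast to string, while arrays keep [].
spark.sql(
  "SELECT CAST(named_struct('a', 1, 'b', 2) AS STRING) AS s, " +
  "CAST(map('k', 1) AS STRING) AS m, CAST(array(1, 2) AS STRING) AS a").show(false)
// Expected to look roughly like:
// +------+--------+------+
// |s     |m       |a     |
// +------+--------+------+
// |{1, 2}|{k -> 1}|[1, 2]|
// +------+--------+------+
```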

### Why are the changes needed?
- To distinguish structs/maps from arrays.
- To make `show`'s output consistent with Hive and conversions to Hive strings.
- To display dataframe content in the same form by `spark-sql` and `show`
- To be consistent with the `*.sql` tests

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
By existing test suite `CastSuite`.

Closes #29308 from MaxGekk/show-struct-map.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-08-04 14:57:09 +00:00
fqaiser94@gmail.com 6d69068057 [SPARK-32521][SQL] Bug-fix: WithFields Expression should not be foldable
### What changes were proposed in this pull request?

Make WithFields Expression not foldable.

### Why are the changes needed?

The following query currently fails on the master branch:
```
sql("SELECT named_struct('a', 1, 'b', 2) a")
.select($"a".withField("c", lit(3)).as("a"))
.show(false)
// java.lang.UnsupportedOperationException: Cannot evaluate expression: with_fields(named_struct(a, 1, b, 2), c, 3)
```
This happens because the Catalyst optimizer sees that the WithFields Expression is foldable and tries to statically evaluate the WithFields Expression (via the ConstantFolding rule), however it cannot do so because WithFields Expression is Unevaluable.

### Does this PR introduce _any_ user-facing change?

Yes, queries like the one shared above will now succeed.
That said, this bug was introduced in Spark 3.1.0 which has yet to be released.

### How was this patch tested?

A new unit test was added.

Closes #29338 from fqaiser94/SPARK-32521.

Lead-authored-by: fqaiser94@gmail.com <fqaiser94@gmail.com>
Co-authored-by: fqaiser94 <fqaiser94@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-08-04 12:11:04 +00:00
gengjiaan 1597d8fcd4 [SPARK-30276][SQL] Support Filter expression allows simultaneous use of DISTINCT
### What changes were proposed in this pull request?
This PR is related to https://github.com/apache/spark/pull/26656.
https://github.com/apache/spark/pull/26656 only supports using the FILTER clause on aggregate expressions without DISTINCT.
This PR enhances the feature so that one or more DISTINCT aggregate expressions can also use the FILTER clause.
Such as:
```
select sum(distinct id) filter (where sex = 'man') from student;
select class_id, sum(distinct id) filter (where sex = 'man') from student group by class_id;
select count(id) filter (where class_id = 1), sum(distinct id) filter (where sex = 'man') from student;
select class_id, count(id) filter (where class_id = 1), sum(distinct id) filter (where sex = 'man') from student group by class_id;
select sum(distinct id), sum(distinct id) filter (where sex = 'man') from student;
select class_id, sum(distinct id), sum(distinct id) filter (where sex = 'man') from student group by class_id;
select class_id, count(id), count(id) filter (where class_id = 1), sum(distinct id), sum(distinct id) filter (where sex = 'man') from student group by class_id;
```

### Why are the changes needed?
Spark SQL only supports using the FILTER clause on aggregate expressions without DISTINCT.
This PR allows the FILTER clause to be used together with DISTINCT.

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
Exists and new UT

Closes #29291 from beliefer/support-distinct-with-filter.

Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: beliefer <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-08-04 04:41:19 +00:00
Max Gekk f3b10f526b [SPARK-32290][SQL][FOLLOWUP] Add version for the SQL config spark.sql.optimizeNullAwareAntiJoin
### What changes were proposed in this pull request?
Add the version `3.1.0` for the SQL config `spark.sql.optimizeNullAwareAntiJoin`.

### Why are the changes needed?
To inform users when the config was added, for example on the page http://spark.apache.org/docs/latest/configuration.html.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By compiling and running `./dev/scalastyle`.

Closes #29335 from MaxGekk/leanken-SPARK-32290-followup.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-08-03 16:05:54 +00:00
Takeshi Yamamuro c6109ba918 [SPARK-32257][SQL] Reports explicit errors for invalid usage of SET/RESET command
### What changes were proposed in this pull request?

This PR modified the parser code to handle invalid usages of a SET/RESET command.
For example;
```
SET spark.sql.ansi.enabled true
```
The above SQL command does not change the configuration value and it just tries to display the value of the configuration
`spark.sql.ansi.enabled true`. This PR disallows using special characters, including spaces, in the configuration name and reports a user-friendly error instead. In the error message, it tells users a workaround to use quotes or a string literal if they still need to specify a configuration with them.

Before this PR:
```
scala> sql("SET spark.sql.ansi.enabled true").show(1, -1)
+---------------------------+-----------+
|key                        |value      |
+---------------------------+-----------+
|spark.sql.ansi.enabled true|<undefined>|
+---------------------------+-----------+
```

After this PR:
```
scala> sql("SET spark.sql.ansi.enabled true")
org.apache.spark.sql.catalyst.parser.ParseException:
Expected format is 'SET', 'SET key', or 'SET key=value'. If you want to include special characters in key, please use quotes, e.g., SET `ke y`=value.(line 1, pos 0)

== SQL ==
SET spark.sql.ansi.enabled true
^^^
```

### Why are the changes needed?

For better user-friendly errors.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added tests in `SparkSqlParserSuite`.

Closes #29146 from maropu/SPARK-32257.

Lead-authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Co-authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-08-03 13:00:07 +00:00
Kent Yao 3deb59d5c2 [SPARK-31709][SQL] Proper base path for database/table location when it is a relative path
### What changes were proposed in this pull request?

Currently, the user home directory is used as the base path for database and table locations when their locations are specified with relative paths, e.g.
```sql
> set spark.sql.warehouse.dir;
spark.sql.warehouse.dir	file:/Users/kentyao/Downloads/spark/spark-3.1.0-SNAPSHOT-bin-20200512/spark-warehouse/
spark-sql> create database loctest location 'loctestdbdir';

spark-sql> desc database loctest;
Database Name	loctest
Comment
Location	file:/Users/kentyao/Downloads/spark/spark-3.1.0-SNAPSHOT-bin-20200512/loctestdbdir
Owner	kentyao

spark-sql> create table loctest(id int) location 'loctestdbdir';
spark-sql> desc formatted loctest;
id	int	NULL

# Detailed Table Information
Database	default
Table	loctest
Owner	kentyao
Created Time	Thu May 14 16:29:05 CST 2020
Last Access	UNKNOWN
Created By	Spark 3.1.0-SNAPSHOT
Type	EXTERNAL
Provider	parquet
Location	file:/Users/kentyao/Downloads/spark/spark-3.1.0-SNAPSHOT-bin-20200512/loctestdbdir
Serde Library	org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
InputFormat	org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
OutputFormat	org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
```
The user home directory is not always warehouse-related, cannot be changed at runtime, and is shared as the parent directory by both databases and tables. Meanwhile, we use the table path as the parent directory for relative partition locations.

The config `spark.sql.warehouse.dir` represents `the default location for managed databases and tables`.
For databases, the case above does not seem to follow these semantics, because it should use `spark.sql.warehouse.dir` as the base path instead.

For tables, it seems to be right, but here I suggest enriching the meaning so that it also serves as the base path for external tables whose locations are relative paths.

With changes in this PR,

The location of a database will be `warehouseDir/dbpath` when `dbpath` is relative.
The location of a table will be `dbpath/tblpath` when `tblpath` is relative.

### Why are the changes needed?

bugfix and improvement

Firstly, the databases with relative locations should be created under the default location specified by `spark.sql.warehouse.dir`.

Secondly, the external tables with relative paths may also follow this behavior for consistency.

Lastly, databases, tables and partitions with relative paths should choose their base paths in the same way.

### Does this PR introduce _any_ user-facing change?

Yes, this PR changes the `createDatabase`, `alterDatabase`, `createTable` and `alterTable` APIs and related DDLs. If the LOCATION clause is followed by a relative path, the root path will be `spark.sql.warehouse.dir` for databases, and `spark.sql.warehouse.dir` / `dbPath` for tables.

e.g.

#### after
```sql
spark-sql> desc database loctest;
Database Name	loctest
Comment
Location	file:/Users/kentyao/Downloads/spark/spark-3.1.0-SNAPSHOT-bin-SPARK-31709/spark-warehouse/loctest
Owner	kentyao
spark-sql> use loctest;
spark-sql> create table loctest(id int) location 'loctest';
20/05/14 18:18:02 WARN InMemoryFileIndex: The directory file:/Users/kentyao/Downloads/spark/spark-3.1.0-SNAPSHOT-bin-SPARK-31709/loctest was not found. Was it deleted very recently?
20/05/14 18:18:02 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.
20/05/14 18:18:03 WARN HiveConf: HiveConf of name hive.internal.ss.authz.settings.applied.marker does not exist
20/05/14 18:18:03 WARN HiveConf: HiveConf of name hive.stats.jdbc.timeout does not exist
20/05/14 18:18:03 WARN HiveConf: HiveConf of name hive.stats.retries.wait does not exist
spark-sql> desc formatted loctest;
id	int	NULL

# Detailed Table Information
Database	loctest
Table	loctest
Owner	kentyao
Created Time	Thu May 14 18:18:03 CST 2020
Last Access	UNKNOWN
Created By	Spark 3.1.0-SNAPSHOT
Type	EXTERNAL
Provider	parquet
Location	file:/Users/kentyao/Downloads/spark/spark-3.1.0-SNAPSHOT-bin-SPARK-31709/spark-warehouse/loctest/loctest
Serde Library	org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
InputFormat	org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
OutputFormat	org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
spark-sql> alter table loctest set location 'loctest2'
         > ;
spark-sql> desc formatted loctest;
id	int	NULL

# Detailed Table Information
Database	loctest
Table	loctest
Owner	kentyao
Created Time	Thu May 14 18:18:03 CST 2020
Last Access	UNKNOWN
Created By	Spark 3.1.0-SNAPSHOT
Type	EXTERNAL
Provider	parquet
Location	file:/Users/kentyao/Downloads/spark/spark-3.1.0-SNAPSHOT-bin-SPARK-31709/spark-warehouse/loctest/loctest2
Serde Library	org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
InputFormat	org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
OutputFormat	org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
```
### How was this patch tested?

Add unit tests.

Closes #28527 from yaooqinn/SPARK-31709.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-08-03 12:48:22 +00:00
beliefer 42f9ee4c7d [SPARK-24884][SQL] Support regexp function regexp_extract_all
### What changes were proposed in this pull request?
`regexp_extract_all` is a very useful function that expands the capabilities of `regexp_extract`.
Here is a brief description of this function.
```
SELECT regexp_extract('1a 2b 14m', '\d+', 0); -- 1
SELECT regexp_extract_all('1a 2b 14m', '\d+', 0); -- [1, 2, 14]
SELECT regexp_extract('1a 2b 14m', '(\d+)([a-z]+)', 2); -- 'a'
SELECT regexp_extract_all('1a 2b 14m', '(\d+)([a-z]+)', 2); -- ['a', 'b', 'm']
```
Several mainstream databases support this syntax.
**Presto:**
https://prestodb.io/docs/current/functions/regexp.html

**Pig:**
https://pig.apache.org/docs/latest/api/org/apache/pig/builtin/REGEX_EXTRACT_ALL.html

Note: This PR picks up the work of https://github.com/apache/spark/pull/21985

### Why are the changes needed?
`regexp_extract_all` is a very useful function and makes work easier.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
New UT

Closes #27507 from beliefer/support-regexp_extract_all.

Lead-authored-by: beliefer <beliefer@163.com>
Co-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-08-03 06:03:55 +00:00
Robert (Bobby) Evans 713124d5e3 [SPARK-32274][SQL] Make SQL cache serialization pluggable
### What changes were proposed in this pull request?

Add a config to let users change how SQL/Dataframe data is compressed when cached.

This adds a few new classes/APIs for use with this config.

1. `CachedBatch` is a trait used to tag data that is intended to be cached. It has a few APIs that lets us keep the compression/serialization of the data separate from the metrics about it.
2. `CachedBatchSerializer` provides the APIs that must be implemented to cache data.
    * `convertForCache` is an API that runs a cached spark plan and turns its result into an `RDD[CachedBatch]`.  The actual caching is done outside of this API
   * `buildFilter` is an API that takes a set of predicates and builds a filter function that can be used to filter the `RDD[CachedBatch]` returned by `convertForCache`
   * `decompressColumnar` decompresses an `RDD[CachedBatch]` into an `RDD[ColumnarBatch]` This is only used for a limited set of data types.  These data types may expand in the future.  If they do we can add in a new API with a default value that says which data types this serializer supports.
   * `decompressToRows` decompresses an `RDD[CachedBatch]` into an `RDD[InternalRow]` this API, like `decompressColumnar` decompresses the data in `CachedBatch` but turns it into `InternalRow`s, typically using code generation for performance reasons.

There is also an API that lets you reuse the current filtering based on min/max values. `SimpleMetricsCachedBatch` and `SimpleMetricsCachedBatchSerializer`.

### Why are the changes needed?

This lets users explore different types of compression and compression ratios.

### Does this PR introduce _any_ user-facing change?

This adds in a single config, and exposes some developer API classes described above.

### How was this patch tested?

I ran the unit tests around this and I also did some manual performance tests. I could not find any performance difference between the old and new code, and if there is any it is within error.

Closes #29067 from revans2/pluggable_cache_serializer.

Authored-by: Robert (Bobby) Evans <bobby@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-08-03 03:15:54 +00:00
Yuanjian Li 354313b6bc [SPARK-31894][SS][FOLLOW-UP] Rephrase the config doc
### What changes were proposed in this pull request?
Address comment in https://github.com/apache/spark/pull/28707#discussion_r461102749

### Why are the changes needed?
Hide the implementation details in the config doc.

### Does this PR introduce _any_ user-facing change?
Config doc change.

### How was this patch tested?
Document only.

Closes #29315 from xuanyuanking/SPARK-31894-follow.

Authored-by: Yuanjian Li <yuanjian.li@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-07-31 14:16:41 +00:00
Takeshi Yamamuro 30e3042dc5 [SPARK-32488][SQL] Use @parser::members and @lexer::members to avoid generating unused code
### What changes were proposed in this pull request?

This PR aims to update `SqlBase.g4` to avoid generating unused code.
Currently, ANTLR generates unused methods and variables; `isValidDecimal` and `isHint` are only used in the generated lexer. This PR changes the code to use `parser::members` and `lexer::members` to avoid that.

### Why are the changes needed?

To reduce unnecessary code.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #29296 from maropu/UpdateSqlBase.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-07-30 07:51:27 +00:00
Max Gekk 99a855575c [SPARK-32431][SQL] Check duplicate nested columns in read from in-built datasources
### What changes were proposed in this pull request?
When `spark.sql.caseSensitive` is `false` (the default), check that there are no duplicate column names on the same level (top level or nested levels) when reading from the in-built datasources Parquet, ORC, Avro and JSON. If such duplicate columns exist, throw the exception:
```
org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the data schema:
```

### Why are the changes needed?
To make handling of duplicate nested columns similar to handling of duplicate top-level columns, i.e. output the same error when `spark.sql.caseSensitive` is `false`:
```Scala
org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the data schema: `camelcase`
```

Checking of top-level duplicates was introduced by https://github.com/apache/spark/pull/17758.

### Does this PR introduce _any_ user-facing change?
Yes. For the example from SPARK-32431:

ORC:
```scala
java.io.IOException: Error reading file: file:/private/var/folders/p3/dfs6mf655d7fnjrsjvldh0tc0000gn/T/spark-c02c2f9a-0cdc-4859-94fc-b9c809ca58b1/part-00001-63e8c3f0-7131-4ec9-be02-30b3fdd276f4-c000.snappy.orc
	at org.apache.orc.impl.RecordReaderImpl.nextBatch(RecordReaderImpl.java:1329)
	at org.apache.orc.mapreduce.OrcMapreduceRecordReader.ensureBatch(OrcMapreduceRecordReader.java:78)
...
Caused by: java.io.EOFException: Read past end of RLE integer from compressed stream Stream for column 3 kind DATA position: 6 length: 6 range: 0 offset: 12 limit: 12 range 0 = 0 to 6 uncompressed: 3 to 3
	at org.apache.orc.impl.RunLengthIntegerReaderV2.readValues(RunLengthIntegerReaderV2.java:61)
	at org.apache.orc.impl.RunLengthIntegerReaderV2.next(RunLengthIntegerReaderV2.java:323)
```

JSON:
```scala
+------------+
|StructColumn|
+------------+
|        [,,]|
+------------+
```

Parquet:
```scala
+------------+
|StructColumn|
+------------+
|     [0,, 1]|
+------------+
```

Avro:
```scala
+------------+
|StructColumn|
+------------+
|        [,,]|
+------------+
```

After the changes, Parquet, ORC, JSON and Avro output the same error:
```scala
Found duplicate column(s) in the data schema: `camelcase`;
org.apache.spark.sql.AnalysisException: Found duplicate column(s) in the data schema: `camelcase`;
	at org.apache.spark.sql.util.SchemaUtils$.checkColumnNameDuplication(SchemaUtils.scala:112)
	at org.apache.spark.sql.util.SchemaUtils$.checkSchemaColumnNameDuplication(SchemaUtils.scala:51)
	at org.apache.spark.sql.util.SchemaUtils$.checkSchemaColumnNameDuplication(SchemaUtils.scala:67)
```

### How was this patch tested?
Run modified test suites:
```
$ build/sbt "sql/test:testOnly org.apache.spark.sql.FileBasedDataSourceSuite"
$ build/sbt "avro/test:testOnly org.apache.spark.sql.avro.*"
```
and added new UT to `SchemaUtilsSuite`.

Closes #29234 from MaxGekk/nested-case-insensitive-column.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-07-30 06:05:55 +00:00
Max Gekk d897825d2d [SPARK-32346][SQL] Support filters pushdown in Avro datasource
### What changes were proposed in this pull request?
In the PR, I propose to support pushed-down filters in the Avro datasource, both V1 and V2 (a short usage sketch follows the list below).
1. Added new SQL config `spark.sql.avro.filterPushdown.enabled` to control filters pushdown to Avro datasource. It is on by default.
2. Renamed `CSVFilters` to `OrderedFilters`.
3. `OrderedFilters` is used in `AvroFileFormat` (DSv1) and in `AvroPartitionReaderFactory` (DSv2)
4. Modified `AvroDeserializer` to return None from the `deserialize` method when pushdown filters return `false`.
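
A hedged usage sketch (the file path and schema are hypothetical); the only user-visible knob is the new config from item 1:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("avro-pushdown").getOrCreate()
import spark.implicits._

// Pushdown is on by default; it can be switched off for comparison.
spark.conf.set("spark.sql.avro.filterPushdown.enabled", "true")

val events = spark.read.format("avro").load("/tmp/events.avro")

// With pushdown enabled, rows failing the predicate can be skipped during
// deserialization instead of being materialized and filtered afterwards.
events.filter($"id" > 100).count()
```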

### Why are the changes needed?
The changes improve performance on synthetic benchmarks up to **2** times on JDK 11:
```
OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws
Intel(R) Xeon(R) CPU E5-2670 v2  2.50GHz
Filters pushdown:                         Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
w/o filters                                        9614           9669          54          0.1        9614.1       1.0X
pushdown disabled                                 10077          10141          66          0.1       10077.2       1.0X
w/ filters                                         4681           4713          29          0.2        4681.5       2.1X
```

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
- Added UT to `AvroCatalystDataConversionSuite` and `AvroSuite`
- Re-running `AvroReadBenchmark` using Amazon EC2:

| Item | Description |
| ---- | ----|
| Region | us-west-2 (Oregon) |
| Instance | r3.xlarge (spot instance) |
| AMI | ami-06f2f779464715dc5 (ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1) |
| Java | OpenJDK8/11 installed by`sudo add-apt-repository ppa:openjdk-r/ppa` & `sudo apt install openjdk-11-jdk`|

and `./dev/run-benchmarks`:
```python
#!/usr/bin/env python3

import os
from sparktestsupport.shellutils import run_cmd

benchmarks = [
  ['avro/test', 'org.apache.spark.sql.execution.benchmark.AvroReadBenchmark']
]

print('Set SPARK_GENERATE_BENCHMARK_FILES=1')
os.environ['SPARK_GENERATE_BENCHMARK_FILES'] = '1'

for b in benchmarks:
    print("Run benchmark: %s" % b[1])
    run_cmd(['build/sbt', '%s:runMain %s' % (b[0], b[1])])
```

Closes #29145 from MaxGekk/avro-filters-pushdown.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
2020-07-30 01:37:42 +08:00
Terry Kim 45b7212fd3 [SPARK-32401][SQL] Migrate function related commands to use UnresolvedFunc to resolve function identifier
### What changes were proposed in this pull request?

This PR proposes to migrate the following function related commands to use `UnresolvedFunc` to resolve function identifier:
- DROP FUNCTION
- DESCRIBE FUNCTION
- SHOW FUNCTIONS

`DropFunctionStatement`, `DescribeFunctionStatement` and `ShowFunctionsStatement` logical plans are replaced with `DropFunction`, `DescribeFunction` and `ShowFunctions` logical plans respectively, and each contains `UnresolvedFunc` as its child so that it can be resolved in `Analyzer`.

### Why are the changes needed?

Migrating to the new resolution framework, which resolves `UnresolvedFunc` in `Analyzer`.

### Does this PR introduce _any_ user-facing change?

The exception message thrown when a catalog is resolved to v2 has been consolidated to:
`function is only supported in v1 catalog`

Previously, it printed out the command used. E.g.,:
`CREATE FUNCTION is only supported in v1 catalog`

### How was this patch tested?

Updated existing tests.

Closes #29198 from imback82/function_framework.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-07-29 03:55:48 +00:00
Michael Munday a3d80564ad [SPARK-32458][SQL][TESTS] Fix incorrectly sized row value reads
### What changes were proposed in this pull request?
Updates to tests to use correctly sized `getInt` or `getLong` calls.

### Why are the changes needed?
The reads were incorrectly sized (i.e. `putLong` paired with `getInt` and `putInt` paired with `getLong`). This causes test failures on big-endian systems.
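
A small, self-contained illustration (plain JDK `ByteBuffer`, not the Spark test code) of why pairing a long-sized write with an int-sized read is endianness-sensitive:

```scala
import java.nio.{ByteBuffer, ByteOrder}

// Write 8 bytes, then read back only the first 4: the result depends on which
// half of the long those 4 bytes hold.
val le = ByteBuffer.allocate(8).order(ByteOrder.LITTLE_ENDIAN)
le.putLong(0, 42L)
println(le.getInt(0)) // 42: little-endian puts the low-order bytes first

val be = ByteBuffer.allocate(8).order(ByteOrder.BIG_ENDIAN)
be.putLong(0, 42L)
println(be.getInt(0)) // 0: big-endian puts the high-order bytes first
```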

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Tests were run on a big-endian system (s390x). This change is unlikely to have any practical effect on little-endian systems.

Closes #29258 from mundaym/fix-row.

Authored-by: Michael Munday <mike.munday@ibm.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-07-28 10:36:20 -07:00
yi.wu ca1ecf7f9f [SPARK-32459][SQL] Support WrappedArray as customCollectionCls in MapObjects
### What changes were proposed in this pull request?

This PR supports `WrappedArray` as `customCollectionCls` in `MapObjects`.

### Why are the changes needed?

This helps fix the regression caused by SPARK-31826. The following test passes in branch-3.0 but fails on the master branch:

```scala
test("WrappedArray") {
    val myUdf = udf((a: WrappedArray[Int]) =>
      WrappedArray.make[Int](Array(a.head + 99)))
    checkAnswer(Seq(Array(1))
      .toDF("col")
      .select(myUdf(Column("col"))),
      Row(ArrayBuffer(100)))
  }
```

In SPARK-31826, we've changed the catalyst-to-scala converter from `CatalystTypeConverters` to `ExpressionEncoder.deserializer`. However, `CatalystTypeConverters` supports `WrappedArray` while `ExpressionEncoder.deserializer` doesn't.

### Does this PR introduce _any_ user-facing change?

No,  SPARK-31826 is merged into master and branch-3.1, which haven't been released.

### How was this patch tested?

Added a new test for `WrappedArray` in `UDFSuite`; Also updated `ObjectExpressionsSuite` for `MapObjects`.

Closes #29261 from Ngone51/fix-wrappedarray.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-07-28 12:24:15 +00:00
xuewei.linxuewei 12b9787a7f [SPARK-32290][SQL] SingleColumn Null Aware Anti Join Optimize
### What changes were proposed in this pull request?
Normally, a null-aware anti join will be planned into a BroadcastNestedLoopJoin, which is very time consuming, for instance in TPCH Query 16.

```
select
    p_brand,
    p_type,
    p_size,
    count(distinct ps_suppkey) as supplier_cnt
from
    partsupp,
    part
where
    p_partkey = ps_partkey
    and p_brand <> 'Brand#45'
    and p_type not like 'MEDIUM POLISHED%'
    and p_size in (49, 14, 23, 45, 19, 3, 36, 9)
    and ps_suppkey not in (
        select
            s_suppkey
        from
            supplier
        where
            s_comment like '%Customer%Complaints%'
    )
group by
    p_brand,
    p_type,
    p_size
order by
    supplier_cnt desc,
    p_brand,
    p_type,
    p_size
```

In the above query, this will be planned into

LeftAnti
    condition Or((ps_suppkey=s_suppkey), IsNull(ps_suppkey=s_suppkey))

Inside BroadcastNestedLoopJoinExec this performs O(M\*N) work, BUT if there is only a single column in the NAAJ, we can always turn the build side into a HashSet, and the streamed side just needs to look up in the HashSet, so the calculation is optimized to O(M).

But this optimization only targets the single-column null-aware anti join case, because multi-column support is much more complicated; we might be able to support multi-column in the future.
After applying this patch, TPCH Query 16 performance improves from 41 minutes to 30 seconds.

The semantic of null-aware anti join is:

![image](https://user-images.githubusercontent.com/17242071/88077041-66a39a00-cbad-11ea-8fb6-c235c4d219b4.png)
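
For a concrete feel of those semantics, a small spark-shell sketch (table names and values are arbitrary): a single NULL on the build side makes every NOT IN predicate non-true, so the anti join returns no rows.

```scala
import spark.implicits._

Seq(Some(1), Some(2), None).toDF("a").createOrReplaceTempView("l")
Seq(Some(1), None).toDF("b").createOrReplaceTempView("r")

// For a = 2: 2 NOT IN (1, NULL) evaluates to NULL, which is not true,
// so the row is dropped; the other rows are dropped for similar reasons.
spark.sql("SELECT * FROM l WHERE a NOT IN (SELECT b FROM r)").show()
// Empty result.
```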

### Why are the changes needed?
TPCH is a common benchmark for distributed compute engines. All of the other 21 queries work fine on Spark except for Query 16; applying this patch will make Spark more competitive among these popular engines. BTW, this patch has restricted rules and only applies to the single-column NAAJ case, which is safe enough.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
1. SQLQueryTestSuite with NOT IN keyword SQL, add CONFIG_DIM with spark.sql.optimizeNullAwareAntiJoin on and off
2. added case in org.apache.spark.sql.JoinSuite.
3. added case in org.apache.spark.sql.SubquerySuite.
4. Compare performance before and after applying this patch against TPCH Query 16.
5. config combination against e2e test with the following

```
Map(
  "spark.sql.optimizeNullAwareAntiJoin" -> "true",
  "spark.sql.adaptive.enabled" -> "false",
  "spark.sql.codegen.wholeStage" -> "false"
),
Map(
  "spark.sql.optimizeNullAwareAntiJoin" -> "true",
  "spark.sql.adaptive.enabled" -> "false",
  "spark.sql.codegen.wholeStage" -> "true"
),
Map(
  "spark.sql.optimizeNullAwareAntiJoin" -> "true",
  "spark.sql.adaptive.enabled" -> "true",
  "spark.sql.codegen.wholeStage" -> "false"
),
Map(
  "spark.sql.optimizeNullAwareAntiJoin" -> "true",
  "spark.sql.adaptive.enabled" -> "true",
  "spark.sql.codegen.wholeStage" -> "true"
)
```

Closes #29104 from leanken/leanken-SPARK-32290.

Authored-by: xuewei.linxuewei <xuewei.linxuewei@alibaba-inc.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-07-28 04:42:15 +00:00
Frank Yin 8323c8eb56 [SPARK-32059][SQL] Allow nested schema pruning thru window/sort plans
### What changes were proposed in this pull request?
This PR is intended to solve schema pruning not working with window functions, as described in SPARK-32059. It also solves schema pruning not working with `Sort`, and generalizes to the `Project->Filter->[any node that can be pruned]` pattern.
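
A hedged example of the kind of query this enables (the parquet path and nested schema are hypothetical): with the change, the file scan should only need the nested fields the query actually touches, even though a Window sits between the scan and the projection.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.rank

val spark = SparkSession.builder().master("local[*]").appName("nested-pruning").getOrCreate()
import spark.implicits._

// Hypothetical dataset with a struct column `item` containing `category` and `price`.
val orders = spark.read.parquet("/tmp/orders")

val w = Window.partitionBy($"item.category").orderBy($"item.price")
orders
  .select($"item.price".as("price"), rank().over(w).as("rnk"))
  .explain(true)
// The scan's ReadSchema should list only item.category and item.price,
// not the full item struct.
```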

### Why are the changes needed?
This is needed because of performance issues when querying nested structures with window functions as well as sorting.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Introduced two tests: 1) optimizer planning level 2) end-to-end tests with SQL queries.

Closes #28898 from frankyin-factual/master.

Authored-by: Frank Yin <frank@factual.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-07-28 10:00:21 +09:00
Kent Yao d315ebf3a7 [SPARK-32424][SQL] Fix silent data change for timestamp parsing if overflow happens
### What changes were proposed in this pull request?

When using the `Seconds.toMicros` API to convert epoch seconds to microseconds,

```scala
 /**
     * Equivalent to
     * {@link #convert(long, TimeUnit) MICROSECONDS.convert(duration, this)}.
     * @param duration the duration
     * @return the converted duration,
     * or {@code Long.MIN_VALUE} if conversion would negatively
     * overflow, or {@code Long.MAX_VALUE} if it would positively overflow.
     */
```
This PR changes it to `Math.multiplyExact(epochSeconds, MICROS_PER_SECOND)`.
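
A tiny sketch of the difference, using plain JDK calls (1000000L stands in for `MICROS_PER_SECOND`):

```scala
import java.util.concurrent.TimeUnit

// TimeUnit saturates on overflow instead of failing:
TimeUnit.SECONDS.toMicros(Long.MaxValue)    // returns Long.MaxValue, silently clamped

// Math.multiplyExact surfaces the overflow instead:
Math.multiplyExact(Long.MaxValue, 1000000L) // throws ArithmeticException: long overflow
```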

### Why are the changes needed?

fix silent data change between 3.x and 2.x
```
 ~/Downloads/spark/spark-3.1.0-SNAPSHOT-bin-20200722   bin/spark-sql -S -e "select to_timestamp('300000', 'y');"
+294247-01-10 12:00:54.775807
```
```
 kentyaohulk  ~/Downloads/spark/spark-2.4.5-bin-hadoop2.7  bin/spark-sql -S  -e "select to_timestamp('300000', 'y');"
284550-10-19 15:58:1010.448384
```

### Does this PR introduce _any_ user-facing change?

Yes, we will raise an `ArithmeticException` instead of giving a wrong answer if overflow happens.

### How was this patch tested?

add unit test

Closes #29220 from yaooqinn/SPARK-32424.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-07-27 17:03:14 +00:00
Kent Yao d3596c04b0 [SPARK-32406][SQL] Make RESET syntax support single configuration reset
### What changes were proposed in this pull request?

This PR extends the RESET command to support resetting SQL configurations one by one.
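
An illustrative spark-shell session with the extended syntax (the particular key is just an example):

```scala
spark.sql("SET spark.sql.shuffle.partitions=10")

// New: reset only the named configuration back to its default ...
spark.sql("RESET spark.sql.shuffle.partitions")

// ... while a bare RESET still restores all runtime configurations.
spark.sql("RESET")
```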

### Why are the changes needed?

Currently, the RESET command only supports restoring all of the runtime configurations to their defaults. In most cases, users do not want this, but just want to restore one or a small group of settings.
The SET command can work as a workaround, but you have to keep the defaults in mind or in temp variables, which turns out to be not very convenient.

Hive supports this:
https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Clients#HiveServer2Clients-BeelineExample

`reset <key>`: Resets the value of a particular configuration variable (key) to the default value. Note: If you misspell the variable name, Beeline will not show an error.

PostgreSQL supports this too

https://www.postgresql.org/docs/9.1/sql-reset.html

### Does this PR introduce _any_ user-facing change?

Yes, RESET can now restore a single configuration.
### How was this patch tested?

add new unit tests.

Closes #29202 from yaooqinn/SPARK-32406.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-07-24 09:13:26 -07:00
Liang-Chi Hsieh 84efa04c57 [SPARK-32308][SQL] Move by-name resolution logic of unionByName from API code to analysis phase
### What changes were proposed in this pull request?

Currently the by-name resolution logic of `unionByName` lives in API code. This patch moves the logic to the analysis phase.
See https://github.com/apache/spark/pull/28996#discussion_r453460284.

### Why are the changes needed?

Logically we should do resolution in the analysis phase. This refactoring cleans up the API method and makes resolution consistent.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit tests.

Closes #29107 from viirya/move-union-by-name.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-07-24 04:33:18 +00:00
Max Gekk 658e87471c [SPARK-30648][SQL][FOLLOWUP] Refactoring of JsonFilters: move config checking out
### What changes were proposed in this pull request?
Refactoring of `JsonFilters`:
- Add an assert to the `skipRow` method to check the input `index`
- Move checking of the SQL config `spark.sql.json.filterPushdown.enabled` from `JsonFilters` to `JacksonParser`.

### Why are the changes needed?
1. The assert should catch incorrect usage of `JsonFilters`
2. Moving the config check out of `JsonFilters` makes it consistent with `OrderedFilters` (see https://github.com/apache/spark/pull/29145).
3. `JsonFilters` can be used by other datasources in the future and should not depend on the JSON configs.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By existing tests suites:
```
$ build/sbt "sql/test:testOnly org.apache.spark.sql.execution.datasources.json.*"
$ build/sbt "test:testOnly org.apache.spark.sql.catalyst.json.*"
```

Closes #29206 from MaxGekk/json-filters-pushdown-followup.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-07-24 09:54:11 +09:00
Sean Owen be2eca22e9 [SPARK-32398][TESTS][CORE][STREAMING][SQL][ML] Update to scalatest 3.2.0 for Scala 2.13.3+
### What changes were proposed in this pull request?

Updates to scalatest 3.2.0. Though it looks large, 99% of the changes are just updates for the new location of scalatest classes.

### Why are the changes needed?

3.2.0+ has a fix that is required for Scala 2.13.3+ compatibility.

### Does this PR introduce _any_ user-facing change?

No, only affects tests.

### How was this patch tested?

Existing tests.

Closes #29196 from srowen/SPARK-32398.

Authored-by: Sean Owen <srowen@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-07-23 16:20:17 -07:00
Terry Kim 35345e30e5 [SPARK-32374][SQL] Disallow setting properties when creating temporary views
### What changes were proposed in this pull request?

Currently, you can specify properties when creating a temporary view. However, the specified properties are not used and can be misleading.

This PR proposes to disallow specifying properties when creating temporary views.

### Why are the changes needed?

To avoid confusion by disallowing specifying unused properties.

### Does this PR introduce _any_ user-facing change?

Yes, now if you create a temporary view with properties, the operation will fail:
```
scala> sql("CREATE TEMPORARY VIEW tv TBLPROPERTIES('p1'='v1') AS SELECT 1 AS c1")
org.apache.spark.sql.catalyst.parser.ParseException:
Operation not allowed: CREATE TEMPORARY VIEW ... TBLPROPERTIES (property_name = property_value, ...)(line 1, pos 0)

== SQL ==
CREATE TEMPORARY VIEW tv TBLPROPERTIES('p1'='v1') AS SELECT 1 AS c1
^^^

```

### How was this patch tested?

Added tests

Closes #29167 from imback82/disable_properties_temp_view.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-07-23 14:32:10 +00:00
yi.wu a8e3de36e7 [SPARK-32280][SPARK-32372][SQL] ResolveReferences.dedupRight should only rewrite attributes for ancestor nodes of the conflict plan
### What changes were proposed in this pull request?

This PR refactors `ResolveReferences.dedupRight` to make sure it only rewrites attributes for ancestor nodes of the conflicting plan.

### Why are the changes needed?

This is a bug fix.

```scala
sql("SELECT name, avg(age) as avg_age FROM person GROUP BY name")
  .createOrReplaceTempView("person_a")
sql("SELECT p1.name, p2.avg_age FROM person p1 JOIN person_a p2 ON p1.name = p2.name")
  .createOrReplaceTempView("person_b")
sql("SELECT * FROM person_a UNION SELECT * FROM person_b")
  .createOrReplaceTempView("person_c")
sql("SELECT p1.name, p2.avg_age FROM person_c p1 JOIN person_c p2 ON p1.name = p2.name").show()
```
When executing the above query, we'll hit the error:

```scala
[info]   Failed to analyze query: org.apache.spark.sql.AnalysisException: Resolved attribute(s) avg_age#231 missing from name#223,avg_age#218,id#232,age#234,name#233 in operator !Project [name#233, avg_age#231]. Attribute(s) with the same name appear in the operation: avg_age. Please check if the right attribute(s) are used.;;
...
```

The plan below is the problematic plan, which is the right plan of a `Join` operator and has conflicting plans compared to the left plan. In this problematic plan, the first `Aggregate` operator (the one under the first child of `Union`) becomes a conflicting plan compared to the left one and has a rewrite attribute pair `avg_age#218` -> `avg_age#231`. With the current `dedupRight` logic, we first replace this `Aggregate` with a new one, and then rewrite the attribute `avg_age#218` from the bottom up. As you can see, projects with the attribute `avg_age#218` under the second child of the `Union` can also be replaced with `avg_age#231` (that means we also rewrite attributes for non-ancestor plans of the conflicting plan). Ideally, the attribute `avg_age#218` in the second `Aggregate` operator (the one under the second child of `Union`) should also be replaced, but it isn't, because it is an `Alias` while we only rewrite `Attribute`s. Therefore, the project above the second `Aggregate` becomes unresolved.

```scala
:

:
+- SubqueryAlias p2
   +- SubqueryAlias person_c
      +- Distinct
         +- Union
            :- Project [name#233, avg_age#231]
            :  +- SubqueryAlias person_a
            :     +- Aggregate [name#233], [name#233, avg(cast(age#234 as bigint)) AS avg_age#231]
            :        +- SubqueryAlias person
            :           +- SerializeFromObject [knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$Person, true])).id AS id#232, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$Person, true])).name, true, false) AS name#233, knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$Person, true])).age AS age#234]
            :              +- ExternalRDD [obj#165]
            +- Project [name#233 AS name#227, avg_age#231 AS avg_age#228]
               +- Project [name#233, avg_age#231]
                  +- SubqueryAlias person_b
                     +- !Project [name#233, avg_age#231]
                        +- Join Inner, (name#233 = name#223)
                           :- SubqueryAlias p1
                           :  +- SubqueryAlias person
                           :     +- SerializeFromObject [knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$Person, true])).id AS id#232, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$Person, true])).name, true, false) AS name#233, knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$Person, true])).age AS age#234]
                           :        +- ExternalRDD [obj#165]
                           +- SubqueryAlias p2
                              +- SubqueryAlias person_a
                                 +- Aggregate [name#223], [name#223, avg(cast(age#224 as bigint)) AS avg_age#218]
                                    +- SubqueryAlias person
                                       +- SerializeFromObject [knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$Person, true])).id AS id#222, staticinvoke(class org.apache.spark.unsafe.types.UTF8String, StringType, fromString, knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$Person, true])).name, true, false) AS name#223, knownnotnull(assertnotnull(input[0, org.apache.spark.sql.test.SQLTestData$Person, true])).age AS age#224]
                                          +- ExternalRDD [obj#165]
```

### Does this PR introduce _any_ user-facing change?

Yes, users would no longer hit the error after this fix.

### How was this patch tested?

Added test.

Closes #29166 from Ngone51/impr-dedup.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-07-23 14:24:47 +00:00
Wenchen Fan aa54dcf193 [SPARK-32251][SQL][TESTS][FOLLOWUP] improve SQL keyword test
### What changes were proposed in this pull request?

Improve the `SQLKeywordSuite` so that:
1. it checks keywords under default mode as well
2. it checks if there are typos in the doc (found one and fixed in this PR)

### Why are the changes needed?

better test coverage

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

N/A

Closes #29200 from cloud-fan/test.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-07-23 14:02:38 +00:00
Dongjoon Hyun aed8dbab1d [SPARK-32364][SQL][FOLLOWUP] Add toMap to return originalMap and documentation
### What changes were proposed in this pull request?

This is a follow-up of https://github.com/apache/spark/pull/29160. We already removed the non-determinism there. This PR aims to do the following for the existing code base.
1. Add an explicit document to `DataFrameReader/DataFrameWriter`.

2. Add `toMap` to `CaseInsensitiveMap` so that it returns `originalMap: Map[String, T]`, because that is more consistent with the existing `case-sensitive key names` behavior for existing code patterns like `AppendData.byName(..., extraOptions.toMap)`. Previously, it was `HashMap.toMap`. (A sketch of this behavior follows after this list.)

3. During (2), we need to change the following to keep the original logic using `CaseInsensitiveMap.++`.
```scala
- val params = extraOptions.toMap ++ connectionProperties.asScala.toMap
+ val params = extraOptions ++ connectionProperties.asScala
```

4. Additionally, use `.toMap` in the following because `dsOptions.asCaseSensitiveMap()` is used later.
```scala
- val options = sessionOptions ++ extraOptions
+ val options = sessionOptions.filterKeys(!extraOptions.contains(_)) ++ extraOptions.toMap
  val dsOptions = new CaseInsensitiveStringMap(options.asJava)
```
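
A minimal sketch of the behavior described in item 2, using a simplified stand-in rather than Spark's actual `CaseInsensitiveMap` class: lookups are case-insensitive, but `toMap` hands back the original, case-preserving map.

```scala
// Simplified model of a case-insensitive map that remembers the original keys.
class SimpleCaseInsensitiveMap[T](val originalMap: Map[String, T]) {
  private val lowerCased: Map[String, T] =
    originalMap.map { case (k, v) => k.toLowerCase -> v }

  def get(key: String): Option[T] = lowerCased.get(key.toLowerCase)

  // Return the original map so callers such as AppendData.byName(..., options.toMap)
  // keep seeing the case-sensitive key names the user supplied.
  def toMap: Map[String, T] = originalMap
}

val m = new SimpleCaseInsensitiveMap(Map("Path" -> "/tmp/data"))
assert(m.get("PATH").contains("/tmp/data")) // case-insensitive lookup
assert(m.toMap.keySet == Set("Path"))       // original key casing preserved
```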

### Why are the changes needed?

`extraOptions.toMap` is used in several places (e.g. `DataFrameReader`) to hand over `Map[String, T]`. In this case, `CaseInsensitiveMap[T] private (val originalMap: Map[String, T])` had better return `originalMap`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the Jenkins or GitHub Action with the existing tests and newly add test case at `JDBCSuite`.

Closes #29191 from dongjoon-hyun/SPARK-32364-3.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-07-23 06:28:08 -07:00
LantaoJin 182566bf57 [SPARK-32237][SQL] Resolve hint in CTE
### What changes were proposed in this pull request?
This PR moves the `Substitution` rule before the `Hints` rule in `Analyzer` to fix hints in CTEs not working.

### Why are the changes needed?
The SQL below throws an AnalysisException in Spark 3.0, but it works in Spark 2.x:
```sql
WITH cte AS (SELECT /*+ REPARTITION(3) */ T.id, T.data FROM $t1 T)
SELECT cte.id, cte.data FROM cte
```
```
Failed to analyze query: org.apache.spark.sql.AnalysisException: cannot resolve '`cte.id`' given input columns: [cte.data, cte.id]; line 3 pos 7;
'Project ['cte.id, 'cte.data]
+- SubqueryAlias cte
   +- Project [id#21L, data#22]
      +- SubqueryAlias T
         +- SubqueryAlias testcat.ns1.ns2.tbl
            +- RelationV2[id#21L, data#22] testcat.ns1.ns2.tbl

'Project ['cte.id, 'cte.data]
+- SubqueryAlias cte
   +- Project [id#21L, data#22]
      +- SubqueryAlias T
         +- SubqueryAlias testcat.ns1.ns2.tbl
            +- RelationV2[id#21L, data#22] testcat.ns1.ns2.tbl
```

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Add a unit test

Closes #29062 from LantaoJin/SPARK-32237.

Authored-by: LantaoJin <jinlantao@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-07-23 03:10:45 +00:00
Takuya UESHIN 46169823c0 [SPARK-30616][SQL][FOLLOW-UP] Use only config key name in the config doc
### What changes were proposed in this pull request?

This is a follow-up of #28852.

This PR changes the doc to use only the config key name; otherwise, the doc for the config entry shows the entire details of the referenced configs.

### Why are the changes needed?

The doc for the newly introduced config entry shows the entire details of the referring configs.

### Does this PR introduce _any_ user-facing change?

The doc for the config entry will show only the referring config keys.

### How was this patch tested?

Existing tests.

Closes #29194 from ueshin/issues/SPARK-30616/fup.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-07-23 03:07:30 +00:00
ulysses 184074de22 [SPARK-31999][SQL] Add REFRESH FUNCTION command
### What changes were proposed in this pull request?

In Hive mode, permanent functions are shared with the Hive metastore, so a function may be modified by other Hive clients. In a long-lived Spark application, it's hard to pick up such changes.

Here are 2 reasons:
* Spark caches the function in memory using `FunctionRegistry`.
* Users may not know the location or class name of the UDF when using `replace function`.

Note that we use v2 command code path to add new command.

### Why are the changes needed?

Give an easy way to keep the Spark function registry in sync with the Hive metastore.
Then we can call
```
refresh function functionName
```

### Does this PR introduce _any_ user-facing change?

Yes, new command.

### How was this patch tested?

New UT.

Closes #28840 from ulysses-you/SPARK-31999.

Authored-by: ulysses <youxiduo@weidian.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-07-22 19:05:50 +00:00
Dongjoon Hyun cd16a10475 [SPARK-32364][SQL] Use CaseInsensitiveMap for DataFrameReader/Writer options
### What changes were proposed in this pull request?

When a user has multiple options like `path`, `paTH`, and `PATH` for the same key `path`, `option/options` is non-deterministic because `extraOptions` is a `HashMap`. This PR aims to use `CaseInsensitiveMap` instead of `HashMap` to fix this bug fundamentally.

### Why are the changes needed?

As shown below, DataFrame's `option/options` have been non-deterministic in terms of case-insensitivity because the options are stored in `extraOptions`, which uses the `HashMap` class.

```scala
spark.read
  .option("paTh", "1")
  .option("PATH", "2")
  .option("Path", "3")
  .option("patH", "4")
  .load("5")
...
org.apache.spark.sql.AnalysisException:
Path does not exist: file:/.../1;
```

### Does this PR introduce _any_ user-facing change?

Yes. However, this is a bug fix for the non-deterministic cases.

### How was this patch tested?

Pass the Jenkins or GitHub Action with newly added test cases.

Closes #29160 from dongjoon-hyun/SPARK-32364.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-07-22 07:58:45 -07:00
Takeshi Yamamuro 04bf3511f1 [SPARK-21117][SQL][FOLLOWUP] Define prettyName for WidthBucket
### What changes were proposed in this pull request?

This PR is to define prettyName for `WidthBucket`.
This comes from the gatorsmile's suggestion: https://github.com/apache/spark/pull/28764#discussion_r457802957

### Why are the changes needed?

For a better name.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #29183 from maropu/SPARK-21117-FOLLOWUP.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-07-22 02:51:30 -07:00
Cheng Su 39181ff209 [SPARK-32286][SQL] Coalesce bucketed table for shuffled hash join if applicable
### What changes were proposed in this pull request?
Based on a follow-up comment in https://github.com/apache/spark/pull/28123, we can coalesce buckets for shuffled hash join as well. Note that we only coalesce buckets on the shuffled hash join's stream side (i.e. the side not building the hash map), so we don't need to worry about OOM from coalescing multiple buckets into one task when building the hash map.

> If you refactor some codes with changing classes, showing the class hierarchy will help reviewers.

Refactor existing physical plan rule `CoalesceBucketsInSortMergeJoin` to `CoalesceBucketsInJoin`, for covering shuffled hash join as well.
Refactor existing unit test `CoalesceBucketsInSortMergeJoinSuite` to `CoalesceBucketsInJoinSuite`, for covering shuffled hash join as well.

### Why are the changes needed?
Avoiding the shuffle when joining differently bucketed tables is also useful for shuffled hash join. In production, we see users rely on shuffled hash join to join bucketed tables (setting `spark.sql.join.preferSortMergeJoin=false` to avoid the sort), and this change helps avoid the shuffle when the numbers of buckets are not the same.
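
The setup described above, in code form (run in spark-shell, where `spark` is the predefined session; the tables are the bucketed `store_sales`/`store_returns` tables used in the benchmark section below):

```scala
// Prefer shuffled hash join over sort merge join, then join two bucketed
// tables on their bucket column. With bucket coalescing, the shuffle can be
// avoided even when the two tables have different numbers of buckets.
spark.conf.set("spark.sql.join.preferSortMergeJoin", "false")

spark.sql(
  """SELECT ss_ticket_number, sr_ticket_number
    |FROM store_sales
    |JOIN store_returns
    |ON ss_ticket_number = sr_ticket_number""".stripMargin).explain()
```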

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Added unit tests in `CoalesceBucketsInJoinSuite` for verifying shuffled hash join physical plan.

### Performance number per request from maropu

I looked at TPCDS per maropu's suggestion, but I found that most TPCDS queries do aggregation and only a few do joins, and none of the input tables are bucketed. So I tested a modified version of `TPCDS q93`:

```
SELECT ss_ticket_number, sr_ticket_number
FROM store_sales
JOIN store_returns
ON ss_ticket_number = sr_ticket_number
```

And make `store_sales` and `store_returns` to be bucketed tables.

Physical query plan without coalesce:

```
ShuffledHashJoin [ss_ticket_number#109L], [sr_ticket_number#120L], Inner, BuildLeft
:- Exchange hashpartitioning(ss_ticket_number#109L, 4), true, [id=#67]
:  +- *(1) Project [ss_ticket_number#109L]
:     +- *(1) Filter isnotnull(ss_ticket_number#109L)
:        +- *(1) ColumnarToRow
:           +- FileScan parquet default.store_sales[ss_ticket_number#109L] Batched: true, DataFilters: [isnotnull(ss_ticket_number#109L)], Format: Parquet, Location: InMemoryFileIndex[file:/Users/chengsu/spark/spark-warehouse/store_sales], PartitionFilters: [], PushedFilters: [IsNotNull(ss_ticket_number)], ReadSchema: struct<ss_ticket_number:bigint>, SelectedBucketsCount: 2 out of 2
+- *(2) Project [sr_returned_date_sk#111L, sr_return_time_sk#112L, sr_item_sk#113L, sr_customer_sk#114L, sr_cdemo_sk#115L, sr_hdemo_sk#116L, sr_addr_sk#117L, sr_store_sk#118L, sr_reason_sk#119L, sr_ticket_number#120L, sr_return_quantity#121L, sr_return_amt#122, sr_return_tax#123, sr_return_amt_inc_tax#124, sr_fee#125, sr_return_ship_cost#126, sr_refunded_cash#127, sr_reversed_charge#128, sr_store_credit#129, sr_net_loss#130]
   +- *(2) Filter isnotnull(sr_ticket_number#120L)
      +- *(2) ColumnarToRow
         +- FileScan parquet default.store_returns[sr_returned_date_sk#111L,sr_return_time_sk#112L,sr_item_sk#113L,sr_customer_sk#114L,sr_cdemo_sk#115L,sr_hdemo_sk#116L,sr_addr_sk#117L,sr_store_sk#118L,sr_reason_sk#119L,sr_ticket_number#120L,sr_return_quantity#121L,sr_return_amt#122,sr_return_tax#123,sr_return_amt_inc_tax#124,sr_fee#125,sr_return_ship_cost#126,sr_refunded_cash#127,sr_reversed_charge#128,sr_store_credit#129,sr_net_loss#130] Batched: true, DataFilters: [isnotnull(sr_ticket_number#120L)], Format: Parquet, Location: InMemoryFileIndex[file:/Users/chengsu/spark/spark-warehouse/store_returns], PartitionFilters: [], PushedFilters: [IsNotNull(sr_ticket_number)], ReadSchema: struct<sr_returned_date_sk:bigint,sr_return_time_sk:bigint,sr_item_sk:bigint,sr_customer_sk:bigin..., SelectedBucketsCount: 4 out of 4
```

Physical query plan with coalesce:

```
ShuffledHashJoin [ss_ticket_number#109L], [sr_ticket_number#120L], Inner, BuildLeft
:- *(1) Project [ss_ticket_number#109L]
:  +- *(1) Filter isnotnull(ss_ticket_number#109L)
:     +- *(1) ColumnarToRow
:        +- FileScan parquet default.store_sales[ss_ticket_number#109L] Batched: true, DataFilters: [isnotnull(ss_ticket_number#109L)], Format: Parquet, Location: InMemoryFileIndex[file:/Users/chengsu/spark/spark-warehouse/store_sales], PartitionFilters: [], PushedFilters: [IsNotNull(ss_ticket_number)], ReadSchema: struct<ss_ticket_number:bigint>, SelectedBucketsCount: 2 out of 2
+- *(2) Project [sr_returned_date_sk#111L, sr_return_time_sk#112L, sr_item_sk#113L, sr_customer_sk#114L, sr_cdemo_sk#115L, sr_hdemo_sk#116L, sr_addr_sk#117L, sr_store_sk#118L, sr_reason_sk#119L, sr_ticket_number#120L, sr_return_quantity#121L, sr_return_amt#122, sr_return_tax#123, sr_return_amt_inc_tax#124, sr_fee#125, sr_return_ship_cost#126, sr_refunded_cash#127, sr_reversed_charge#128, sr_store_credit#129, sr_net_loss#130]
   +- *(2) Filter isnotnull(sr_ticket_number#120L)
      +- *(2) ColumnarToRow
         +- FileScan parquet default.store_returns[sr_returned_date_sk#111L,sr_return_time_sk#112L,sr_item_sk#113L,sr_customer_sk#114L,sr_cdemo_sk#115L,sr_hdemo_sk#116L,sr_addr_sk#117L,sr_store_sk#118L,sr_reason_sk#119L,sr_ticket_number#120L,sr_return_quantity#121L,sr_return_amt#122,sr_return_tax#123,sr_return_amt_inc_tax#124,sr_fee#125,sr_return_ship_cost#126,sr_refunded_cash#127,sr_reversed_charge#128,sr_store_credit#129,sr_net_loss#130] Batched: true, DataFilters: [isnotnull(sr_ticket_number#120L)], Format: Parquet, Location: InMemoryFileIndex[file:/Users/chengsu/spark/spark-warehouse/store_returns], PartitionFilters: [], PushedFilters: [IsNotNull(sr_ticket_number)], ReadSchema: struct<sr_returned_date_sk:bigint,sr_return_time_sk:bigint,sr_item_sk:bigint,sr_customer_sk:bigin..., SelectedBucketsCount: 4 out of 4 (Coalesced to 2)
```

Run time improvement as 50% of wall clock time:

```
Java HotSpot(TM) 64-Bit Server VM 1.8.0_181-b13 on Mac OS X 10.15.4
Intel(R) Core(TM) i9-9980HK CPU  2.40GHz
shuffle hash join:                        Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
shuffle hash join coalesce bucket off              1541           1664         106          1.9         535.1       1.0X
shuffle hash join coalesce bucket on               1060           1169          81          2.7         368.1       1.5X
```

Closes #29079 from c21/split-bucket.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-07-22 08:56:26 +09:00
Dongjoon Hyun 8c7d6f9733 [SPARK-32377][SQL] CaseInsensitiveMap should be deterministic for addition
### What changes were proposed in this pull request?

This PR aims to fix `CaseInsensitiveMap` to be deterministic for addition.

### Why are the changes needed?

```scala
import org.apache.spark.sql.catalyst.util.CaseInsensitiveMap
var m = CaseInsensitiveMap(Map.empty[String, String])
Seq(("paTh", "1"), ("PATH", "2"), ("Path", "3"), ("patH", "4"), ("path", "5")).foreach { kv =>
  m = (m + kv).asInstanceOf[CaseInsensitiveMap[String]]
  println(m.get("path"))
}
```

**BEFORE**
```
Some(1)
Some(2)
Some(3)
Some(4)
Some(1)
```

**AFTER**
```
Some(1)
Some(2)
Some(3)
Some(4)
Some(5)
```

### Does this PR introduce _any_ user-facing change?

Yes, but this is a bug fix on non-deterministic behavior.

### How was this patch tested?

Pass the newly added test case.

Closes #29172 from dongjoon-hyun/SPARK-32377.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-07-20 22:20:16 -07:00
gengjiaan 02114f96d6 [SPARK-32365][SQL] Add a boundary condition for negative index in regexp_extract
### What changes were proposed in this pull request?
The current implementation of `regexp_extract` throws an unhandled exception, shown below:
SELECT regexp_extract('1a 2b 14m', '\\d+', -1)

```
java.lang.IndexOutOfBoundsException: No group -1
java.util.regex.Matcher.group(Matcher.java:538)
org.apache.spark.sql.catalyst.expressions.RegExpExtract.nullSafeEval(regexpExpressions.scala:455)
org.apache.spark.sql.catalyst.expressions.TernaryExpression.eval(Expression.scala:704)
org.apache.spark.sql.catalyst.optimizer.ConstantFolding$$anonfun$apply$1$$anonfun$applyOrElse$1.applyOrElse(expressions.scala:52)
org.apache.spark.sql.catalyst.optimizer.ConstantFolding$$anonfun$apply$1$$anonfun$applyOrElse$1.applyOrElse(expressions.scala:45)
```

### Why are the changes needed?
Fix a bug `java.lang.IndexOutOfBoundsException: No group -1`

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
new UT

Closes #29161 from beliefer/regexp_extract-group-not-allow-less-than-zero.

Authored-by: gengjiaan <gengjiaan@360.cn>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-07-20 20:34:51 -07:00
Terry Kim e0ecb66f53 [SPARK-31869][SQL] BroadcastHashJoinExec can utilize the build side for its output partitioning
### What changes were proposed in this pull request?

Currently, the `BroadcastHashJoinExec`'s `outputPartitioning` only uses the streamed side's `outputPartitioning`. However, if the join type of `BroadcastHashJoinExec` is an inner-like join, the build side's info (the join keys) can be added to `BroadcastHashJoinExec`'s `outputPartitioning`.

 For example,
```Scala
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "500")
val t1 = (0 until 100).map(i => (i % 5, i % 13)).toDF("i1", "j1")
val t2 = (0 until 100).map(i => (i % 5, i % 13)).toDF("i2", "j2")
val t3 = (0 until 20).map(i => (i % 7, i % 11)).toDF("i3", "j3")
val t4 = (0 until 100).map(i => (i % 5, i % 13)).toDF("i4", "j4")

// join1 is a sort merge join.
val join1 = t1.join(t2, t1("i1") === t2("i2"))

// join2 is a broadcast join where t3 is broadcasted.
val join2 = join1.join(t3, join1("i1") === t3("i3"))

// Join on the column from the broadcasted side (i3).
val join3 = join2.join(t4, join2("i3") === t4("i4"))

join3.explain
```
You see that `Exchange hashpartitioning(i2#103, 200)` is introduced because there is no output partitioning info from the build side.
```
== Physical Plan ==
*(6) SortMergeJoin [i3#29], [i4#40], Inner
:- *(4) Sort [i3#29 ASC NULLS FIRST], false, 0
:  +- Exchange hashpartitioning(i3#29, 200), true, [id=#55]
:     +- *(3) BroadcastHashJoin [i1#7], [i3#29], Inner, BuildRight
:        :- *(3) SortMergeJoin [i1#7], [i2#18], Inner
:        :  :- *(1) Sort [i1#7 ASC NULLS FIRST], false, 0
:        :  :  +- Exchange hashpartitioning(i1#7, 200), true, [id=#28]
:        :  :     +- LocalTableScan [i1#7, j1#8]
:        :  +- *(2) Sort [i2#18 ASC NULLS FIRST], false, 0
:        :     +- Exchange hashpartitioning(i2#18, 200), true, [id=#29]
:        :        +- LocalTableScan [i2#18, j2#19]
:        +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint))), [id=#34]
:           +- LocalTableScan [i3#29, j3#30]
+- *(5) Sort [i4#40 ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(i4#40, 200), true, [id=#39]
      +- LocalTableScan [i4#40, j4#41]
```
This PR proposes to introduce output partitioning for the build side for `BroadcastHashJoinExec` if the streamed side has a `HashPartitioning` or a collection of `HashPartitioning`s.

There is a new internal config `spark.sql.execution.broadcastHashJoin.outputPartitioningExpandLimit`, which can limit the number of partitioning a `HashPartitioning` can expand to. It can be set to "0" to disable this feature.

### Why are the changes needed?

To remove unnecessary shuffle.

### Does this PR introduce _any_ user-facing change?

Yes, now the shuffle in the above example can be eliminated:
```
== Physical Plan ==
*(5) SortMergeJoin [i3#108], [i4#119], Inner
:- *(3) Sort [i3#108 ASC NULLS FIRST], false, 0
:  +- *(3) BroadcastHashJoin [i1#86], [i3#108], Inner, BuildRight
:     :- *(3) SortMergeJoin [i1#86], [i2#97], Inner
:     :  :- *(1) Sort [i1#86 ASC NULLS FIRST], false, 0
:     :  :  +- Exchange hashpartitioning(i1#86, 200), true, [id=#120]
:     :  :     +- LocalTableScan [i1#86, j1#87]
:     :  +- *(2) Sort [i2#97 ASC NULLS FIRST], false, 0
:     :     +- Exchange hashpartitioning(i2#97, 200), true, [id=#121]
:     :        +- LocalTableScan [i2#97, j2#98]
:     +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, false] as bigint))), [id=#126]
:        +- LocalTableScan [i3#108, j3#109]
+- *(4) Sort [i4#119 ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(i4#119, 200), true, [id=#130]
      +- LocalTableScan [i4#119, j4#120]
```

### How was this patch tested?

Added new tests.

Closes #28676 from imback82/broadcast_join_output.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-07-20 14:25:51 +00:00
Gengliang Wang d0c83f372b [SPARK-32302][SQL] Partially push down disjunctive predicates through Join/Partitions
### What changes were proposed in this pull request?

In https://github.com/apache/spark/pull/28733 and #28805, CNF conversion is used to push down disjunctive predicates through joins and for partition pruning.

It's a good improvement; however, converting all the predicates to CNF can lead to a very long result, even when grouping functions over expressions. For example, the following predicate
```
(p0 = '1' AND p1 = '1') OR (p0 = '2' AND p1 = '2') OR (p0 = '3' AND p1 = '3') OR (p0 = '4' AND p1 = '4') OR (p0 = '5' AND p1 = '5') OR (p0 = '6' AND p1 = '6') OR (p0 = '7' AND p1 = '7') OR (p0 = '8' AND p1 = '8') OR (p0 = '9' AND p1 = '9') OR (p0 = '10' AND p1 = '10') OR (p0 = '11' AND p1 = '11') OR (p0 = '12' AND p1 = '12') OR (p0 = '13' AND p1 = '13') OR (p0 = '14' AND p1 = '14') OR (p0 = '15' AND p1 = '15') OR (p0 = '16' AND p1 = '16') OR (p0 = '17' AND p1 = '17') OR (p0 = '18' AND p1 = '18') OR (p0 = '19' AND p1 = '19') OR (p0 = '20' AND p1 = '20')
```
will be converted into a very long query (130K characters) against the Hive metastore, and there will be an error:
```
javax.jdo.JDOException: Exception thrown when executing query : SELECT DISTINCT 'org.apache.hadoop.hive.metastore.model.MPartition' AS NUCLEUS_TYPE,A0.CREATE_TIME,A0.LAST_ACCESS_TIME,A0.PART_NAME,A0.PART_ID,A0.PART_NAME AS NUCORDER0 FROM PARTITIONS A0 LEFT OUTER JOIN TBLS B0 ON A0.TBL_ID = B0.TBL_ID LEFT OUTER JOIN DBS C0 ON B0.DB_ID = C0.DB_ID WHERE B0.TBL_NAME = ? AND C0."NAME" = ? AND ((((((A0.PART_NAME LIKE '%/p1=1' ESCAPE '\' ) OR (A0.PART_NAME LIKE '%/p1=2' ESCAPE '\' )) OR (A0.PART_NAME LIKE '%/p1=3' ESCAPE '\' )) OR ((A0.PART_NAME LIKE '%/p1=4' ESCAPE '\' ) O ...
```

Essentially, we just need to traverse the predicate and extract the convertible sub-predicates, as we did in https://github.com/apache/spark/pull/24598. There is no need to maintain the CNF result set.
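
A toy sketch of that extraction idea, using simplified data types rather than Catalyst's `Expression` API (everything below is illustrative): conjuncts that don't reference the target columns can be dropped, which yields a weaker but still correct predicate to push down, while a disjunction is only convertible if both branches are.

```scala
// Simplified predicate model (not Spark's catalyst.expressions API).
sealed trait Pred
case class Leaf(column: String, rest: String) extends Pred // e.g. Leaf("p0", "= '1'")
case class And(left: Pred, right: Pred) extends Pred
case class Or(left: Pred, right: Pred) extends Pred

// Extract a predicate that references only `cols` and is implied by `p`.
def extract(p: Pred, cols: Set[String]): Option[Pred] = p match {
  case Leaf(c, _) => if (cols.contains(c)) Some(p) else None
  case And(l, r) =>
    (extract(l, cols), extract(r, cols)) match {
      case (Some(a), Some(b)) => Some(And(a, b))
      case (a, b)             => a.orElse(b) // dropping a conjunct only weakens the predicate
    }
  case Or(l, r) =>
    // Both branches must contribute, otherwise nothing can be pushed down safely.
    for { a <- extract(l, cols); b <- extract(r, cols) } yield Or(a, b)
}

// (p0 = '1' AND p1 = '1') OR (p0 = '2' AND p1 = '2'), pushing only on p0:
val pred = Or(And(Leaf("p0", "= '1'"), Leaf("p1", "= '1'")),
              And(Leaf("p0", "= '2'"), Leaf("p1", "= '2'")))
assert(extract(pred, Set("p0")) ==
  Some(Or(Leaf("p0", "= '1'"), Leaf("p0", "= '2'"))))
```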

### Why are the changes needed?

A better implementation for pushing down disjunctive and complex predicates. The pushed-down predicate is always equal in size to or shorter than the CNF result.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit tests

Closes #29101 from gengliangwang/pushJoin.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-07-20 14:17:31 +00:00
Anton Okolnychyi 0aca1a6ed4 [SPARK-32276][SQL] Remove redundant sorts before repartition nodes
### What changes were proposed in this pull request?

This PR proposes to remove redundant sorts before repartition nodes whenever the data is ordered after the repartitioning.

### Why are the changes needed?

It looks like our `EliminateSorts` rule can be extended further to remove sorts before repartition nodes that don't affect the final output ordering. It seems safe to perform the following rewrites:

- `Sort -> Repartition -> Sort -> Scan` as `Sort -> Repartition -> Scan`
- `Sort -> Repartition -> Project -> Sort -> Scan` as `Sort -> Repartition -> Project -> Scan`
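
A minimal sketch of the first rewrite in DataFrame terms (run in spark-shell, where `spark` is the predefined session; column names are illustrative):

```scala
import spark.implicits._

val df = Seq((3, "c"), (1, "a"), (2, "b")).toDF("a", "b")

// Plan shape: Sort(b) -> Repartition -> Sort(a) -> Scan.
// The inner orderBy("a") is redundant: repartition(10) reshuffles the rows and
// the outer orderBy("b") determines the final ordering anyway, so the rule is
// expected to drop the inner Sort from the optimized plan.
df.orderBy("a").repartition(10).orderBy("b").explain(true)
```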

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

More test cases.

Closes #29089 from aokolnychyi/spark-32276.

Authored-by: Anton Okolnychyi <aokolnychyi@apple.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-07-19 12:11:26 -07:00
Takeshi Yamamuro c7a68a920d [SPARK-32344][SQL] Unevaluable expr is set to FIRST/LAST ignoreNullsExpr in distinct aggregates
### What changes were proposed in this pull request?

This PR intends to fix a bug of distinct FIRST/LAST aggregates in v2.4.6/v3.0.0/master;
```
scala> sql("SELECT FIRST(DISTINCT v) FROM VALUES 1, 2, 3 t(v)").show()
...
Caused by: java.lang.UnsupportedOperationException: Cannot evaluate expression: false#37
  at org.apache.spark.sql.catalyst.expressions.Unevaluable$class.eval(Expression.scala:258)
  at org.apache.spark.sql.catalyst.expressions.AttributeReference.eval(namedExpressions.scala:226)
  at org.apache.spark.sql.catalyst.expressions.aggregate.First.ignoreNulls(First.scala:68)
  at org.apache.spark.sql.catalyst.expressions.aggregate.First.updateExpressions$lzycompute(First.scala:82)
  at org.apache.spark.sql.catalyst.expressions.aggregate.First.updateExpressions(First.scala:81)
  at org.apache.spark.sql.execution.aggregate.HashAggregateExec$$anonfun$15.apply(HashAggregateExec.scala:268)
```
A root cause of this bug is that the `Aggregation` strategy replaces a foldable boolean `ignoreNullsExpr` expr with an `Unevaluable` expr (`AttributeReference`) for distinct FIRST/LAST aggregate functions. But this operation cannot be allowed because the `Analyzer` has checked that it must be foldable;
ffdbbae1d4/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/First.scala (L74-L76)
So, this PR proposes to change the variable for `IGNORE NULLS` from `Expression` to `Boolean` to avoid this case.

### Why are the changes needed?

Bugfix.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added a test in `DataFrameAggregateSuite`.

Closes #29143 from maropu/SPARK-32344.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-07-19 11:11:42 +09:00
Sudharshann D f9f9309bec [SPARK-31579][SQL] replaced floorDiv to Div
### What changes were proposed in this pull request?

Replaced `floorDiv` with plain `/` in `localRebaseGregorianToJulianDays()` in `spark/sql/catalyst/util/RebaseDateTime.scala`.

### Why are the changes needed?

Easier to understand the logic/code, and slightly more efficient.

### Does this PR introduce _any_ user-facing change?

NO

### How was this patch tested?

Proof of concept [here](https://github.com/apache/spark/pull/28573/files). The operation `utcCal.getTimeInMillis / MILLIS_PER_DAY` results in an integer value already.
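
For reference, a standalone check (constants are illustrative) of where the two operations coincide: `Math.floorDiv` and `/` agree for non-negative dividends with a positive divisor, and differ only for negative dividends.

```scala
val MILLIS_PER_DAY = 24L * 60 * 60 * 1000

val millis = 1594944000000L // a non-negative epoch value in milliseconds
assert(Math.floorDiv(millis, MILLIS_PER_DAY) == millis / MILLIS_PER_DAY)

// For negative dividends the two differ: floorDiv rounds toward negative
// infinity, while / truncates toward zero.
assert(Math.floorDiv(-1L, MILLIS_PER_DAY) == -1L)
assert(-1L / MILLIS_PER_DAY == 0L)
```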

Closes #29008 from Sudhar287/SPARK-31579.

Authored-by: Sudharshann D <sudhar287@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-07-18 13:04:58 -05:00
Yaroslav Tkachenko 34baed8139 [SPARK-30616][SQL] Introduce TTL config option for SQL Metadata Cache
### What changes were proposed in this pull request?
New `spark.sql.metadataCacheTTLSeconds` option that adds time-to-live cache behaviour to the existing caches in `FileStatusCache` and `SessionCatalog`.

### Why are the changes needed?
Currently Spark [caches file listing for tables](https://spark.apache.org/docs/2.4.4/sql-data-sources-parquet.html#metadata-refreshing) and requires issuing `REFRESH TABLE` any time the file listing has changed outside of Spark. Unfortunately, simply submitting `REFRESH TABLE` commands could be very cumbersome. Assuming frequently added files, hundreds of tables and dozens of users querying the data (and expecting up-to-date results), manually refreshing metadata for each table is not a solution.

This is a pretty common use-case for streaming ingestion of data, which can be done outside of Spark (with tools like Kafka Connect, etc.).

A similar feature exists in Presto: `hive.file-status-cache-expire-time` can be found [here](https://prestosql.io/docs/current/connector/hive.html#hive-configuration-properties).

### Does this PR introduce _any_ user-facing change?
Yes, it's controlled with the new `spark.sql.metadataCacheTTLSeconds` option.

When it's set to `-1` (by default), the behaviour of caches doesn't change, so it stays _backwards-compatible_.

Otherwise, you can specify a value in seconds, for example `spark.sql.metadataCacheTTLSeconds: 60` means 1-minute cache TTL.
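
For example, a session configured with a 1-minute TTL might look like this (a sketch; the application name and local master are illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("metadata-cache-ttl-demo")
  .master("local[*]") // for local testing
  .config("spark.sql.metadataCacheTTLSeconds", "60") // expire cached file listings after 1 minute
  .getOrCreate()
```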

### How was this patch tested?

Added new tests in:

- FileIndexSuite
- SessionCatalogSuite

Closes #28852 from sap1ens/SPARK-30616-metadata-cache-ttl.

Authored-by: Yaroslav Tkachenko <sapiensy@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-07-17 13:40:54 +00:00
Maxim Gekk c1f160e097 [SPARK-30648][SQL] Support filters pushdown in JSON datasource
### What changes were proposed in this pull request?
In this PR, I propose to support pushed-down filters in the JSON datasource. The reason for pushing a filter down to `JacksonParser` is to apply the filter as soon as all its attributes become available, i.e. are converted from JSON field values to the desired values according to the schema. This allows skipping the parsing of the rest of the JSON record and the conversion of other values if the filter returns `false`. This can improve performance when pushed filters are highly selective and the conversion of JSON string fields to the desired values is comparably expensive (for example, the conversion to `TIMESTAMP` values).
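
A query shape that can benefit (run in spark-shell; the path and column names are illustrative): with an explicit schema and a selective filter, the predicate on `id` can be evaluated inside the parser as soon as the `id` field has been converted, skipping the rest of the record when it evaluates to `false`.

```scala
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val schema = StructType(Seq(
  StructField("id", IntegerType),
  StructField("payload", StringType)))

spark.read
  .schema(schema)
  .json("/tmp/events.json") // illustrative path
  .filter("id = 42")        // candidate for pushdown into JacksonParser
  .show()
```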

The main idea behind `JsonFilters` is to group pushdown filters by their references, convert the grouped filters to expressions, and then compile them to predicates. The predicates are indexed by schema field positions. Each predicate keeps a reference counter of row fields that have not been set yet. As soon as the counter reaches `0`, the predicate can be applied to the row because all of its dependencies have been set. Before processing a new row, each predicate's reference counter is reset to the total number of its references (dependencies in a row).

The common code shared between `CSVFilters` and `JsonFilters` is moved to the `StructFilters` class and its companion object.

### Why are the changes needed?
The changes improve performance on synthetic benchmarks up to **27 times** on JDK 8 and **25** times on JDK 11:
```
OpenJDK 64-Bit Server VM 1.8.0_242-8u242-b08-0ubuntu3~18.04-b08 on Linux 4.15.0-1044-aws
Intel(R) Xeon(R) CPU E5-2670 v2  2.50GHz
Filters pushdown:                         Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
w/o filters                                       25230          25255          22          0.0      252299.6       1.0X
pushdown disabled                                 25248          25282          33          0.0      252475.6       1.0X
w/ filters                                          905            911           8          0.1        9047.9      27.9X
```

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
- Added new test suites `JsonFiltersSuite` and `JacksonParserSuite`.
- By new end-to-end and case sensitivity tests in `JsonSuite`.
- By `CSVFiltersSuite`, `UnivocityParserSuite` and `CSVSuite`.
- Re-running `CSVBenchmark` and `JsonBenchmark` using Amazon EC2:

| Item | Description |
| ---- | ----|
| Region | us-west-2 (Oregon) |
| Instance | r3.xlarge (spot instance) |
| AMI | ami-06f2f779464715dc5 (ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1) |
| Java | OpenJDK8/11 installed by`sudo add-apt-repository ppa:openjdk-r/ppa` & `sudo apt install openjdk-11-jdk`|

and `./dev/run-benchmarks`:
```python
#!/usr/bin/env python3

import os
from sparktestsupport.shellutils import run_cmd

benchmarks = [
    ['sql/test', 'org.apache.spark.sql.execution.datasources.csv.CSVBenchmark'],
    ['sql/test', 'org.apache.spark.sql.execution.datasources.json.JsonBenchmark']
]

print('Set SPARK_GENERATE_BENCHMARK_FILES=1')
os.environ['SPARK_GENERATE_BENCHMARK_FILES'] = '1'

for b in benchmarks:
    print("Run benchmark: %s" % b[1])
    run_cmd(['build/sbt', '%s:runMain %s' % (b[0], b[1])])
```

Closes #27366 from MaxGekk/json-filters-pushdown.

Lead-authored-by: Maxim Gekk <max.gekk@gmail.com>
Co-authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-07-17 00:01:13 +09:00
Kent Yao bdeb626c5a [SPARK-32272][SQL] Add SQL standard command SET TIME ZONE
### What changes were proposed in this pull request?

This PR adds the SQL standard command `SET TIME ZONE`, which sets the default time zone displacement for the current SQL session; it is equivalent to the existing `set spark.sql.session.timeZone=xxx`.
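
In spark-shell terms (where `spark` is the predefined session; the zone name is illustrative), the two forms described above as equivalent look like this:

```scala
// New SQL standard syntax added by this PR.
spark.sql("SET TIME ZONE 'Asia/Shanghai'")

// Existing configuration-based equivalent.
spark.conf.set("spark.sql.session.timeZone", "Asia/Shanghai")
```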

All in all, this PR adds the following syntax:

```
SET TIME ZONE LOCAL;
SET TIME ZONE 'valid time zone';  -- zone offset or region
SET TIME ZONE INTERVAL XXXX; -- XXXX must be in [-18, +18] hours; this range is bigger than the ANSI range [-14, +14]
```

### Why are the changes needed?

ANSI compliance and supply pure SQL users a way to retrieve all supported TimeZones

### Does this PR introduce _any_ user-facing change?

Yes, it adds new syntax.

### How was this patch tested?

add unit tests.

and locally verified reference doc

![image](https://user-images.githubusercontent.com/8326978/87510244-c8dc3680-c6a5-11ea-954c-b098be84afee.png)

Closes #29064 from yaooqinn/SPARK-32272.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-07-16 13:01:53 +00:00
Erik Krogen cf22d947fb [SPARK-32036] Replace references to blacklist/whitelist language with more appropriate terminology, excluding the blacklisting feature
### What changes were proposed in this pull request?

This PR will remove references to these "blacklist" and "whitelist" terms besides the blacklisting feature as a whole, which can be handled in a separate JIRA/PR.

This touches quite a few files, but the changes are straightforward (variable/method/etc. name changes) and most quite self-contained.

### Why are the changes needed?

As per discussion on the Spark dev list, it will be beneficial to remove references to problematic language that can alienate potential community members. One such reference is "blacklist" and "whitelist". While it seems to me that there is some valid debate as to whether these terms have racist origins, the cultural connotations are inescapable in today's world.

### Does this PR introduce _any_ user-facing change?

In the test file `HiveQueryFileTest`, a developer has the ability to specify the system property `spark.hive.whitelist` to specify a list of Hive query files that should be tested. This system property has been renamed to `spark.hive.includelist`. The old property has been kept for compatibility, but will log a warning if used. I am open to feedback from others on whether keeping a deprecated property here is unnecessary given that this is just for developers running tests.

### How was this patch tested?

Existing tests should be suitable since no behavior changes are expected as a result of this PR.

Closes #28874 from xkrogen/xkrogen-SPARK-32036-rename-blacklists.

Authored-by: Erik Krogen <ekrogen@linkedin.com>
Signed-off-by: Thomas Graves <tgraves@apache.org>
2020-07-15 11:40:55 -05:00
Dongjoon Hyun 8950dcbb1c [SPARK-32318][SQL][TESTS] Add a test case to EliminateSortsSuite for ORDER BY in DISTRIBUTE BY
### What changes were proposed in this pull request?

This PR aims to add a test case to EliminateSortsSuite to protect a valid use case: using ORDER BY inside a DISTRIBUTE BY statement.

### Why are the changes needed?

```scala
scala> scala.util.Random.shuffle((1 to 100000).map(x => (x % 2, x))).toDF("a", "b").repartition(2).createOrReplaceTempView("t")

scala> sql("select * from (select * from t order by b) distribute by a").write.orc("/tmp/master")

$ ls -al /tmp/master/
total 56
drwxr-xr-x  10 dongjoon  wheel  320 Jul 14 22:12 ./
drwxrwxrwt  15 root      wheel  480 Jul 14 22:12 ../
-rw-r--r--   1 dongjoon  wheel    8 Jul 14 22:12 ._SUCCESS.crc
-rw-r--r--   1 dongjoon  wheel   12 Jul 14 22:12 .part-00000-2cd3a50e-eded-49a4-b7cf-94e3f090b8c1-c000.snappy.orc.crc
-rw-r--r--   1 dongjoon  wheel   16 Jul 14 22:12 .part-00043-2cd3a50e-eded-49a4-b7cf-94e3f090b8c1-c000.snappy.orc.crc
-rw-r--r--   1 dongjoon  wheel   16 Jul 14 22:12 .part-00191-2cd3a50e-eded-49a4-b7cf-94e3f090b8c1-c000.snappy.orc.crc
-rw-r--r--   1 dongjoon  wheel    0 Jul 14 22:12 _SUCCESS
-rw-r--r--   1 dongjoon  wheel  119 Jul 14 22:12 part-00000-2cd3a50e-eded-49a4-b7cf-94e3f090b8c1-c000.snappy.orc
-rw-r--r--   1 dongjoon  wheel  932 Jul 14 22:12 part-00043-2cd3a50e-eded-49a4-b7cf-94e3f090b8c1-c000.snappy.orc
-rw-r--r--   1 dongjoon  wheel  939 Jul 14 22:12 part-00191-2cd3a50e-eded-49a4-b7cf-94e3f090b8c1-c000.snappy.orc
```

The following was found during SPARK-32276: if the Spark optimizer removes the inner `ORDER BY`, the file size increases.
```scala
scala> scala.util.Random.shuffle((1 to 100000).map(x => (x % 2, x))).toDF("a", "b").repartition(2).createOrReplaceTempView("t")

scala> sql("select * from (select * from t order by b) distribute by a").write.orc("/tmp/SPARK-32276")

$ ls -al /tmp/SPARK-32276/
total 632
drwxr-xr-x  10 dongjoon  wheel     320 Jul 14 22:08 ./
drwxrwxrwt  14 root      wheel     448 Jul 14 22:08 ../
-rw-r--r--   1 dongjoon  wheel       8 Jul 14 22:08 ._SUCCESS.crc
-rw-r--r--   1 dongjoon  wheel      12 Jul 14 22:08 .part-00000-ba5049f9-b835-49b7-9fdb-bdd11b9891cb-c000.snappy.orc.crc
-rw-r--r--   1 dongjoon  wheel    1188 Jul 14 22:08 .part-00043-ba5049f9-b835-49b7-9fdb-bdd11b9891cb-c000.snappy.orc.crc
-rw-r--r--   1 dongjoon  wheel    1188 Jul 14 22:08 .part-00191-ba5049f9-b835-49b7-9fdb-bdd11b9891cb-c000.snappy.orc.crc
-rw-r--r--   1 dongjoon  wheel       0 Jul 14 22:08 _SUCCESS
-rw-r--r--   1 dongjoon  wheel     119 Jul 14 22:08 part-00000-ba5049f9-b835-49b7-9fdb-bdd11b9891cb-c000.snappy.orc
-rw-r--r--   1 dongjoon  wheel  150735 Jul 14 22:08 part-00043-ba5049f9-b835-49b7-9fdb-bdd11b9891cb-c000.snappy.orc
-rw-r--r--   1 dongjoon  wheel  150741 Jul 14 22:08 part-00191-ba5049f9-b835-49b7-9fdb-bdd11b9891cb-c000.snappy.orc
```

### Does this PR introduce _any_ user-facing change?

No. This only improves the test coverage.

### How was this patch tested?

Pass the GitHub Action or Jenkins.

Closes #29118 from dongjoon-hyun/SPARK-32318.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-07-15 07:43:56 -07:00
Dilip Biswal e4499932da [SPARK-31480][SQL] Improve the EXPLAIN FORMATTED's output for DSV2's Scan Node
### What changes were proposed in this pull request?
Improve the EXPLAIN FORMATTED output of DSV2 Scan nodes (file based ones).

**Before**
```
== Physical Plan ==
* Project (4)
+- * Filter (3)
   +- * ColumnarToRow (2)
      +- BatchScan (1)

(1) BatchScan
Output [2]: [value#7, id#8]
Arguments: [value#7, id#8], ParquetScan(org.apache.spark.sql.test.TestSparkSession17477bbb,Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml, __spark_hadoop_conf__.xml,org.apache.spark.sql.execution.datasources.InMemoryFileIndexa6c363ce,StructType(StructField(value,IntegerType,true)),StructType(StructField(value,IntegerType,true)),StructType(StructField(id,IntegerType,true)),[Lorg.apache.spark.sql.sources.Filter;40fee459,org.apache.spark.sql.util.CaseInsensitiveStringMapfeca1ec6,Vector(isnotnull(id#8), (id#8 > 1)),List(isnotnull(value#7), (value#7 > 2)))
(2) ...
(3) ...
(4) ...
```
**After**
```
== Physical Plan ==
* Project (4)
+- * Filter (3)
   +- * ColumnarToRow (2)
      +- BatchScan (1)

(1) BatchScan
Output [2]: [value#7, id#8]
DataFilters: [isnotnull(value#7), (value#7 > 2)]
Format: parquet
Location: InMemoryFileIndex[....]
PartitionFilters: [isnotnull(id#8), (id#8 > 1)]
PushedFilers: [IsNotNull(id), IsNotNull(value), GreaterThan(id,1), GreaterThan(value,2)]
ReadSchema: struct<value:int>
(2) ...
(3) ...
(4) ...
```
### Why are the changes needed?
The old format is not very readable. This improves the readability of the plan.

### Does this PR introduce any user-facing change?
Yes. the explain output will be different.

### How was this patch tested?
Added a test case in ExplainSuite.

Closes #28425 from dilipbiswal/dkb_dsv2_explain.

Lead-authored-by: Dilip Biswal <dkbiswal@gmail.com>
Co-authored-by: Dilip Biswal <dkbiswal@apache.org>
Signed-off-by: Dilip Biswal <dkbiswal@apache.org>
2020-07-15 01:28:39 -07:00
Dongjoon Hyun 2527fbc896 Revert "[SPARK-32276][SQL] Remove redundant sorts before repartition nodes"
This reverts commit af8e65fca9.
2020-07-14 22:14:31 -07:00
Jungtaek Lim (HeartSaVioR) 542aefb4c4 [SPARK-31985][SS] Remove incomplete/undocumented stateful aggregation in continuous mode
### What changes were proposed in this pull request?

This removes the undocumented and incomplete "stateful aggregation" feature in continuous mode, which reduces the code base by 1100+ lines.

### Why are the changes needed?

Work on the feature has been stopped for over a year, and no one has asked or requested the availability of such a feature in the community. The current state of the feature is that it only works with `coalesce(1)`, which forces the query to read, process, and write in a single task, which doesn't make sense in production.

The remaining code also increases the work needed for DSv2 changes. That's why I don't simply propose reverting the relevant commits: the code path has changed due to DSv2 evolution.

### Does this PR introduce _any_ user-facing change?

Technically no, because it was never documented and can't be used in production in its current shape.

### How was this patch tested?

Existing tests.

Closes #29077 from HeartSaVioR/SPARK-31985.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
2020-07-15 13:40:43 +09:00
Anton Okolnychyi af8e65fca9 [SPARK-32276][SQL] Remove redundant sorts before repartition nodes
### What changes were proposed in this pull request?

This PR removes redundant sorts before repartition nodes with shuffles and repartitionByExpression with deterministic expressions.

### Why are the changes needed?

It looks like our `EliminateSorts` rule can be extended further to remove sorts before repartition nodes that shuffle data as such repartition operations change the ordering and distribution of data. That's why it seems safe to perform the following rewrites:
- `Repartition -> Sort -> Scan` as `Repartition -> Scan`
- `Repartition -> Project -> Sort -> Scan` as `Repartition -> Project -> Scan`

We don't apply this optimization to coalesce as it uses `DefaultPartitionCoalescer` that may preserve the ordering of data if there is no locality info in the parent RDD. At the same time, there is no guarantee that will happen.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

More test cases.

Closes #29089 from aokolnychyi/spark-32276.

Authored-by: Anton Okolnychyi <aokolnychyi@apple.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-07-14 21:17:33 -07:00
yi.wu a47b69a88a [SPARK-32307][SQL] ScalaUDF's canonicalized expression should exclude inputEncoders
### What changes were proposed in this pull request?

Override `canonicalized` to empty the `inputEncoders` for the canonicalized `ScalaUDF`.

### Why are the changes needed?

The following currently fails on `branch-3.0`, but not on the Apache Spark 3.0.0 release.

```scala
spark.udf.register("key", udf((m: Map[String, String]) => m.keys.head.toInt))
Seq(Map("1" -> "one", "2" -> "two")).toDF("a").createOrReplaceTempView("t")
checkAnswer(sql("SELECT key(a) AS k FROM t GROUP BY key(a)"), Row(1) :: Nil)

[info]   org.apache.spark.sql.AnalysisException: expression 't.`a`' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;;
[info] Aggregate [UDF(a#6)], [UDF(a#6) AS k#8]
[info] +- SubqueryAlias t
[info]    +- Project [value#3 AS a#6]
[info]       +- LocalRelation [value#3]
[info]   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis(CheckAnalysis.scala:49)
[info]   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis$(CheckAnalysis.scala:48)
[info]   at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:130)
[info]   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkValidAggregateExpression$1(CheckAnalysis.scala:257)
[info]   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$10(CheckAnalysis.scala:259)
[info]   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$10$adapted(CheckAnalysis.scala:259)
[info]   at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
[info]   at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
[info]   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
[info]   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkValidAggregateExpression$1(CheckAnalysis.scala:259)
[info]   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$10(CheckAnalysis.scala:259)
[info]   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$10$adapted(CheckAnalysis.scala:259)
[info]   at scala.collection.immutable.List.foreach(List.scala:392)
[info]   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkValidAggregateExpression$1(CheckAnalysis.scala:259)
...
```

We use the rule `ResolveEncodersInUDF` to resolve `inputEncoders`, and the original `ScalaUDF` instance is updated to a new `ScalaUDF` instance with the resolved encoders at the end. Note that during encoder resolution, types like `map` and `array` are resolved to new expressions (e.g. `MapObjects`, `CatalystToExternalMap`).

However, `ExpressionEncoder` can't be canonicalized. Thus, the canonicalized `ScalaUDF`s become different even if their original `ScalaUDF`s are the same. Finally, this fails `checkValidAggregateExpression` when such a `ScalaUDF` is used as a grouping expression.

### Does this PR introduce _any_ user-facing change?

Yes, users will not hit the exception after this fix.

### How was this patch tested?

Added tests.

Closes #29106 from Ngone51/spark-32307.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-07-14 12:19:01 -07:00
Sean Owen d6a68e0b67 [SPARK-29292][STREAMING][SQL][BUILD] Get streaming, catalyst, sql compiling for Scala 2.13
### What changes were proposed in this pull request?

Continuation of https://github.com/apache/spark/pull/28971 which lets streaming, catalyst and sql compile for 2.13. Same idea.

### Why are the changes needed?

Eventually, we need to support a Scala 2.13 build, perhaps in Spark 3.1.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests. (2.13 was not tested; this is about getting it to compile without breaking 2.12)

Closes #29078 from srowen/SPARK-29292.2.

Authored-by: Sean Owen <srowen@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-07-14 02:06:50 -07:00
Liang-Chi Hsieh cc9371d885 [SPARK-32258][SQL] Not duplicate normalization on children for float/double If/CaseWhen/Coalesce
### What changes were proposed in this pull request?

This is a follow-up to #29061. See https://github.com/apache/spark/pull/29061#discussion_r453458611. Basically, this moves the If/CaseWhen/Coalesce case patterns after the float/double case so we don't duplicate normalization on children for float/double If/CaseWhen/Coalesce.

### Why are the changes needed?

Simplify expression tree.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Modify unit tests.

Closes #29091 from viirya/SPARK-32258-followup.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-07-14 05:51:59 +00:00
Peter Toth 24be81689c [SPARK-32241][SQL] Remove empty children of union
### What changes were proposed in this pull request?
This PR removes the empty child relations of a `Union`.

E.g. the query `SELECT c FROM t UNION ALL SELECT c FROM t WHERE false` has the following plan before this PR:
```
== Physical Plan ==
Union
:- *(1) Project [value#219 AS c#222]
:  +- *(1) LocalTableScan [value#219]
+- LocalTableScan <empty>, [c#224]
```
and after this PR:
```
== Physical Plan ==
*(1) Project [value#219 AS c#222]
+- *(1) LocalTableScan [value#219]
```

### Why are the changes needed?
To have a simpler plan.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Added new UTs.

Closes #29053 from peter-toth/SPARK-32241-remove-empty-children-of-union.

Authored-by: Peter Toth <peter.toth@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-07-14 04:26:29 +00:00
Liang-Chi Hsieh b6229df16c [SPARK-32258][SQL] NormalizeFloatingNumbers directly normalizes IF/CaseWhen/Coalesce child expressions
### What changes were proposed in this pull request?

This patch proposes to let the `NormalizeFloatingNumbers` rule directly normalize certain child expressions. This simplifies the expression tree.

### Why are the changes needed?

Currently the NormalizeFloatingNumbers rule treats some expressions as black boxes, but we can optimize it a bit by directly normalizing the inner child expressions.

Also see https://github.com/apache/spark/pull/28962#discussion_r448526240.
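
For background, a standalone check (not part of this PR) of why float/double normalization is needed at all: `-0.0` and `0.0` compare equal as doubles but have different bit patterns, which matters for hashing, grouping, and join keys.

```scala
assert(-0.0 == 0.0) // numerically equal
assert(java.lang.Double.doubleToRawLongBits(-0.0) !=
       java.lang.Double.doubleToRawLongBits(0.0)) // but different binary representations
```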

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit tests.

Closes #29061 from viirya/SPARK-32258.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-07-12 15:34:43 -07:00
Michael Chirico 6ae400ccbe [MINOR][SQL][DOCS] consistency in argument naming for time functions
### What changes were proposed in this pull request?

Rename documented argument `format` as `fmt`, to match the same argument name in several other SQL date/time functions, to wit, `date_format`, `date_trunc`, `trunc`, `to_date`, and `to_timestamp` all use `fmt`. Also `format_string` and `printf` use the same abbreviation in their argument `strfmt`.

### Why are the changes needed?

Consistency -- I was trying to scour the documentation for functions with arguments using Java string formatting; it would have been nice to rely on searching for `fmt` instead of my more manual approach.

### Does this PR introduce _any_ user-facing change?

In the documentation only

### How was this patch tested?

No tests

Closes #29007 from MichaelChirico/sql-doc-format-fmt.

Authored-by: Michael Chirico <michael.chirico@grabtaxi.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-07-12 09:53:27 -05:00
yi.wu 004aea8155 [SPARK-32154][SQL] Use ExpressionEncoder for the return type of ScalaUDF to convert to catalyst type
### What changes were proposed in this pull request?

This PR proposes to use `ExpressionEncoder` for the return type of ScalaUDF to convert to the catalyst type, instead of using `CatalystTypeConverters`.

Note that this change only takes effect for typed Scala UDFs, since that is the only case where we know the type tag of the raw type.

### Why are the changes needed?

Users can currently register a UDF with `Instant`/`LocalDate` as the return type even with `spark.sql.datetime.java8API.enabled=false`. However, the UDF cannot really be used.
For example, if we try:

```scala
scala> sql("set spark.sql.datetime.java8API.enabled=false")
scala> spark.udf.register("buildDate", udf{ d: String => java.time.LocalDate.parse(d) })
scala> Seq("2020-07-02").toDF("d").selectExpr("CAST(buildDate(d) AS STRING)").show
```
Then, we will hit the error:
```scala
java.lang.ClassCastException: java.time.LocalDate cannot be cast to java.sql.Date
  at org.apache.spark.sql.catalyst.CatalystTypeConverters$DateConverter$.toCatalystImpl(CatalystTypeConverters.scala:304)
  at org.apache.spark.sql.catalyst.CatalystTypeConverters$CatalystTypeConverter.toCatalyst(CatalystTypeConverters.scala:107)
  at org.apache.spark.sql.catalyst.CatalystTypeConverters$.$anonfun$createToCatalystConverter$2(CatalystTypeConverters.scala:425)
  at org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:1169)
...
```
as it actually requires enabling `spark.sql.datetime.java8API.enabled` when using the UDF, and I think this could confuse users.

This happens because, when registering the UDF, Spark actually uses `ExpressionEncoder` to ser/deser types. However, when using the UDF, Spark uses `CatalystTypeConverters`, which is controlled by `spark.sql.datetime.java8API.enabled`, to ser/deser types. Therefore, Spark fails to convert the Java 8 date/time types.

If we also use `ExpressionEncoder` to ser/deser the return type, similar to what we do for the input parameter types, then UDFs can support Instant/LocalDate, and even other combined complex types as well.

### Does this PR introduce _any_ user-facing change?

Yes. Before this PR, if users run the demo above, they would hit the error. After this PR, the demo will run successfully.

### How was this patch tested?

Updated 2 tests and added a new one for combined types of `Instant` and `LocalDate`.

Closes #28979 from Ngone51/udf-return-encoder.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-07-12 09:45:18 +09:00
Max Gekk 22f9dfb25a [SPARK-32173][SQL] Deduplicate code in FromUTCTimestamp and ToUTCTimestamp
### What changes were proposed in this pull request?
- Extract common code from the `FromUTCTimestamp` and `ToUTCTimestamp` expressions to new trait `UTCTimestamp`.
- Move `ToUTCTimestamp` closer to `FromUTCTimestamp`

### Why are the changes needed?
Code deduplication improves maintainability.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running date-time test suites such as `DateFunctionsSuite`

Closes #28998 from MaxGekk/dedup-from-to-utc-timestamp.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-07-12 00:16:37 +09:00
yi.wu 0c9196e549 [SPARK-32238][SQL] Use Utils.getSimpleName to avoid hitting Malformed class name in ScalaUDF
### What changes were proposed in this pull request?

This PR proposes to use `Utils.getSimpleName(function)` instead of `function.getClass.getSimpleName` to get the class name.

### Why are the changes needed?

For some functions (see the demo below), using `function.getClass.getSimpleName` can hit the "Malformed class name" error.

### Does this PR introduce _any_ user-facing change?

Yes.  For the demo,

```scala
 object MalformedClassObject extends Serializable {
    class MalformedNonPrimitiveFunction extends (String => Int) with Serializable {
      override def apply(v1: String): Int = v1.toInt / 0
    }
  }
 OuterScopes.addOuterScope(MalformedClassObject)
 val f = new MalformedClassObject.MalformedNonPrimitiveFunction()
 Seq("20").toDF("col").select(udf(f).apply(Column("col"))).collect()
```

Before this PR, users could only see the "Malformed class name" error:

```scala
An exception or error caused a run to abort: Malformed class name
java.lang.InternalError: Malformed class name
	at java.lang.Class.getSimpleName(Class.java:1330)
	at org.apache.spark.sql.catalyst.expressions.ScalaUDF.udfErrorMessage$lzycompute(ScalaUDF.scala:1157)
	at org.apache.spark.sql.catalyst.expressions.ScalaUDF.udfErrorMessage(ScalaUDF.scala:1155)
	at org.apache.spark.sql.catalyst.expressions.ScalaUDF.doGenCode(ScalaUDF.scala:1077)
	at org.apache.spark.sql.catalyst.expressions.Expression.$anonfun$genCode$3(Expression.scala:147)
	at scala.Option.getOrElse(Option.scala:189)
	at org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:142)
	at org.apache.spark.sql.catalyst.expressions.Alias.genCode(namedExpressions.scala:160)
	at org.apache.spark.sql.execution.ProjectExec.$anonfun$doConsume$1(basicPhysicalOperators.scala:69)
        ...
```

After this PR, users can see the real root cause of the UDF failure:

```scala
org.apache.spark.SparkException: Failed to execute user defined function(UDFSuite$MalformedClassObject$MalformedNonPrimitiveFunction: (string) => int)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:753)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:340)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:127)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:464)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:467)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ArithmeticException: / by zero
	at org.apache.spark.sql.UDFSuite$MalformedClassObject$MalformedNonPrimitiveFunction.apply(UDFSuite.scala:677)
	at org.apache.spark.sql.UDFSuite$MalformedClassObject$MalformedNonPrimitiveFunction.apply(UDFSuite.scala:676)
	... 17 more

```
### How was this patch tested?

Added a test.

Closes #29050 from Ngone51/fix-malformed-udf.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-07-11 06:27:56 -07:00
Wenchen Fan 84db660ebe [SPARK-32251][SQL][DOCS][TESTS] Fix SQL keyword document
### What changes were proposed in this pull request?

This PR improves the test to make sure all the SQL keywords are documented correctly. It fixes several issues:
1. some keywords are not documented
2. some keywords are not ANSI SQL keywords but documented as reserved/non-reserved.

### Why are the changes needed?

To make sure the implementation matches the doc.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

new test

Closes #29055 from cloud-fan/keyword.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-07-10 15:10:28 -07:00
TJX2014 500877e785 [SPARK-32133][SQL] Forbid time field steps for date start/end in Sequence
### What changes were proposed in this pull request?
1. Add a check for time-field steps with date start/end in Sequence at `org.apache.spark.sql.catalyst.expressions.Sequence.TemporalSequenceImpl`
2. Add a UT: `SPARK-32133: Sequence step must be a day interval if start and end values are dates` at `org.apache.spark.sql.catalyst.expressions.CollectionExpressionsSuite`

### Why are the changes needed?
**Sequence with time-field steps for date start/end looks strange in Spark, as follows:**
```
scala> sql("select explode(sequence(cast('2011-03-01' as date), cast('2011-03-02' as date), interval 1 hour))").head(3)
res0: Array[org.apache.spark.sql.Row] = _Array([2011-03-01], [2011-03-01], [2011-03-01])_ **<- strange result.**

scala> sql("select explode(sequence(cast('2011-03-01' as date), cast('2011-03-02' as date), interval 1 day))").head(3)
res1: Array[org.apache.spark.sql.Row] = Array([2011-03-01], [2011-03-02])
```

**While this behavior in Presto makes sense:**
```
presto> select sequence(date('2011-03-01'),date('2011-03-02'),interval '1' hour);
Query 20200624_122744_00002_pehix failed: sequence step must be a day interval if start and end values are dates
presto> select sequence(date('2011-03-01'),date('2011-03-02'),interval '1' day);
_col0
[2011-03-01, 2011-03-02]
```

### Does this PR introduce _any_ user-facing change?
Yes. After this patch, users will be informed with `sequence step must be a day interval if start and end values are dates` when they use time-field steps for date start/end in Sequence.

### How was this patch tested?
Unit test.

Closes #28926 from TJX2014/master-SPARK-31982-sequence-cross-dst-follow-presto.

Authored-by: TJX2014 <xiaoxingstack@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-07-10 11:06:52 -07:00
Kent Yao 4609f1fdab [SPARK-32207][SQL] Support 'F'-suffixed Float Literals
### What changes were proposed in this pull request?

In this PR, I propose we support the 'f'-suffixed float literal, e.g. `select 1.1f`.

### Why are the changes needed?

This is a very common feature across platforms; checked with PostgreSQL, Presto, Hive, MySQL, etc.

### Does this PR introduce _any_ user-facing change?

yes,

`select 1.1f` now results in the float value 1.1 instead of throwing an AnalysisException: `Can't extract value from 1: need struct type but got int;`

### How was this patch tested?

add unit tests

Closes #29022 from yaooqinn/SPARK-32207.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-07-09 19:45:16 -07:00
HyukjinKwon 01e9dd9050 [SPARK-20680][SQL][FOLLOW-UP] Revert NullType.simpleString from 'unknown' to 'null'
### What changes were proposed in this pull request?

This PR proposes to partially revert the change to the simple string of `NullType` at https://github.com/apache/spark/pull/28833: `NullType.simpleString` goes back from `unknown` to `null`.

### Why are the changes needed?

- Technically speaking, it's orthogonal with the issue itself, SPARK-20680.
- It needs some more discussion, see https://github.com/apache/spark/pull/28833#issuecomment-655277714

### Does this PR introduce _any_ user-facing change?

It reverts back the user-facing changes at https://github.com/apache/spark/pull/28833.
The simple string of `NullType` is back to `null`.

### How was this patch tested?

I just logically reverted. Jenkins should test it out.

Closes #29041 from HyukjinKwon/SPARK-20680.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-07-09 19:44:08 -07:00
Dilip Biswal 18aae21d96 [SPARK-31875][SQL] Provide a option to disable user supplied Hints
### What changes were proposed in this pull request?
Introduce a new SQL config `spark.sql.optimizer.ignoreHints`. When this is set to true, the application of hints is disabled. This is similar to Oracle's OPTIMIZER_IGNORE_HINTS.
This can be helpful to study the performance difference when hints are applied vs. when they are not.
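
For example, a hedged self-contained sketch of how the config can be used to compare plans (the view names are illustrative):

```scala
// Two small temp views, then a hinted join; with the config enabled,
// the BROADCAST hint below is ignored during optimization.
spark.range(100).createOrReplaceTempView("t1")
spark.range(100).createOrReplaceTempView("t2")
spark.conf.set("spark.sql.optimizer.ignoreHints", "true")
spark.sql("SELECT /*+ BROADCAST(t2) */ * FROM t1 JOIN t2 ON t1.id = t2.id").explain()
```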

### Why are the changes needed?
It can be helpful to study the performance difference when hints are applied vs. when they are not.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
New tests added in ResolveHintsSuite.

Closes #28683 from dilipbiswal/disable_hint.

Authored-by: Dilip Biswal <dkbiswal@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-07-09 18:27:07 -07:00
Erik Erlandson 1cb5bfc47a [SPARK-32159][SQL] Fix integration between Aggregator[Array[_], _, _] and UnresolvedMapObjects
Context: The fix for SPARK-27296 introduced by #25024 allows `Aggregator` objects to appear in queries. This works fine for aggregators with atomic input types, e.g. `Aggregator[Double, _, _]`.

However, it can cause a null pointer exception if the input type is `Array[_]`. This was historically considered an ignorable case for serialization of `UnresolvedMapObjects`, but the new ScalaAggregator class causes these expressions to be serialized over to executors because the resolve-and-bind is being deferred.

### What changes were proposed in this pull request?
A new rule `ResolveEncodersInScalaAgg` that performs the resolution of the expressions contained in the encoders so that properly resolved expressions are serialized over to executors.

### Why are the changes needed?
Applying an aggregator of the form `Aggregator[Array[_], _, _]` using `functions.udaf()` currently causes a null pointer error in Catalyst.
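
As a hedged sketch of the affected shape (the aggregator and its registration name are illustrative), an `Aggregator` with an `Array[_]` input type registered through `functions.udaf()` looks like this:

```scala
import org.apache.spark.sql.{Encoder, Encoders}
import org.apache.spark.sql.expressions.Aggregator
import org.apache.spark.sql.functions.udaf

// Sums all elements of an Array[Double] column.
object ArraySum extends Aggregator[Array[Double], Double, Double] {
  def zero: Double = 0.0
  def reduce(b: Double, a: Array[Double]): Double = b + a.sum
  def merge(b1: Double, b2: Double): Double = b1 + b2
  def finish(r: Double): Double = r
  def bufferEncoder: Encoder[Double] = Encoders.scalaDouble
  def outputEncoder: Encoder[Double] = Encoders.scalaDouble
}

spark.udf.register("array_sum", udaf(ArraySum))
// e.g. spark.sql("SELECT array_sum(values) FROM events")  // illustrative usage, table not defined here
```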

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
A unit test has been added that does aggregation with array types for input, buffer, and output. I have done additional testing with my own custom aggregators in the spark REPL.

Closes #28983 from erikerlandson/fix-spark-32159.

Authored-by: Erik Erlandson <eerlands@redhat.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-07-09 08:42:20 +00:00
Wenchen Fan 8c5bee599d [SPARK-28067][SPARK-32018] Fix decimal overflow issues
### What changes were proposed in this pull request?

This is a followup of https://github.com/apache/spark/pull/27627 to fix the remaining issues. There are 2 issues fixed in this PR:
1. `UnsafeRow.setDecimal` can set an overflowed decimal and causes an error when reading it. The expected behavior is to return null.
2. The update/merge expression for decimal type in `Sum` is wrong. We shouldn't turn the `sum` value back to 0 after it becomes null due to overflow. This issue was hidden because:
2.1 for hash aggregate, the buffer is unsafe row. Due to the first bug, we fail when overflow happens, so there is no chance to mistakenly turn null back to 0.
2.2 for sort-based aggregate, the buffer is generic row. The decimal can overflow (the Decimal class has unlimited precision) and we don't have the null problem.

If we only fix the first bug, then the second bug is exposed and test fails. If we only fix the second bug, there is no way to test it. This PR fixes these 2 bugs together.
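
A hedged sketch of the overflow scenario for the second issue (values chosen only to force overflow under non-ANSI mode):

```scala
// Two copies of the max DECIMAL(38, 0) value; their sum overflows DECIMAL(38, 0),
// so the expected result after this fix is NULL rather than a wrong value.
val maxDec38 = "9" * 38
spark.sql(s"SELECT sum(v) FROM VALUES (${maxDec38}BD), (${maxDec38}BD) AS t(v)").show(false)
```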

### Why are the changes needed?

Fix issues during decimal sum when overflow happens

### Does this PR introduce _any_ user-facing change?

Yes. Now decimal sum can return null correctly for overflow under non-ansi mode.

### How was this patch tested?

new test and updated test

Closes #29026 from cloud-fan/decimal.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-07-09 15:56:40 +09:00
Ryan Blue 3bb1ac597a [SPARK-32168][SQL] Fix hidden partitioning correctness bug in SQL overwrite
### What changes were proposed in this pull request?

When converting an `INSERT OVERWRITE` query to a v2 overwrite plan, Spark attempts to detect when a dynamic overwrite and a static overwrite will produce the same result so it can use the static overwrite. Spark incorrectly detects when dynamic and static overwrites are equivalent when there are hidden partitions, such as `days(ts)`.

This updates the analyzer rule `ResolveInsertInto` to always use a dynamic overwrite when the mode is dynamic, and static when the mode is static. This avoids the problem by not trying to determine whether the two plans are equivalent and always using the one that corresponds to the partition overwrite mode.

### Why are the changes needed?

This is a correctness bug. If a table has hidden partitions, all of the values for those partitions are dropped instead of dynamically overwriting changed partitions.

This only affects SQL commands (not `DataFrameWriter`) writing to tables that have hidden partitions. It is also only a problem when the partition overwrite mode is dynamic.
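
A hedged sketch of the affected shape (the catalog, provider, table, and source names are illustrative and assume a configured v2 catalog):

```scala
// A v2 table with a hidden partition transform, written via SQL INSERT OVERWRITE
// while the partition overwrite mode is dynamic.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")
spark.sql("CREATE TABLE testcat.db.events (id BIGINT, ts TIMESTAMP) USING foo PARTITIONED BY (days(ts))")
spark.sql("INSERT OVERWRITE testcat.db.events SELECT id, ts FROM updates")
// Before the fix, partitions untouched by `updates` could be dropped; after it,
// only the changed partitions are overwritten.
```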

### Does this PR introduce _any_ user-facing change?

Yes, it fixes the correctness bug detailed above.

### How was this patch tested?

* This updates the in-memory table to support a hidden partition transform, `days`, and adds a test case to `DataSourceV2SQLSuite` in which the table uses this hidden partition function. This test fails without the fix to `ResolveInsertInto`.
* This updates the test case `InsertInto: overwrite - multiple static partitions - dynamic mode` in `InsertIntoTests`. The result of the SQL command is unchanged, but the SQL command will now use a dynamic overwrite so the test now uses `dynamicOverwriteTest`.

Closes #28993 from rdblue/fix-insert-overwrite-v2-conversion.

Authored-by: Ryan Blue <blue@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-07-08 16:06:40 -07:00
ulysses 65286aec4b [SPARK-30703][SQL][FOLLOWUP] Update SqlBase.g4 invalid comment
### What changes were proposed in this pull request?

Modify the comment of `SqlBase.g4`.

### Why are the changes needed?

`docs/sql-keywords.md` has already moved to `docs/sql-ref-ansi-compliance.md#sql-keywords`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

No need.

Closes #29033 from ulysses-you/SPARK-30703-FOLLOWUP.

Authored-by: ulysses <youxiduo@weidian.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-07-08 11:30:47 +09:00
LantaoJin b5297c43b0 [SPARK-20680][SQL] Spark-sql do not support for creating table with void column datatype
### What changes were proposed in this pull request?

This is the new PR which to address the close one #17953

1. Support the "void" primitive data type in the `AstBuilder`, pointing it to `NullType`
2. Forbid creating tables with a VOID/NULL column type

### Why are the changes needed?

1. Spark is incompatible with Hive's void type. When a Hive table schema contains the void type, DESC table throws an exception in Spark.

>hive> create table bad as select 1 x, null z from dual;
>hive> describe bad;
OK
x	int
z	void

In Spark 2.0.x, the behavior when reading this view is normal:
>spark-sql> describe bad;
x       int     NULL
z       void    NULL
Time taken: 4.431 seconds, Fetched 2 row(s)

But in the latest Spark version, it fails with `SparkException: Cannot recognize hive type string: void`:

>spark-sql> describe bad;
17/05/09 03:12:08 ERROR thriftserver.SparkSQLDriver: Failed in [describe bad]
org.apache.spark.SparkException: Cannot recognize hive type string: void
Caused by: org.apache.spark.sql.catalyst.parser.ParseException:
DataType void() is not supported.(line 1, pos 0)
== SQL ==
void
^^^
        ... 61 more
org.apache.spark.SparkException: Cannot recognize hive type string: void

2. Hive CTAS statements throw an error when the select clause has a NULL/VOID type column, since HIVE-11217.
In Spark, creating a table with a VOID/NULL column should throw a readable exception message. This includes (see the sketch after this list):

- create data source table (using parquet, json, ...)
- create hive table (with or without stored as)
- CTAS
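
A hedged sketch of statements that should now fail with a readable error message (table names are illustrative):

```scala
spark.sql("CREATE TABLE bad_ds (v VOID) USING parquet")        // data source table
spark.sql("CREATE TABLE bad_hive (v VOID) STORED AS parquet")  // Hive table (requires Hive support)
spark.sql("CREATE TABLE bad_ctas AS SELECT null AS v")         // CTAS with a NULL column
```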

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Add unit tests

Closes #28833 from LantaoJin/SPARK-20680_COPY.

Authored-by: LantaoJin <jinlantao@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-07-07 18:58:01 -07:00
Max Gekk 1261fac674 [SPARK-31710][SQL][FOLLOWUP] Allow cast numeric to timestamp by default
### What changes were proposed in this pull request?
1. Set the SQL config `spark.sql.legacy.allowCastNumericToTimestamp` to `true` by default
2. Remove explicit sets of `spark.sql.legacy.allowCastNumericToTimestamp` to `true` in the cast suites.

### Why are the changes needed?
To avoid breaking changes in minor versions (in the upcoming Spark 3.1.0), according to the semantic versioning guidelines (https://spark.apache.org/versioning-policy.html).

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
By `CastSuite`.

Closes #29012 from MaxGekk/allow-cast-numeric-to-timestamp.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-07-07 14:09:40 -07:00
Liang-Chi Hsieh 90b9099064 [SPARK-32163][SQL] Nested pruning should work even with cosmetic variations
### What changes were proposed in this pull request?

This patch proposes to deal with cosmetic variations when processing nested column extractors in `NestedColumnAliasing`. Currently if cosmetic variations are in the nested column extractors, the query is not optimized.

### Why are the changes needed?

If the expressions extracting nested fields have cosmetic variations such as a qualifier difference, nested column pruning currently cannot work well.

For example, when two attributes that are semantically the same are referred to in a query, their nested column extractors are treated differently when we deal with nested column pruning.

### Does this PR introduce _any_ user-facing change?

Yes, fixing a bug in nested column pruning.

### How was this patch tested?

Unit test.

Closes #28988 from viirya/SPARK-32163.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-07-07 11:17:53 -07:00
fqaiser94@gmail.com 4bbc343a4c [SPARK-31317][SQL] Add withField method to Column
### What changes were proposed in this pull request?

Added a new `withField` method to the `Column` class. This method should allow users to add or replace a `StructField` in a `StructType` column (with very similar semantics to the `withColumn` method on `Dataset`).

### Why are the changes needed?

Often Spark users have to work with deeply nested data e.g. to fix a data quality issue with an existing `StructField`. To do this with the existing Spark APIs, users have to rebuild the entire struct column.

For example, let's say you have the following deeply nested data structure which has a data quality issue (`5` is missing):
```
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val data = spark.createDataFrame(sc.parallelize(
      Seq(Row(Row(Row(1, 2, 3), Row(Row(4, null, 6), Row(7, 8, 9), Row(10, 11, 12)), Row(13, 14, 15))))),
      StructType(Seq(
        StructField("a", StructType(Seq(
          StructField("a", StructType(Seq(
            StructField("a", IntegerType),
            StructField("b", IntegerType),
            StructField("c", IntegerType)))),
          StructField("b", StructType(Seq(
            StructField("a", StructType(Seq(
              StructField("a", IntegerType),
              StructField("b", IntegerType),
              StructField("c", IntegerType)))),
            StructField("b", StructType(Seq(
              StructField("a", IntegerType),
              StructField("b", IntegerType),
              StructField("c", IntegerType)))),
            StructField("c", StructType(Seq(
              StructField("a", IntegerType),
              StructField("b", IntegerType),
              StructField("c", IntegerType))))
          ))),
          StructField("c", StructType(Seq(
            StructField("a", IntegerType),
            StructField("b", IntegerType),
            StructField("c", IntegerType))))
        )))))).cache

data.show(false)
+---------------------------------+
|a                                |
+---------------------------------+
|[[1, 2, 3], [[4,, 6], [7, 8, 9]]]|
+---------------------------------+
```
Currently, to replace the missing value users would have to do something like this:
```
val result = data.withColumn("a",
  struct(
    $"a.a",
    struct(
      struct(
        $"a.b.a.a",
        lit(5).as("b"),
        $"a.b.a.c"
      ).as("a"),
      $"a.b.b",
      $"a.b.c"
    ).as("b"),
    $"a.c"
  ))

result.show(false)
+---------------------------------------------------------------+
|a                                                              |
+---------------------------------------------------------------+
|[[1, 2, 3], [[4, 5, 6], [7, 8, 9], [10, 11, 12]], [13, 14, 15]]|
+---------------------------------------------------------------+
```
As you can see above, with the existing methods users must call the `struct` function and list all fields, including fields they don't want to change. This is not ideal as:
>this leads to complex, fragile code that cannot survive schema evolution.
[SPARK-16483](https://issues.apache.org/jira/browse/SPARK-16483)

In contrast, with the method added in this PR, a user could simply do something like this:
```
val result = data.withColumn("a", 'a.withField("b.a.b", lit(5)))
result.show(false)
+---------------------------------------------------------------+
|a                                                              |
+---------------------------------------------------------------+
|[[1, 2, 3], [[4, 5, 6], [7, 8, 9], [10, 11, 12]], [13, 14, 15]]|
+---------------------------------------------------------------+

```

This is the first of maybe a few methods that could be added to the `Column` class to make it easier to manipulate nested data. Other methods under discussion in [SPARK-22231](https://issues.apache.org/jira/browse/SPARK-22231) include `drop` and `renameField`. However, these should be added in a separate PR.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

New unit tests were added. Jenkins must pass them.

### Related JIRAs:
- https://issues.apache.org/jira/browse/SPARK-22231
- https://issues.apache.org/jira/browse/SPARK-16483

Closes #27066 from fqaiser94/SPARK-22231-withField.

Lead-authored-by: fqaiser94@gmail.com <fqaiser94@gmail.com>
Co-authored-by: fqaiser94 <fqaiser94@gmail.com>
Co-authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-07-07 16:34:03 +00:00
Max Gekk 75d342858a [SPARK-32209][SQL] Re-use GetTimestamp in ParseToDate
### What changes were proposed in this pull request?
Replace the combination of expressions `SecondsToTimestamp` and `UnixTimestamp` by `GetTimestamp` in `ParseToDate`.

### Why are the changes needed?
Eliminate unnecessary parsing overhead in: **string -> timestamp -> long (seconds) -> timestamp -> date**. After the changes, the chain will be: **string -> timestamp -> date**.
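
For reference, a hedged example of a query that takes this path, i.e. `to_date` with an explicit pattern (the literal values are illustrative):

```scala
spark.sql("SELECT to_date('2020-07-07 07:26:40', 'yyyy-MM-dd HH:mm:ss')").show()
```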

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By existing test suites such as `DateFunctionsSuite`.

Closes #28999 from MaxGekk/ParseToDate-parse-timestamp.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-07-07 07:26:40 -07:00
ulysses 2e23da2bda [SPARK-31975][SQL] Show AnalysisException when WindowFunction is used without WindowExpression
### What changes were proposed in this pull request?

Add WindowFunction check at `CheckAnalysis`.

### Why are the changes needed?
Provide a friendly error message.

**BEFORE**
```scala
scala> sql("select rank() from values(1)").show
java.lang.UnsupportedOperationException: Cannot generate code for expression: rank()
```

**AFTER**
```scala
scala> sql("select rank() from values(1)").show
org.apache.spark.sql.AnalysisException: Window function rank() requires an OVER clause.;;
Project [rank() AS RANK()#3]
+- LocalRelation [col1#2]
```

### Does this PR introduce _any_ user-facing change?

Yes, users will be given a better error message.

### How was this patch tested?

Pass the newly added UT.

Closes #28808 from ulysses-you/SPARK-31975.

Authored-by: ulysses <youxiduo@weidian.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-07-07 13:39:04 +00:00
Wenchen Fan 5d296ed39e [SPARK-32167][SQL] Fix GetArrayStructFields to respect inner field's nullability together
### What changes were proposed in this pull request?

Fix nullability of `GetArrayStructFields`. It should consider both the original array's `containsNull` and the inner field's nullability.
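
A minimal hedged sketch of the shape in question, where both sources of nullability matter (the column name is illustrative):

```scala
// The array itself contains a null element, so the extracted array of field values
// must report containsNull = true even if the inner field `id` is non-nullable.
val df = spark.sql("SELECT array(named_struct('id', 1), null) AS arr")
df.selectExpr("arr.id").printSchema()
```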

### Why are the changes needed?

Fix a correctness issue.

### Does this PR introduce _any_ user-facing change?

Yes. See the added test.

### How was this patch tested?

a new UT and end-to-end test

Closes #28992 from cloud-fan/bug.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-07-06 20:07:33 -07:00
stczwd f082a7996a [SPARK-31100][SQL] Check namespace existence when setting namespace
### What changes were proposed in this pull request?
Check the namespace existence while calling "use namespace", and throw NoSuchNamespaceException if the namespace does not exist.
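
A hedged example (the namespace name is illustrative):

```scala
spark.sql("USE nonexistent_ns")  // should now fail with NoSuchNamespaceException
```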

### Why are the changes needed?
Users need to know that the namespace does not exist when they try to set a wrong namespace.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Run all suites and add a test for this

Closes #27900 from stczwd/SPARK-31100.

Authored-by: stczwd <qcsd2011@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-07-02 14:49:40 +00:00
Liang-Chi Hsieh 3f7780d30d [SPARK-32136][SQL] NormalizeFloatingNumbers should work on null struct
### What changes were proposed in this pull request?

This patch fixes a wrong groupBy result when the grouping key is a null-valued struct.

### Why are the changes needed?

`NormalizeFloatingNumbers` reconstructs a struct if the input expression is of StructType. If the input struct is null, it reconstructs a struct with null-valued fields instead of null.
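
A hedged sketch of the failing shape, i.e. grouping by a nullable struct key that contains a floating-point field (names are illustrative):

```scala
val df = spark.sql(
  "SELECT IF(id > 0, named_struct('f', CAST(id AS DOUBLE)), NULL) AS key FROM range(3)")
// The NULL key should stay a single NULL group instead of becoming a struct of null fields.
df.groupBy("key").count().show(false)
```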

### Does this PR introduce _any_ user-facing change?

Yes, fixing incorrect groupBy result.

### How was this patch tested?

Unit test.

Closes #28962 from viirya/SPARK-32136.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-07-02 13:56:43 +09:00
Liang-Chi Hsieh 2a52a1b300 [SPARK-32056][SQL][FOLLOW-UP] Coalesce partitions for repartiotion hint and sql when AQE is enabled
### What changes were proposed in this pull request?

As a follow-up of #28900, this patch extends partition coalescing to repartitioning using hints and SQL syntax without specifying the number of partitions, when AQE is enabled.

### Why are the changes needed?

When repartitioning using hints and SQL syntax, we should follow the shuffling behavior of repartition by expression/range and coalesce partitions when AQE is enabled.
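
A hedged self-contained sketch of the syntax covered by this follow-up (the view and column names are illustrative):

```scala
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.range(1000).selectExpr("id", "id % 10 AS key").createOrReplaceTempView("t")
spark.sql("SELECT /*+ REPARTITION(key) */ * FROM t").explain()  // hint without a partition number
spark.sql("SELECT * FROM t DISTRIBUTE BY key").explain()        // SQL syntax without a number
// Both are now eligible for partition coalescing, like repartition by expression.
```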

### Does this PR introduce _any_ user-facing change?

Yes. After this change, if users don't specify the number of partitions when repartitioning using `REPARTITION`/`REPARTITION_BY_RANGE` hint or `DISTRIBUTE BY`/`CLUSTER BY`, AQE will coalesce partitions.

### How was this patch tested?

Unit tests.

Closes #28952 from viirya/SPARK-32056-sql.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-07-01 16:14:51 -07:00
Max Gekk bcf23307f4 [SPARK-32130][SQL] Disable the JSON option inferTimestamp by default
### What changes were proposed in this pull request?
Set the JSON option `inferTimestamp` to `false` if the user doesn't pass it as a datasource option.

### Why are the changes needed?
To prevent a performance regression while inferring schemas from JSON with potential timestamp fields.
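
A hedged example of explicitly opting back in to timestamp inference (the path is illustrative):

```scala
val df = spark.read.option("inferTimestamp", "true").json("/tmp/events.json")
```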

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
- Modified existing tests in `JsonSuite` and `JsonInferSchemaSuite`.
- Regenerated results of `JsonBenchmark` in the environment:

| Item | Description |
| ---- | ----|
| Region | us-west-2 (Oregon) |
| Instance | r3.xlarge |
| AMI | ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1 (ami-06f2f779464715dc5) |
| Java | OpenJDK 64-Bit Server VM 1.8.0_252 and OpenJDK 64-Bit Server VM 11.0.7+10 |

Closes #28966 from MaxGekk/json-inferTimestamps-disable-by-default.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-07-01 15:45:39 -07:00
Wenchen Fan 7dbd90b68d [SPARK-31797][SQL][FOLLOWUP] TIMESTAMP_SECONDS supports fractional input
### What changes were proposed in this pull request?

This is a followup of https://github.com/apache/spark/pull/28534 , to make `TIMESTAMP_SECONDS` function support fractional input as well.

### Why are the changes needed?

Previously, the cast function could cast fractional values to timestamp. Now we suggest users use these new functions, and we need to cover all the cast use cases.
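
A hedged example of the newly accepted fractional input (the value is illustrative):

```scala
spark.sql("SELECT TIMESTAMP_SECONDS(1230219000.123)").show(false)
```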

### Does this PR introduce _any_ user-facing change?

Yes, now `TIMESTAMP_SECONDS` function accepts fractional input.

### How was this patch tested?

new tests

Closes #28956 from cloud-fan/follow.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-07-01 08:03:46 -07:00
angerszhu 15fb5d7677 [SPARK-28169][SQL] Convert scan predicate condition to CNF
### What changes were proposed in this pull request?
Spark can't push down a scan predicate condition containing **Or**:
For example, if I have a table `default.test` whose partition column is `dt`,
and we use the query:
```
select * from default.test
where dt=20190625 or (dt = 20190626 and id in (1,2,3) )
```

In this case, Spark resolves the **Or** condition as a single expression, and since this expression references "id", it can't be pushed down.

Based on PR https://github.com/apache/spark/pull/28733, in my PR, for SQL like
`select * from default.test`
`where dt = 20190626 or (dt = 20190627 and xxx="a")`

This condition, `dt = 20190626 or (dt = 20190627 and xxx="a")`, will be converted to CNF:
```
(dt = 20190626 or dt = 20190627) and (dt = 20190626 or xxx = "a" )
```
then the condition `dt = 20190626 or dt = 20190627` will be pushed down during partition pruning.

### Why are the changes needed?
Optimize partition pruning

### Does this PR introduce _any_ user-facing change?
NO

### How was this patch tested?
Added UT

Closes #28805 from AngersZhuuuu/cnf-for-partition-pruning.

Lead-authored-by: angerszhu <angers.zhu@gmail.com>
Co-authored-by: AngersZhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-07-01 12:00:15 +00:00
Micah Kornfield a4ba34434c [SPARK-32095][SQL] Update documentation to reflect usage of updated statistics
### What changes were proposed in this pull request?
Update documentation to reflect changes in faf220aad9

I've changed the documentation to reflect that updated statistics may be used to improve the query plan.

### Why are the changes needed?
I believe the documentation is stale and misleading.

### Does this PR introduce _any_ user-facing change?
Yes, this is a javadoc documentation fix.

### How was this patch tested?
Doc fix.

Closes #28925 from emkornfield/spark-32095.

Authored-by: Micah Kornfield <micahk@google.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-07-01 07:13:10 +00:00
GuoPhilipse 02f3b80d3a [SPARK-32131][SQL] Fix AnalysisException messages at UNION/EXCEPT/MINUS operations
### What changes were proposed in this pull request?
Fix incorrect exception messages for Union and set operations.

### Why are the changes needed?
Union and set operations can only be performed on tables with compatible column types, but when we have more than two columns, the exception messages contain the wrong column index.

Steps to reproduce:

```
drop table if exists test1;
drop table if exists test2;
drop table if exists test3;
create table if not exists test1(id int, age int, name timestamp);
create table if not exists test2(id int, age timestamp, name timestamp);
create table if not exists test3(id int, age int, name int);
insert into test1 select 1,2,'2020-01-01 01:01:01';
insert into test2 select 1,'2020-01-01 01:01:01','2020-01-01 01:01:01';
insert into test3 select 1,3,4;
```

Query1:
```sql
select * from test1 except select * from test2;
```
Result1:
```
Error: org.apache.spark.sql.AnalysisException: Except can only be performed on tables with the compatible column types. timestamp <> int at the second column of the second table;; 'Except false :- Project [id#620, age#621, name#622] : +- SubqueryAlias `default`.`test1` : +- HiveTableRelation `default`.`test1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [id#620, age#621, name#622] +- Project [id#623, age#624, name#625] +- SubqueryAlias `default`.`test2` +- HiveTableRelation `default`.`test2`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [id#623, age#624, name#625] (state=,code=0)
```

Query2:

```sql
select * from test1 except select * from test3;
```

Result2:

```
Error: org.apache.spark.sql.AnalysisException: Except can only be performed on tables with the compatible column types
 int <> timestamp at the 2th column of the second table;
```

The above query1 has the right exception message, while query2 has the wrong error information. It should be changed to the following:

```
Error: org.apache.spark.sql.AnalysisException: Except can only be performed on tables with the compatible column types.
int <> timestamp at the third column of the second table
```

### Does this PR introduce _any_ user-facing change?
NO

### How was this patch tested?
unit test

Closes #28951 from GuoPhilipse/32131-correct-error-messages.

Lead-authored-by: GuoPhilipse <46367746+GuoPhilipse@users.noreply.github.com>
Co-authored-by: GuoPhilipse <guofei_ok@126.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-06-30 23:33:50 -07:00
xy_xin 20cd47e82d [SPARK-32030][SQL] Support unlimited MATCHED and NOT MATCHED clauses in MERGE INTO
### What changes were proposed in this pull request?
This PR add unlimited MATCHED and NOT MATCHED clauses in MERGE INTO statement.

### Why are the changes needed?
Now the MERGE INTO syntax is,
```
MERGE INTO [db_name.]target_table [AS target_alias]
 USING [db_name.]source_table [<time_travel_version>] [AS source_alias]
 ON <merge_condition>
 [ WHEN MATCHED [ AND <condition> ] THEN <matched_action> ]
 [ WHEN MATCHED [ AND <condition> ] THEN <matched_action> ]
 [ WHEN NOT MATCHED [ AND <condition> ] THEN <not_matched_action> ]
```
It would be nice if we supported unlimited MATCHED and NOT MATCHED clauses in the MERGE INTO statement, because users may want to deal with different "AND <condition>"s, the result of which behaves just like a series of "CASE WHEN"s. The expected syntax looks like:
```
MERGE INTO [db_name.]target_table [AS target_alias]
 USING [db_name.]source_table [<time_travel_version>] [AS source_alias]
 ON <merge_condition>
 [when_matched_clause [, ...]]
 [when_not_matched_clause [, ...]]
```
where when_matched_clause is
```
WHEN MATCHED [ AND <condition> ] THEN <matched_action>
```
and when_not_matched_clause is
```
WHEN NOT MATCHED [ AND <condition> ] THEN <not_matched_action>
 ```
matched_action can be one of
```
DELETE
UPDATE SET * or
UPDATE SET col1 = value1 [, col2 = value2, ...]
```
and not_matched_action can be one of
```
INSERT *
INSERT (col1 [, col2, ...]) VALUES (value1 [, value2, ...])
```
### Does this PR introduce _any_ user-facing change?
Yes. The SQL command changes, but it is backward compatible.

### How was this patch tested?
New tests added.

Closes #28875 from xianyinxin/SPARK-32030.

Authored-by: xy_xin <xianyin.xxy@alibaba-inc.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-06-29 13:13:42 +00:00
Liang-Chi Hsieh 4204a63d4f [SPARK-32056][SQL] Coalesce partitions for repartition by expressions when AQE is enabled
### What changes were proposed in this pull request?

This patch proposes to coalesce partitions for repartition by expressions without specifying number of partitions, when AQE is enabled.

### Why are the changes needed?

When repartitioning by partition expressions, users may or may not specify the number of partitions. If the number of partitions is specified, we should not coalesce partitions because that breaks user expectations. But without a specified number of partitions, AQE should be able to coalesce partitions as with other shuffles.
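
A hedged shell-style sketch of the two cases (the column name is illustrative):

```scala
import org.apache.spark.sql.functions.col

spark.conf.set("spark.sql.adaptive.enabled", "true")
val df = spark.range(1000).withColumn("key", col("id") % 10)
df.repartition(col("key"))       // no number specified: AQE may coalesce partitions
df.repartition(100, col("key"))  // explicit number: respected, not coalesced
```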

### Does this PR introduce _any_ user-facing change?

Yes. After this change, if users don't specify the number of partitions when repartitioning data by expressions, AQE will coalesce partitions.

### How was this patch tested?

Added unit test.

Closes #28900 from viirya/SPARK-32056.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-06-29 11:33:40 +00:00
yi.wu 0ec17c989d [SPARK-32090][SQL] Improve UserDefinedType.equal() to make it be symmetrical
### What changes were proposed in this pull request?

This PR fixes `UserDefinedType.equal()` by comparing the UDT class instead of checking `acceptsType()`.

### Why are the changes needed?

It's weird that an equality comparison between two UDT types can have different results depending on the order:

```scala
// ExampleSubTypeUDT.userClass is a subclass of ExampleBaseTypeUDT.userClass
val udt1 = new ExampleBaseTypeUDT
val udt2 = new ExampleSubTypeUDT
println(udt1 == udt2) // true
println(udt2 == udt1) // false
```

### Does this PR introduce _any_ user-facing change?

Yes.

Before:
```scala
// ExampleSubTypeUDT.userClass is a subclass of ExampleBaseTypeUDT.userClass
val udt1 = new ExampleBaseTypeUDT
val udt2 = new ExampleSubTypeUDT
println(udt1 == udt2) // true
println(udt2 == udt1) // false
```

After:
```scala
// ExampleSubTypeUDT.userClass is a subclass of ExampleBaseTypeUDT.userClass
val udt1 = new ExampleBaseTypeUDT
val udt2 = new ExampleSubTypeUDT
println(udt1 == udt2) // false
println(udt2 == udt1) // false
```

### How was this patch tested?

Added a unit test.

Closes #28923 from Ngone51/fix-udt-equal.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-06-28 21:49:10 -07:00
Yuanjian Li 6484c14c57 [SPARK-32115][SQL] Fix SUBSTRING to handle integer overflows
### What changes were proposed in this pull request?
Bug fix for overflow case in `UTF8String.substringSQL`.

### Why are the changes needed?
The SQL query `SELECT SUBSTRING("abc", -1207959552, -1207959552)` incorrectly returns `"abc"` instead of the expected output `""`. For the query `SUBSTRING("abc", -100, -100)`, we get the right output of `""`.
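
The overflow case from the description, as a runnable hedged check:

```scala
// Expected result after the fix: an empty string, matching SUBSTRING("abc", -100, -100).
spark.sql("""SELECT SUBSTRING("abc", -1207959552, -1207959552)""").show()
```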

### Does this PR introduce _any_ user-facing change?
Yes, bug fix for the overflow case.

### How was this patch tested?
New UT.

Closes #28937 from xuanyuanking/SPARK-32115.

Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-06-28 12:22:44 -07:00
Pablo Langa bbb2cba615 [SPARK-32025][SQL] Csv schema inference problems with different types in the same column
### What changes were proposed in this pull request?

This pull request fixes a bug in the CSV type inference, which has problems when there are different types in the same column.

**Previously:**
```
$ cat /example/f1.csv
col1
43200000
true

spark.read.csv(path="file:///example/*.csv", header=True, inferSchema=True).show()
+----+
|col1|
+----+
|null|
|true|
+----+

root
 |-- col1: boolean (nullable = true)
```
**Now**
```
spark.read.csv(path="file:///example/*.csv", header=True, inferSchema=True).show()
+--------+
|    col1|
+--------+
|43200000|
|    true|
+--------+

root
 |-- col1: string (nullable = true)
```

Previously the hierarchy of type inference is the following:

> IntegerType
> > LongType
> > > DecimalType
> > > > DoubleType
> > > > > TimestampType
> > > > > > BooleanType
> > > > > > > StringType

So when, for example, we have integers in one column and the last element is a boolean, the whole column is incorrectly inferred as a boolean column, and all the numbers are shown as null when you view the data.

We need the following hierarchy. When we have different numeric types in the column, it will be resolved correctly; when we have other mixed types, it will be resolved as a String type column:
> IntegerType
> > LongType
> > > DecimalType
> > > > DoubleType
> > > > > StringType

> TimestampType
> > StringType

> BooleanType
> > StringType

> StringType

### Why are the changes needed?

Fix the bug explained

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit test and manual tests

Closes #28896 from planga82/feature/SPARK-32025_csv_inference.

Authored-by: Pablo Langa <soypab@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-06-26 10:41:27 +09:00
yi.wu 47fb9d6054 [SPARK-32087][SQL] Allow UserDefinedType to use encoder to deserialize rows in ScalaUDF as well
### What changes were proposed in this pull request?

This PR tries to address the comment: https://github.com/apache/spark/pull/28645#discussion_r442183888
It changes `canUpCast/canCast` to allow casting from a sub UDT to its base UDT, in order to allow UserDefinedType to use `ExpressionEncoder` to deserialize rows in ScalaUDF as well.

One thing worth mentioning is that, even though we allow casting from a sub UDT to its base UDT, `Cast` doesn't really perform the cast, because a sub UDT and its base UDT are still considered the same type (because of #16660); see:

5264164a67/sql/catalyst/src/main/scala/org/apache/spark/sql/types/UserDefinedType.scala (L81-L86)

5264164a67/sql/catalyst/src/main/scala/org/apache/spark/sql/types/UserDefinedType.scala (L92-L95)

Therefore, the optimize rule `SimplifyCast` will eliminate the cast at the end.

### Why are the changes needed?

Reduce the special case caused by `UserDefinedType` in `ResolveEncodersInUDF` and `ScalaUDF`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

It should be covered by the test of `SPARK-19311`, which is also updated a little in this PR.

Closes #28920 from Ngone51/fix-udf-udt.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-06-24 14:50:45 +00:00
Bryan Cutler df04107934 [SPARK-32080][SPARK-31998][SQL] Simplify ArrowColumnVector ListArray accessor
### What changes were proposed in this pull request?

This change simplifies the ArrowColumnVector ListArray accessor to use provided Arrow APIs available in v0.15.0 to calculate element indices.

### Why are the changes needed?

This simplifies the code by avoiding manual calculations on the Arrow offset buffer and makes use of more stable APIs.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing tests

Closes #28915 from BryanCutler/arrow-simplify-ArrowColumnVector-ListArray-SPARK-32080.

Authored-by: Bryan Cutler <cutlerb@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-06-24 22:13:54 +09:00
Max Gekk e00f43cb86 [SPARK-32043][SQL] Replace Decimal by Int op in make_interval and make_timestamp
### What changes were proposed in this pull request?
Replace Decimal ops by Int ops in the `MakeInterval` & `MakeTimestamp` expressions. For instance, `(secs * Decimal(MICROS_PER_SECOND)).toLong` can be replaced by the unscaled long, since the unscaled value already contains microseconds.

### Why are the changes needed?
To improve performance.

Before:
```
make_timestamp():                         Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
...
make_timestamp(2019, 1, 2, 3, 4, 50.123456)             94             99           4         10.7          93.8      38.8X
```

After:
```
make_timestamp(2019, 1, 2, 3, 4, 50.123456)             76             92          15         13.1          76.5      48.1X
```

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
- By existing test suites `IntervalExpressionsSuite`, `DateExpressionsSuite` and etc.
- Re-generate results of `MakeDateTimeBenchmark` in the environment:

| Item | Description |
| ---- | ----|
| Region | us-west-2 (Oregon) |
| Instance | r3.xlarge |
| AMI | ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1 (ami-06f2f779464715dc5) |
| Java | OpenJDK 64-Bit Server VM 1.8.0_252 and OpenJDK 64-Bit Server VM 11.0.7+10 |

Closes #28886 from MaxGekk/make_interval-opt-decimal.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-06-23 11:45:12 +00:00
Max Gekk fcf9768098 [SPARK-32052][SQL] Extract common code from date-time field expressions
### What changes were proposed in this pull request?
Extract common code from the expressions that get date or time fields from input dates/timestamps to new expressions `GetDateField` and `GetTimeField`, and re-use the common traits from the affected classes.

### Why are the changes needed?
Code deduplication improves maintainability.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By `DateExpressionsSuite`

Closes #28894 from MaxGekk/get-date-time-field-expr.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-06-23 06:13:55 +00:00
Max Gekk 979a8eb04a [MINOR][SQL] Simplify DateTimeUtils.cleanLegacyTimestampStr
### What changes were proposed in this pull request?
Call the `replace()` method from `UTF8String` to remove the `GMT` string from the input of `DateTimeUtils.cleanLegacyTimestampStr`. It removes all `GMT` substrings.

### Why are the changes needed?
Simpler code improves maintainability

### Does this PR introduce _any_ user-facing change?
Should not

### How was this patch tested?
By existing test suites `JsonSuite` and `UnivocityParserSuite`.

Closes #28892 from MaxGekk/simplify-cleanLegacyTimestampStr.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-06-23 05:53:04 +00:00
yi.wu 338efee509 [SPARK-32031][SQL] Fix the wrong references of the PartialMerge/Final AggregateExpression
### What changes were proposed in this pull request?

This PR changes the references of the `PartialMerge`/`Final` `AggregateExpression` from `aggBufferAttributes` to `inputAggBufferAttributes`.

After this change, the tests of `SPARK-31620` can fail on the assertion of `QueryTest.assertEmptyMissingInput`.  So, this PR also fixes it by overriding the `inputAggBufferAttributes` of the Aggregate operators.

### Why are the changes needed?

With my understanding of the Aggregate framework, and especially according to the logic of `AggUtils.planAggXXX`, I think the right references for the `PartialMerge`/`Final` `AggregateExpression` should be `inputAggBufferAttributes`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Before this patch, for an Aggregate operator, its input attributes were always equal to or a superset of its reference attributes (because it referred to its own attributes while it should refer to the attributes from its child). Therefore, its missing inputs were always empty and nothing broke. Thus, it's impossible to add a UT for this patch alone.

However, after correcting the right references in this PR, the problem is then exposed by `QueryTest.assertEmptyMissingInput` in the UT of SPARK-31620, since missing inputs are no longer always empty. This PR can fix the problem.

Closes #28869 from Ngone51/fix-agg-reference.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-06-22 13:59:46 +00:00
Dilip Biswal 6293c38cff [MINOR][SQL] Add IS [NOT] NULL examples to ArrayFilter/ArrayExists
### What changes were proposed in this pull request?
A minor PR that adds a couple of usage examples for ArrayFilter and ArrayExists that show how to deal with NULL data.

### Why are the changes needed?
Enhances the examples to show how to filter out null values from an array and how to test whether a null value exists in an array.
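
A hedged example of the kind of usage being documented (the values are illustrative):

```scala
spark.sql("SELECT filter(array(1, NULL, 3), x -> x IS NOT NULL)").show(false)
spark.sql("SELECT exists(array(1, NULL, 3), x -> x IS NULL)").show(false)
```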

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Tested manually.

Closes #28890 from dilipbiswal/array_func_description.

Authored-by: Dilip Biswal <dkbiswal@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-06-22 21:38:19 +09:00
ulysses 978493467c [SPARK-32019][SQL] Add spark.sql.files.minPartitionNum config
### What changes were proposed in this pull request?

Add a new config, `spark.sql.files.minPartitionNum`, to control file-split partitions at the session level.

### Why are the changes needed?

Aims to control file-split partitions at the session level.
For more details, see the discussion in [PR-28778](https://github.com/apache/spark/pull/28778).
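
A hedged example of setting the new session-level config (the value is illustrative):

```scala
spark.conf.set("spark.sql.files.minPartitionNum", "200")
```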

### Does this PR introduce _any_ user-facing change?

Yes, new config.

### How was this patch tested?

Add UT.

Closes #28853 from ulysses-you/SPARK-32019.

Authored-by: ulysses <youxiduo@weidian.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-06-20 18:38:44 -07:00
Max Gekk 66ba35666a [SPARK-32021][SQL] Increase precision of seconds and fractions of make_interval
### What changes were proposed in this pull request?
Change the precision of seconds and its fraction from 8 to 18 so that intervals of the max allowed microseconds value (long) can be constructed.

### Why are the changes needed?
To improve UX of Spark SQL.
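
A hedged example of a seconds value whose total digits exceed the old precision of 8 (the value is illustrative):

```scala
spark.sql("SELECT make_interval(0, 0, 0, 0, 0, 0, 123456789.123456)").show(false)
```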

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
- Add tests to IntervalExpressionsSuite
- Add an example to the `MakeInterval` expression
- Add tests to `interval.sql`

Closes #28873 from MaxGekk/make_interval-sec-precision.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-06-19 19:33:13 -07:00
TJX2014 177a380bcf [SPARK-31980][SQL] Function sequence() fails if start and end of range are equal dates
### What changes were proposed in this pull request?
1. In `org.apache.spark.sql.catalyst.expressions.Sequence.TemporalSequenceImpl#eval`, treat equal start/end values the same as the "greater than" case in the stop condition.
2. Unit test for interval `day`, `month`, `year`

### Why are the changes needed?
A bug exists when the sequence input has equal start and end dates, which causes the `while` loop to run forever.

### Does this PR introduce _any_ user-facing change?
Yes,
Before this PR, people would get a `java.lang.ArrayIndexOutOfBoundsException` when evaluating the query below:
`sql("select sequence(cast('2011-03-01' as date), cast('2011-03-01' as date), interval 1 year)").show(false)
`

### How was this patch tested?
Unit test.

Closes #28819 from TJX2014/master-SPARK-31980.

Authored-by: TJX2014 <xiaoxingstack@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-06-19 19:24:34 -07:00
Terry Kim 7b8683820b [SPARK-31350][SQL] Coalesce bucketed tables for sort merge join if applicable
### What changes were proposed in this pull request?

When two bucketed tables with different numbers of buckets are joined, a full shuffle can be introduced:
```
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "0")
val df1 = (0 until 20).map(i => (i % 5, i % 13, i.toString)).toDF("i", "j", "k")
val df2 = (0 until 20).map(i => (i % 7, i % 11, i.toString)).toDF("i", "j", "k")
df1.write.format("parquet").bucketBy(8, "i").saveAsTable("t1")
df2.write.format("parquet").bucketBy(4, "i").saveAsTable("t2")
val t1 = spark.table("t1")
val t2 = spark.table("t2")
val joined = t1.join(t2, t1("i") === t2("i"))
joined.explain

== Physical Plan ==
*(5) SortMergeJoin [i#44], [i#50], Inner
:- *(2) Sort [i#44 ASC NULLS FIRST], false, 0
:  +- Exchange hashpartitioning(i#44, 200), true, [id=#105]
:     +- *(1) Project [i#44, j#45, k#46]
:        +- *(1) Filter isnotnull(i#44)
:           +- *(1) ColumnarToRow
:              +- FileScan parquet default.t1[i#44,j#45,k#46] Batched: true, DataFilters: [isnotnull(i#44)], Format: Parquet, Location: InMemoryFileIndex[...], PartitionFilters: [], PushedFilters: [IsNotNull(i)], ReadSchema: struct<i:int,j:int,k:string>, SelectedBucketsCount: 8 out of 8
+- *(4) Sort [i#50 ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(i#50, 200), true, [id=#115]
      +- *(3) Project [i#50, j#51, k#52]
         +- *(3) Filter isnotnull(i#50)
            +- *(3) ColumnarToRow
               +- FileScan parquet default.t2[i#50,j#51,k#52] Batched: true, DataFilters: [isnotnull(i#50)], Format: Parquet, Location: InMemoryFileIndex[...], PartitionFilters: [], PushedFilters: [IsNotNull(i)], ReadSchema: struct<i:int,j:int,k:string>, SelectedBucketsCount: 4 out of 4
```
This PR proposes to introduce coalescing buckets when the following conditions are met to eliminate the full shuffle:
- Join is the sort merge one (which is created only for equi-join).
- Join keys match with output partition expressions on their respective sides.
- The larger bucket number is divisible by the smaller bucket number.
- `spark.sql.bucketing.coalesceBucketsInSortMergeJoin.enabled` is set to `true`.
- The ratio of the number of buckets should be less than the value set in `spark.sql.bucketing.coalesceBucketsInSortMergeJoin.maxBucketRatio`.

### Why are the changes needed?

Eliminating the full shuffle can benefit scenarios where two large tables are joined. Especially when the tables are already bucketed but differ in the number of buckets, we can take advantage of this.

### Does this PR introduce any user-facing change?

If the bucket coalescing conditions explained above are met, a full shuffle can be eliminated (also note that you will see `SelectedBucketsCount: 8 out of 8 (Coalesced to 4)` in the physical plan):
```
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", "0")
spark.conf.set("spark.sql.bucketing.coalesceBucketsInSortMergeJoin.enabled", "true")
val df1 = (0 until 20).map(i => (i % 5, i % 13, i.toString)).toDF("i", "j", "k")
val df2 = (0 until 20).map(i => (i % 7, i % 11, i.toString)).toDF("i", "j", "k")
df1.write.format("parquet").bucketBy(8, "i").saveAsTable("t1")
df2.write.format("parquet").bucketBy(4, "i").saveAsTable("t2")
val t1 = spark.table("t1")
val t2 = spark.table("t2")
val joined = t1.join(t2, t1("i") === t2("i"))
joined.explain

== Physical Plan ==
*(3) SortMergeJoin [i#44], [i#50], Inner
:- *(1) Sort [i#44 ASC NULLS FIRST], false, 0
:  +- *(1) Project [i#44, j#45, k#46]
:     +- *(1) Filter isnotnull(i#44)
:        +- *(1) ColumnarToRow
:           +- FileScan parquet default.t1[i#44,j#45,k#46] Batched: true, DataFilters: [isnotnull(i#44)], Format: Parquet, Location: InMemoryFileIndex[...], PartitionFilters: [], PushedFilters: [IsNotNull(i)], ReadSchema: struct<i:int,j:int,k:string>, SelectedBucketsCount: 8 out of 8 (Coalesced to 4)
+- *(2) Sort [i#50 ASC NULLS FIRST], false, 0
   +- *(2) Project [i#50, j#51, k#52]
      +- *(2) Filter isnotnull(i#50)
         +- *(2) ColumnarToRow
            +- FileScan parquet default.t2[i#50,j#51,k#52] Batched: true, DataFilters: [isnotnull(i#50)], Format: Parquet, Location: InMemoryFileIndex[...], PartitionFilters: [], PushedFilters: [IsNotNull(i)], ReadSchema: struct<i:int,j:int,k:string>, SelectedBucketsCount: 4 out of 4
```

### How was this patch tested?

Added unit tests

Closes #28123 from imback82/coalescing_bucket.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-06-20 08:20:45 +09:00
yi.wu 5ee5cfd9c0 [SPARK-31826][SQL] Support composed type of case class for typed Scala UDF
### What changes were proposed in this pull request?

This PR adds support for typed Scala UDFs to accept composed types of case classes, e.g. Seq[T], Array[T], Map[Int, T] (assuming T is a case class type), as input parameter types.

### Why are the changes needed?

After #27937, typed Scala UDFs support case classes as input parameter types. However, they cannot accept composed types of case classes, such as Seq[T], Array[T], Map[Int, T] (assuming T is a case class type), which causes confusion for users (e.g. https://github.com/apache/spark/pull/27937#discussion_r422699979).

### Does this PR introduce _any_ user-facing change?

Yes.

Run the query:

```
scala> case class Person(name: String, age: Int)
scala> Seq((1, Seq(Person("Jack", 5)))).toDF("id", "persons").withColumn("ages", udf{ s: Seq[Person] => s.head.age }.apply(col("persons"))).show

```

Before:

```

org.apache.spark.SparkException: Failed to execute user defined function($read$$Lambda$2861/628175152: (array<struct<name:string,age:int>>) => int)
  at org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:1129)
  at org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:156)
  at org.apache.spark.sql.catalyst.expressions.InterpretedMutableProjection.apply(InterpretedMutableProjection.scala:83)
  at org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation$$anonfun$apply$17.$anonfun$applyOrElse$69(Optimizer.scala:1492)
  at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)

....

Caused by: java.lang.ClassCastException: org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema cannot be cast to Person
  at $anonfun$res3$1(<console>:30)
  at $anonfun$res3$1$adapted(<console>:30)
  at org.apache.spark.sql.catalyst.expressions.ScalaUDF.$anonfun$f$2(ScalaUDF.scala:156)
  at org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:1126)
  ... 142 more
```

After:
```
+---+-----------+----+
| id|    persons|ages|
+---+-----------+----+
|  1|[[Jack, 5]]| [5]|
+---+-----------+----+
```

### How was this patch tested?

Added tests.

Closes #28645 from Ngone51/impr-udf.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-06-19 12:45:47 +00:00
Jungtaek Lim (HeartSaVioR) 6fe3bf66eb [SPARK-31993][SQL] Build arrays for passing variables generated from children for 'concat_ws' with columns having at least one of array type
### What changes were proposed in this pull request?

Please refer to the next section, `Why are the changes needed?`, for details on how the current implementation of `concat_ws` is broken under certain conditions.

This patch fixes the code generation logic of `concat_ws` for inputs containing at least one column of array type: it builds two arrays for storing isNull and value from the children's generated code, and passes these arrays to both varargCounts and varargBuilds. This change guarantees that both varargCounts and varargBuilds can access, as array parameters, the relevant local variables the children's generated code creates, which is critical for both varargCounts and varargBuilds to compile successfully.

Below is the generated code for the newly added UT, `SPARK-31993: concat_ws in agg function with plenty of string/array types columns`.

> before the patch

```
/* 001 */ public java.lang.Object generate(Object[] references) {
/* 002 */   return new SpecificUnsafeProjection(references);
/* 003 */ }
/* 004 */
/* 005 */ class SpecificUnsafeProjection extends org.apache.spark.sql.catalyst.expressions.UnsafeProjection {
/* 006 */
/* 007 */   private Object[] references;
/* 008 */   private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter[] mutableStateArray_0 = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter[1];
/* 009 */
/* 010 */   public SpecificUnsafeProjection(Object[] references) {
/* 011 */     this.references = references;
/* 012 */     mutableStateArray_0[0] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(1, 32);
/* 013 */
/* 014 */   }
/* 015 */
/* 016 */   public void initialize(int partitionIndex) {
/* 017 */
/* 018 */   }
/* 019 */
/* 020 */   // Scala.Function1 need this
/* 021 */   public java.lang.Object apply(java.lang.Object row) {
/* 022 */     return apply((InternalRow) row);
/* 023 */   }
/* 024 */
/* 025 */   public UnsafeRow apply(InternalRow i) {
/* 026 */     mutableStateArray_0[0].reset();
/* 027 */
/* 028 */
/* 029 */     mutableStateArray_0[0].zeroOutNullBytes();
/* 030 */
/* 031 */     apply_0_0(i);
/* 032 */     apply_0_1(i);
/* 033 */     int varargNum_0 = 30;
/* 034 */     int idxInVararg_0 = 0;
/* 035 */
/* 036 */     if (!isNull_2) {
/* 037 */       varargNum_0 += value_2.numElements();
/* 038 */     }
/* 039 */
/* 040 */     if (!isNull_3) {
/* 041 */       varargNum_0 += value_3.numElements();
/* 042 */     }
/* 043 */
/* 044 */     UTF8String[] array_0 = new UTF8String[varargNum_0];
/* 045 */     idxInVararg_0 = varargBuildsConcatWs_0_0(i, array_0, idxInVararg_0);
/* 046 */     idxInVararg_0 = varargBuildsConcatWs_0_1(i, array_0, idxInVararg_0);
/* 047 */     idxInVararg_0 = varargBuildsConcatWs_0_2(i, array_0, idxInVararg_0);
/* 048 */     UTF8String value_0 = UTF8String.concatWs(((UTF8String) references[0] /* literal */), array_0);
/* 049 */     boolean isNull_0 = value_0 == null;
/* 050 */     mutableStateArray_0[0].write(0, value_0);
/* 051 */     return (mutableStateArray_0[0].getRow());
/* 052 */   }
/* 053 */
/* 054 */
/* 055 */   private void apply_0_1(InternalRow i) {
/* 056 */     UTF8String value_25 = i.getUTF8String(22);UTF8String value_26 = i.getUTF8String(23);UTF8String value_27 = i.getUTF8String(24);UTF8String value_28 = i.getUTF8String(25);UTF8String value_29 = i.getUTF8String(26);UTF8String value_30 = i.getUTF8String(27);UTF8String value_31 = i.getUTF8String(28);UTF8String value_32 = i.getUTF8String(29);UTF8String value_33 = i.getUTF8String(30);
/* 057 */   }
/* 058 */
/* 059 */
/* 060 */   private int varargBuildsConcatWs_0_0(InternalRow i, UTF8String [] array_0, int idxInVararg_0) {
/* 061 */
/* 062 */
/* 063 */     if (!isNull_2) {
/* 064 */       final int n_0 = value_2.numElements();
/* 065 */       for (int j = 0; j < n_0; j ++) {
/* 066 */         array_0[idxInVararg_0 ++] = value_2.getUTF8String(j);
/* 067 */       }
/* 068 */     }
/* 069 */
/* 070 */     if (!isNull_3) {
/* 071 */       final int n_1 = value_3.numElements();
/* 072 */       for (int j = 0; j < n_1; j ++) {
/* 073 */         array_0[idxInVararg_0 ++] = value_3.getUTF8String(j);
/* 074 */       }
/* 075 */     }
/* 076 */     array_0[idxInVararg_0 ++] = false ? (UTF8String) null : value_4;array_0[idxInVararg_0 ++] = false ? (UTF8String) null : value_5;array_0[idxInVararg_0 ++] = false ? (UTF8String) null : value_6;array_0[idxInVararg_0 ++] = false ? (UTF8String) null : value_7;array_0[idxInVararg_0 ++] = false ? (UTF8String) null : value_8;array_0[idxInVararg_0 ++] = false ? (UTF8String) null : value_9;array_0[idxInVararg_0 ++] = false ? (UTF8String) null : value_10;array_0[idxInVararg_0 ++] = false ? (UTF8String) null : value_11;
/* 077 */     return idxInVararg_0;
/* 078 */
/* 079 */   }
/* 080 */
/* 081 */
/* 082 */   private int varargBuildsConcatWs_0_2(InternalRow i, UTF8String [] array_0, int idxInVararg_0) {
/* 083 */
/* 084 */     array_0[idxInVararg_0 ++] = false ? (UTF8String) null : value_28;array_0[idxInVararg_0 ++] = false ? (UTF8String) null : value_29;array_0[idxInVararg_0 ++] = false ? (UTF8String) null : value_30;array_0[idxInVararg_0 ++] = false ? (UTF8String) null : value_31;array_0[idxInVararg_0 ++] = false ? (UTF8String) null : value_32;array_0[idxInVararg_0 ++] = false ? (UTF8String) null : value_33;
/* 085 */     return idxInVararg_0;
/* 086 */
/* 087 */   }
/* 088 */
/* 089 */
/* 090 */   private void apply_0_0(InternalRow i) {
/* 091 */     boolean isNull_2 = i.isNullAt(31);
/* 092 */     ArrayData value_2 = isNull_2 ?
/* 093 */     null : (i.getArray(31));boolean isNull_3 = i.isNullAt(32);
/* 094 */     ArrayData value_3 = isNull_3 ?
/* 095 */     null : (i.getArray(32));UTF8String value_4 = i.getUTF8String(1);UTF8String value_5 = i.getUTF8String(2);UTF8String value_6 = i.getUTF8String(3);UTF8String value_7 = i.getUTF8String(4);UTF8String value_8 = i.getUTF8String(5);UTF8String value_9 = i.getUTF8String(6);UTF8String value_10 = i.getUTF8String(7);UTF8String value_11 = i.getUTF8String(8);UTF8String value_12 = i.getUTF8String(9);UTF8String value_13 = i.getUTF8String(10);UTF8String value_14 = i.getUTF8String(11);UTF8String value_15 = i.getUTF8String(12);UTF8String value_16 = i.getUTF8String(13);UTF8String value_17 = i.getUTF8String(14);UTF8String value_18 = i.getUTF8String(15);UTF8String value_19 = i.getUTF8String(16);UTF8String value_20 = i.getUTF8String(17);UTF8String value_21 = i.getUTF8String(18);UTF8String value_22 = i.getUTF8String(19);UTF8String value_23 = i.getUTF8String(20);UTF8String value_24 = i.getUTF8String(21);
/* 096 */   }
/* 097 */
/* 098 */
/* 099 */   private int varargBuildsConcatWs_0_1(InternalRow i, UTF8String [] array_0, int idxInVararg_0) {
/* 100 */
/* 101 */     array_0[idxInVararg_0 ++] = false ? (UTF8String) null : value_12;array_0[idxInVararg_0 ++] = false ? (UTF8String) null : value_13;array_0[idxInVararg_0 ++] = false ? (UTF8String) null : value_14;array_0[idxInVararg_0 ++] = false ? (UTF8String) null : value_15;array_0[idxInVararg_0 ++] = false ? (UTF8String) null : value_16;array_0[idxInVararg_0 ++] = false ? (UTF8String) null : value_17;array_0[idxInVararg_0 ++] = false ? (UTF8String) null : value_18;array_0[idxInVararg_0 ++] = false ? (UTF8String) null : value_19;array_0[idxInVararg_0 ++] = false ? (UTF8String) null : value_20;array_0[idxInVararg_0 ++] = false ? (UTF8String) null : value_21;array_0[idxInVararg_0 ++] = false ? (UTF8String) null : value_22;array_0[idxInVararg_0 ++] = false ? (UTF8String) null : value_23;array_0[idxInVararg_0 ++] = false ? (UTF8String) null : value_24;array_0[idxInVararg_0 ++] = false ? (UTF8String) null : value_25;array_0[idxInVararg_0 ++] = false ? (UTF8String) null : value_26;array_0[idxInVararg_0 ++] = false ? (UTF8String) null : value_27;
/* 102 */     return idxInVararg_0;
/* 103 */
/* 104 */   }
/* 105 */
/* 106 */ }
```

Compilation of the generated code fails with error message: `org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 36, Column 6: Expression "isNull_2" is not an rvalue`

> after the patch

```
/* 001 */ public java.lang.Object generate(Object[] references) {
/* 002 */   return new SpecificUnsafeProjection(references);
/* 003 */ }
/* 004 */
/* 005 */ class SpecificUnsafeProjection extends org.apache.spark.sql.catalyst.expressions.UnsafeProjection {
/* 006 */
/* 007 */   private Object[] references;
/* 008 */   private boolean globalIsNull_0;
/* 009 */   private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter[] mutableStateArray_0 = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter[1];
/* 010 */
/* 011 */   public SpecificUnsafeProjection(Object[] references) {
/* 012 */     this.references = references;
/* 013 */
/* 014 */     mutableStateArray_0[0] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(1, 32);
/* 015 */
/* 016 */   }
/* 017 */
/* 018 */   public void initialize(int partitionIndex) {
/* 019 */
/* 020 */   }
/* 021 */
/* 022 */   // Scala.Function1 need this
/* 023 */   public java.lang.Object apply(java.lang.Object row) {
/* 024 */     return apply((InternalRow) row);
/* 025 */   }
/* 026 */
/* 027 */   public UnsafeRow apply(InternalRow i) {
/* 028 */     mutableStateArray_0[0].reset();
/* 029 */
/* 030 */
/* 031 */     mutableStateArray_0[0].zeroOutNullBytes();
/* 032 */
/* 033 */     UTF8String value_34 = ConcatWs_0(i);
/* 034 */     mutableStateArray_0[0].write(0, value_34);
/* 035 */     return (mutableStateArray_0[0].getRow());
/* 036 */   }
/* 037 */
/* 038 */
/* 039 */   private void initializeArgsArrays_0_0(InternalRow i, boolean [] isNullArgs_0, Object [] valueArgs_0) {
/* 040 */
/* 041 */     boolean isNull_2 = i.isNullAt(31);
/* 042 */     ArrayData value_2 = isNull_2 ?
/* 043 */     null : (i.getArray(31));
/* 044 */     isNullArgs_0[0] = isNull_2;
/* 045 */     valueArgs_0[0] = value_2;
/* 046 */
/* 047 */     boolean isNull_3 = i.isNullAt(32);
/* 048 */     ArrayData value_3 = isNull_3 ?
/* 049 */     null : (i.getArray(32));
/* 050 */     isNullArgs_0[1] = isNull_3;
/* 051 */     valueArgs_0[1] = value_3;
/* 052 */
/* 053 */     UTF8String value_4 = i.getUTF8String(1);
/* 054 */     isNullArgs_0[2] = false;
/* 055 */     valueArgs_0[2] = value_4;
/* 056 */
/* 057 */     UTF8String value_5 = i.getUTF8String(2);
/* 058 */     isNullArgs_0[3] = false;
/* 059 */     valueArgs_0[3] = value_5;
/* 060 */
/* 061 */     UTF8String value_6 = i.getUTF8String(3);
/* 062 */     isNullArgs_0[4] = false;
/* 063 */     valueArgs_0[4] = value_6;
/* 064 */
/* 065 */     UTF8String value_7 = i.getUTF8String(4);
/* 066 */     isNullArgs_0[5] = false;
/* 067 */     valueArgs_0[5] = value_7;
/* 068 */
/* 069 */     UTF8String value_8 = i.getUTF8String(5);
/* 070 */     isNullArgs_0[6] = false;
/* 071 */     valueArgs_0[6] = value_8;
/* 072 */
/* 073 */   }
/* 074 */
/* 075 */
/* 076 */   private void initializeArgsArrays_0_3(InternalRow i, boolean [] isNullArgs_0, Object [] valueArgs_0) {
/* 077 */
/* 078 */     UTF8String value_25 = i.getUTF8String(22);
/* 079 */     isNullArgs_0[23] = false;
/* 080 */     valueArgs_0[23] = value_25;
/* 081 */
/* 082 */     UTF8String value_26 = i.getUTF8String(23);
/* 083 */     isNullArgs_0[24] = false;
/* 084 */     valueArgs_0[24] = value_26;
/* 085 */
/* 086 */     UTF8String value_27 = i.getUTF8String(24);
/* 087 */     isNullArgs_0[25] = false;
/* 088 */     valueArgs_0[25] = value_27;
/* 089 */
/* 090 */     UTF8String value_28 = i.getUTF8String(25);
/* 091 */     isNullArgs_0[26] = false;
/* 092 */     valueArgs_0[26] = value_28;
/* 093 */
/* 094 */     UTF8String value_29 = i.getUTF8String(26);
/* 095 */     isNullArgs_0[27] = false;
/* 096 */     valueArgs_0[27] = value_29;
/* 097 */
/* 098 */     UTF8String value_30 = i.getUTF8String(27);
/* 099 */     isNullArgs_0[28] = false;
/* 100 */     valueArgs_0[28] = value_30;
/* 101 */
/* 102 */     UTF8String value_31 = i.getUTF8String(28);
/* 103 */     isNullArgs_0[29] = false;
/* 104 */     valueArgs_0[29] = value_31;
/* 105 */
/* 106 */     UTF8String value_32 = i.getUTF8String(29);
/* 107 */     isNullArgs_0[30] = false;
/* 108 */     valueArgs_0[30] = value_32;
/* 109 */
/* 110 */   }
/* 111 */
/* 112 */
/* 113 */   private int varargBuildsConcatWs_0_3(InternalRow i, UTF8String [] array_0, int idxInVararg_0, boolean [] isNullArgs_0, Object [] valueArgs_0) {
/* 114 */
/* 115 */     array_0[idxInVararg_0 ++] = isNullArgs_0[29] ? (UTF8String) null : ((UTF8String) valueArgs_0[29]);array_0[idxInVararg_0 ++] = isNullArgs_0[30] ? (UTF8String) null : ((UTF8String) valueArgs_0[30]);array_0[idxInVararg_0 ++] = isNullArgs_0[31] ? (UTF8String) null : ((UTF8String) valueArgs_0[31]);
/* 116 */     return idxInVararg_0;
/* 117 */
/* 118 */   }
/* 119 */
/* 120 */
/* 121 */   private int varargBuildsConcatWs_0_0(InternalRow i, UTF8String [] array_0, int idxInVararg_0, boolean [] isNullArgs_0, Object [] valueArgs_0) {
/* 122 */
/* 123 */
/* 124 */     if (!isNullArgs_0[0]) {
/* 125 */       final int n_0 = ((ArrayData) valueArgs_0[0]).numElements();
/* 126 */       for (int j = 0; j < n_0; j ++) {
/* 127 */         array_0[idxInVararg_0 ++] = ((ArrayData) valueArgs_0[0]).getUTF8String(j);
/* 128 */       }
/* 129 */     }
/* 130 */
/* 131 */     if (!isNullArgs_0[1]) {
/* 132 */       final int n_1 = ((ArrayData) valueArgs_0[1]).numElements();
/* 133 */       for (int j = 0; j < n_1; j ++) {
/* 134 */         array_0[idxInVararg_0 ++] = ((ArrayData) valueArgs_0[1]).getUTF8String(j);
/* 135 */       }
/* 136 */     }
/* 137 */     array_0[idxInVararg_0 ++] = isNullArgs_0[2] ? (UTF8String) null : ((UTF8String) valueArgs_0[2]);array_0[idxInVararg_0 ++] = isNullArgs_0[3] ? (UTF8String) null : ((UTF8String) valueArgs_0[3]);array_0[idxInVararg_0 ++] = isNullArgs_0[4] ? (UTF8String) null : ((UTF8String) valueArgs_0[4]);array_0[idxInVararg_0 ++] = isNullArgs_0[5] ? (UTF8String) null : ((UTF8String) valueArgs_0[5]);array_0[idxInVararg_0 ++] = isNullArgs_0[6] ? (UTF8String) null : ((UTF8String) valueArgs_0[6]);
/* 138 */     return idxInVararg_0;
/* 139 */
/* 140 */   }
/* 141 */
/* 142 */
/* 143 */   private UTF8String ConcatWs_0(InternalRow i) {
/* 144 */     boolean[] isNullArgs_0 = new boolean[32];
/* 145 */     Object[] valueArgs_0 = new Object[32];
/* 146 */     initializeArgsArrays_0_0(i, isNullArgs_0, valueArgs_0);
/* 147 */     initializeArgsArrays_0_1(i, isNullArgs_0, valueArgs_0);
/* 148 */     initializeArgsArrays_0_2(i, isNullArgs_0, valueArgs_0);
/* 149 */     initializeArgsArrays_0_3(i, isNullArgs_0, valueArgs_0);
/* 150 */     initializeArgsArrays_0_4(i, isNullArgs_0, valueArgs_0);
/* 151 */     int varargNum_0 = 30;
/* 152 */     int idxInVararg_0 = 0;
/* 153 */
/* 154 */     if (!isNullArgs_0[0]) {
/* 155 */       varargNum_0 += ((ArrayData) valueArgs_0[0]).numElements();
/* 156 */     }
/* 157 */
/* 158 */     if (!isNullArgs_0[1]) {
/* 159 */       varargNum_0 += ((ArrayData) valueArgs_0[1]).numElements();
/* 160 */     }
/* 161 */
/* 162 */     UTF8String[] array_0 = new UTF8String[varargNum_0];
/* 163 */     idxInVararg_0 = varargBuildsConcatWs_0_0(i, array_0, idxInVararg_0, isNullArgs_0, valueArgs_0);
/* 164 */     idxInVararg_0 = varargBuildsConcatWs_0_1(i, array_0, idxInVararg_0, isNullArgs_0, valueArgs_0);
/* 165 */     idxInVararg_0 = varargBuildsConcatWs_0_2(i, array_0, idxInVararg_0, isNullArgs_0, valueArgs_0);
/* 166 */     idxInVararg_0 = varargBuildsConcatWs_0_3(i, array_0, idxInVararg_0, isNullArgs_0, valueArgs_0);
/* 167 */     UTF8String value_0 = UTF8String.concatWs(((UTF8String) references[0] /* literal */), array_0);
/* 168 */     boolean isNull_0 = value_0 == null;
/* 169 */     globalIsNull_0 = isNull_0;
/* 170 */     return value_0;
/* 171 */   }
/* 172 */
/* 173 */
/* 174 */   private void initializeArgsArrays_0_2(InternalRow i, boolean [] isNullArgs_0, Object [] valueArgs_0) {
/* 175 */
/* 176 */     UTF8String value_17 = i.getUTF8String(14);
/* 177 */     isNullArgs_0[15] = false;
/* 178 */     valueArgs_0[15] = value_17;
/* 179 */
/* 180 */     UTF8String value_18 = i.getUTF8String(15);
/* 181 */     isNullArgs_0[16] = false;
/* 182 */     valueArgs_0[16] = value_18;
/* 183 */
/* 184 */     UTF8String value_19 = i.getUTF8String(16);
/* 185 */     isNullArgs_0[17] = false;
/* 186 */     valueArgs_0[17] = value_19;
/* 187 */
/* 188 */     UTF8String value_20 = i.getUTF8String(17);
/* 189 */     isNullArgs_0[18] = false;
/* 190 */     valueArgs_0[18] = value_20;
/* 191 */
/* 192 */     UTF8String value_21 = i.getUTF8String(18);
/* 193 */     isNullArgs_0[19] = false;
/* 194 */     valueArgs_0[19] = value_21;
/* 195 */
/* 196 */     UTF8String value_22 = i.getUTF8String(19);
/* 197 */     isNullArgs_0[20] = false;
/* 198 */     valueArgs_0[20] = value_22;
/* 199 */
/* 200 */     UTF8String value_23 = i.getUTF8String(20);
/* 201 */     isNullArgs_0[21] = false;
/* 202 */     valueArgs_0[21] = value_23;
/* 203 */
/* 204 */     UTF8String value_24 = i.getUTF8String(21);
/* 205 */     isNullArgs_0[22] = false;
/* 206 */     valueArgs_0[22] = value_24;
/* 207 */
/* 208 */   }
/* 209 */
/* 210 */
/* 211 */   private int varargBuildsConcatWs_0_2(InternalRow i, UTF8String [] array_0, int idxInVararg_0, boolean [] isNullArgs_0, Object [] valueArgs_0) {
/* 212 */
/* 213 */     array_0[idxInVararg_0 ++] = isNullArgs_0[18] ? (UTF8String) null : ((UTF8String) valueArgs_0[18]);array_0[idxInVararg_0 ++] = isNullArgs_0[19] ? (UTF8String) null : ((UTF8String) valueArgs_0[19]);array_0[idxInVararg_0 ++] = isNullArgs_0[20] ? (UTF8String) null : ((UTF8String) valueArgs_0[20]);array_0[idxInVararg_0 ++] = isNullArgs_0[21] ? (UTF8String) null : ((UTF8String) valueArgs_0[21]);array_0[idxInVararg_0 ++] = isNullArgs_0[22] ? (UTF8String) null : ((UTF8String) valueArgs_0[22]);array_0[idxInVararg_0 ++] = isNullArgs_0[23] ? (UTF8String) null : ((UTF8String) valueArgs_0[23]);array_0[idxInVararg_0 ++] = isNullArgs_0[24] ? (UTF8String) null : ((UTF8String) valueArgs_0[24]);array_0[idxInVararg_0 ++] = isNullArgs_0[25] ? (UTF8String) null : ((UTF8String) valueArgs_0[25]);array_0[idxInVararg_0 ++] = isNullArgs_0[26] ? (UTF8String) null : ((UTF8String) valueArgs_0[26]);array_0[idxInVararg_0 ++] = isNullArgs_0[27] ? (UTF8String) null : ((UTF8String) valueArgs_0[27]);array_0[idxInVararg_0 ++] = isNullArgs_0[28] ? (UTF8String) null : ((UTF8String) valueArgs_0[28]);
/* 214 */     return idxInVararg_0;
/* 215 */
/* 216 */   }
/* 217 */
/* 218 */
/* 219 */   private void initializeArgsArrays_0_4(InternalRow i, boolean [] isNullArgs_0, Object [] valueArgs_0) {
/* 220 */
/* 221 */     UTF8String value_33 = i.getUTF8String(30);
/* 222 */     isNullArgs_0[31] = false;
/* 223 */     valueArgs_0[31] = value_33;
/* 224 */
/* 225 */   }
/* 226 */
/* 227 */
/* 228 */   private void initializeArgsArrays_0_1(InternalRow i, boolean [] isNullArgs_0, Object [] valueArgs_0) {
/* 229 */
/* 230 */     UTF8String value_9 = i.getUTF8String(6);
/* 231 */     isNullArgs_0[7] = false;
/* 232 */     valueArgs_0[7] = value_9;
/* 233 */
/* 234 */     UTF8String value_10 = i.getUTF8String(7);
/* 235 */     isNullArgs_0[8] = false;
/* 236 */     valueArgs_0[8] = value_10;
/* 237 */
/* 238 */     UTF8String value_11 = i.getUTF8String(8);
/* 239 */     isNullArgs_0[9] = false;
/* 240 */     valueArgs_0[9] = value_11;
/* 241 */
/* 242 */     UTF8String value_12 = i.getUTF8String(9);
/* 243 */     isNullArgs_0[10] = false;
/* 244 */     valueArgs_0[10] = value_12;
/* 245 */
/* 246 */     UTF8String value_13 = i.getUTF8String(10);
/* 247 */     isNullArgs_0[11] = false;
/* 248 */     valueArgs_0[11] = value_13;
/* 249 */
/* 250 */     UTF8String value_14 = i.getUTF8String(11);
/* 251 */     isNullArgs_0[12] = false;
/* 252 */     valueArgs_0[12] = value_14;
/* 253 */
/* 254 */     UTF8String value_15 = i.getUTF8String(12);
/* 255 */     isNullArgs_0[13] = false;
/* 256 */     valueArgs_0[13] = value_15;
/* 257 */
/* 258 */     UTF8String value_16 = i.getUTF8String(13);
/* 259 */     isNullArgs_0[14] = false;
/* 260 */     valueArgs_0[14] = value_16;
/* 261 */
/* 262 */   }
/* 263 */
/* 264 */
/* 265 */   private int varargBuildsConcatWs_0_1(InternalRow i, UTF8String [] array_0, int idxInVararg_0, boolean [] isNullArgs_0, Object [] valueArgs_0) {
/* 266 */
/* 267 */     array_0[idxInVararg_0 ++] = isNullArgs_0[7] ? (UTF8String) null : ((UTF8String) valueArgs_0[7]);array_0[idxInVararg_0 ++] = isNullArgs_0[8] ? (UTF8String) null : ((UTF8String) valueArgs_0[8]);array_0[idxInVararg_0 ++] = isNullArgs_0[9] ? (UTF8String) null : ((UTF8String) valueArgs_0[9]);array_0[idxInVararg_0 ++] = isNullArgs_0[10] ? (UTF8String) null : ((UTF8String) valueArgs_0[10]);array_0[idxInVararg_0 ++] = isNullArgs_0[11] ? (UTF8String) null : ((UTF8String) valueArgs_0[11]);array_0[idxInVararg_0 ++] = isNullArgs_0[12] ? (UTF8String) null : ((UTF8String) valueArgs_0[12]);array_0[idxInVararg_0 ++] = isNullArgs_0[13] ? (UTF8String) null : ((UTF8String) valueArgs_0[13]);array_0[idxInVararg_0 ++] = isNullArgs_0[14] ? (UTF8String) null : ((UTF8String) valueArgs_0[14]);array_0[idxInVararg_0 ++] = isNullArgs_0[15] ? (UTF8String) null : ((UTF8String) valueArgs_0[15]);array_0[idxInVararg_0 ++] = isNullArgs_0[16] ? (UTF8String) null : ((UTF8String) valueArgs_0[16]);array_0[idxInVararg_0 ++] = isNullArgs_0[17] ? (UTF8String) null : ((UTF8String) valueArgs_0[17]);
/* 268 */     return idxInVararg_0;
/* 269 */
/* 270 */   }
/* 271 */
/* 272 */ }
```

### Why are the changes needed?

The generated code in `concat_ws` fails to compile when the below conditions are met:

* Plenty of columns are provided as input of `concat_ws`.
* There's at least one column with array[string] type. (In other words, not all columns are string type.)
* Splitting methods is triggered in `splitExpressionsWithCurrentInputs`.
  * This is a bit tricky: the method won't split code under whole-stage codegen, and it is simply a no-op (inlined) if the number of blocks to convert into methods is 1.

a0187cd6b5/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/stringExpressions.scala (L88-L195)

There are three parts of generated code in `concat_ws` (`codes`, `varargCounts`, `varargBuilds`), and each part tries to split methods by itself, while `varargCounts` and `varargBuilds` refer to the generated code in `codes`; hence the overall generated code fails to compile if any of the parts is actually split.
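
To make the problematic shape concrete, here is a hedged sketch (column names and counts are made up) of the kind of input the new UT feeds to `concat_ws`: many string columns plus at least one `array<string>` column. Whether the pre-patch compile failure is actually triggered also depends on the codegen/splitting settings described above.

```scala
// Hedged sketch only: builds a row with 30 string columns and one array<string>
// column, then concatenates all of them with concat_ws.
import org.apache.spark.sql.functions._

val stringCols = (1 to 30).map(i => lit(s"s$i").as(s"c$i"))
val arrayCol = array(lit("x"), lit("y")).as("arr")
val df = spark.range(1).select((stringCols :+ arrayCol): _*)

df.select(concat_ws(",", df.columns.map(col): _*).as("joined")).show(false)
```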

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New UTs added. (One for verification of the patch, another one for regression test)

Closes #28831 from HeartSaVioR/SPARK-31993.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-06-19 06:01:06 +00:00
Yuanjian Li 86b54f3321 [SPARK-31894][SS] Introduce UnsafeRow format validation for streaming state store
### What changes were proposed in this pull request?
Introduce UnsafeRow format validation for streaming state store.

### Why are the changes needed?
Currently, Structured Streaming directly puts the UnsafeRow into StateStore without any schema validation. This is dangerous behavior when users reuse a checkpoint file during migration: any change or bug fix related to the aggregate function may cause random exceptions, or even wrong answers, e.g. SPARK-28067.

### Does this PR introduce _any_ user-facing change?
Yes. If underlying format changes are detected when the checkpoint is reused during migration, an InvalidUnsafeRowException will be thrown.

### How was this patch tested?
UT added. Integration tests for more scenarios will be added separately in another PR.

Closes #28707 from xuanyuanking/SPARK-31894.

Lead-authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Co-authored-by: Yuanjian Li <yuanjian.li@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-06-19 05:56:50 +00:00
Max Gekk 17a5007fd8 [SPARK-30865][SQL][SS] Refactor DateTimeUtils
### What changes were proposed in this pull request?

1. Move TimeZoneUTC and TimeZoneGMT to DateTimeTestUtils
2. Remove TimeZoneGMT
3. Use ZoneId.systemDefault() instead of defaultTimeZone().toZoneId
4. Alias SQLDate & SQLTimestamp to internal types of DateType and TimestampType
5. Avoid one `*` in `DateTimeUtils`.`fromJulianDay()`
6. Use toTotalMonths in `DateTimeUtils`.`subtractDates()`
7. Remove `julianCommonEraStart`, `timestampToString()`, `microsToEpochDays()`, `epochDaysToMicros()`, `instantToDays()` from `DateTimeUtils`.
8. Make splitDate() private.
9. Remove `def daysToMicros(days: Int): Long` and `def microsToDays(micros: Long): Int`.

### Why are the changes needed?

This simplifies the common code related to date-time operations, and should improve maintainability. In particular:

1. TimeZoneUTC and TimeZoneGMT are moved to DateTimeTestUtils because they are used only in tests
2. TimeZoneGMT can be removed because it is equal to TimeZoneUTC
3. After the PR #27494, Spark expressions and DateTimeUtils functions switched to ZoneId instead of TimeZone completely. `defaultTimeZone()` with `TimeZone` as return type is not needed anymore.
4. SQLDate and SQLTimestamp types can be explicitly aliased to the internal types of DateType and TimestampType instead of declaring this in a comment.
5. Avoid one `*` in `DateTimeUtils`.`fromJulianDay()`.
6. Use toTotalMonths in `DateTimeUtils`.`subtractDates()`.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
By existing test suites

Closes #27617 from MaxGekk/move-time-zone-consts.

Lead-authored-by: Max Gekk <max.gekk@gmail.com>
Co-authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-06-19 05:41:09 +00:00
Dilip Biswal e4f5036146 [SPARK-32020][SQL] Better error message when SPARK_HOME or spark.test.home is not set
### What changes were proposed in this pull request?
Better error message when SPARK_HOME or spark.test.home is not set.

### Why are the changes needed?
Currently the error message is not easily consumable, as it prints the real error only after printing the current environment (see below), which is rather long.

**Old output**
```
 time.name" -> "Java(TM) SE Runtime Environment", "sun.boot.library.path" -> "/Library/Java/JavaVirtualMachines/jdk1.8.0_221.jdk/Contents/Home/jre/lib",
 "java.vm.version" -> "25.221-b11",
 . . .
 . . .
 . . .
) did not contain key "SPARK_HOME" spark.test.home or SPARK_HOME is not set.
	at org.scalatest.Assertions.newAssertionFailedExceptio
```

**New output**
```
An exception or error caused a run to abort: spark.test.home or SPARK_HOME is not set.
org.scalatest.exceptions.TestFailedException: spark.test.home or SPARK_HOME is not set
```

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?
Ran the tests manually in IntelliJ to see the new error.

Closes #28825 from dilipbiswal/minor-spark-31950-followup.

Authored-by: Dilip Biswal <dkbiswal@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-06-18 22:45:55 +09:00
Max Gekk afd8a8b964 [SPARK-31989][SQL] Generate JSON rebasing files w/ 30 minutes step
### What changes were proposed in this pull request?
1. Change the max step from 1 week to 30 minutes in the tests `RebaseDateTimeSuite`.`generate 'gregorian-julian-rebase-micros.json'` and `generate 'julian-gregorian-rebase-micros.json'`.
2. Parallelise JSON files generation in the function `generateRebaseJson` by using `ThreadUtils.parmap`.

### Why are the changes needed?
1. To prevent the bugs that are fixed by https://github.com/apache/spark/pull/28787 and https://github.com/apache/spark/pull/28816.
2. The parallelisation speeds up JSON file generation.

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
By generating the JSON file `julian-gregorian-rebase-micros.json`.

Closes #28827 from MaxGekk/rebase-30-min.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-06-17 12:07:36 +09:00
Takeshi Yamamuro 8d577092ed [SPARK-31705][SQL][FOLLOWUP] Avoid the unnecessary CNF computation for full-outer joins
### What changes were proposed in this pull request?

To avoid the unnecessary CNF computation for full-outer joins, this PR fixes code for filtering out full-outer joins at the entrance of the rule.

### Why are the changes needed?

To mitigate optimizer overhead.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #28810 from maropu/SPARK-31705.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Yuming Wang <wgyumg@gmail.com>
2020-06-16 09:13:00 -07:00
Max Gekk 6e9ff72195 [SPARK-31984][SQL] Make micros rebasing functions via local timestamps pure
### What changes were proposed in this pull request?
1. Pass the time zone given as the first parameter of `RebaseDateTime`.`rebaseJulianToGregorianMicros()` and `rebaseGregorianToJulianMicros()` to the Java 7 `GregorianCalendar`.
```scala
    val cal = new Calendar.Builder()
      // `gregory` is a hybrid calendar that supports both the Julian and Gregorian calendar systems
      .setCalendarType("gregory")
    ...
      .setTimeZone(tz)
      .build()
```
This makes the instance of the calendar independent from the default JVM time zone.

2. Change the type of the first parameter from `ZoneId` to `TimeZone`. This avoids an unnecessary conversion from `TimeZone` to `ZoneId`, for example in
```scala
  def rebaseJulianToGregorianMicros(micros: Long): Long = {
    ...
      if (rebaseRecord == null || micros < rebaseRecord.switches(0)) {
        rebaseJulianToGregorianMicros(timeZone.toZoneId, micros)
```
and back to `TimeZone` inside of `rebaseJulianToGregorianMicros(zoneId: ZoneId, ...)`

3. Modify tests in `RebaseDateTimeSuite`, and set the default JVM time zone only for functions that depend on it.

### Why are the changes needed?
1. Ignoring the passed parameter and using a global variable is bad practice.
2. Depending on global state doesn't allow running the functions in parallel: there is a non-zero probability that the functions may return a wrong result if the default JVM time zone is changed during their execution.
3. This opens the opportunity to parallelise the generation of the JSON files `gregorian-julian-rebase-micros.json` and `julian-gregorian-rebase-micros.json`. Currently, the tests `generate 'gregorian-julian-rebase-micros.json'` and `generate 'julian-gregorian-rebase-micros.json'` generate the JSON files by iterating over all time zones sequentially w/ a step of 1 week. Due to the large step, we can miss some spikes in diffs between the 2 calendars (Java 8 Gregorian and Java 7 hybrid calendars), as the PR https://github.com/apache/spark/pull/28787 fixed and https://github.com/apache/spark/pull/28816 should fix.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running existing tests from `RebaseDateTimeSuite`.

Closes #28824 from MaxGekk/pure-micros-rebasing.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-06-16 12:56:27 +00:00
GuoPhilipse f0e6d0ec13 [SPARK-31710][SQL] Fail casting numeric to timestamp by default
## What changes were proposed in this pull request?
We fail casting from numeric to timestamp by default.

## Why are the changes needed?
Casting from numeric to timestamp is non-standard; meanwhile, it may generate different results between Spark and other systems, for example Hive.

## Does this PR introduce any user-facing change?
Yes. Users cannot cast numeric to timestamp directly; they have to use the following functions to achieve the same effect: TIMESTAMP_SECONDS/TIMESTAMP_MILLIS/TIMESTAMP_MICROS, as illustrated below.
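
A hedged illustration of the replacement functions named above (the epoch values are arbitrary examples):

```scala
// Instead of CAST(<numeric> AS TIMESTAMP), use the dedicated conversion functions.
spark.sql("SELECT TIMESTAMP_SECONDS(1230219000)").show()       // seconds since 1970-01-01 00:00:00 UTC
spark.sql("SELECT TIMESTAMP_MILLIS(1230219000123)").show()     // milliseconds since the epoch
spark.sql("SELECT TIMESTAMP_MICROS(1230219000123123)").show()  // microseconds since the epoch
```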

## How was this patch tested?
unit test added

Closes #28593 from GuoPhilipse/31710-fix-compatibility.

Lead-authored-by: GuoPhilipse <guofei_ok@126.com>
Co-authored-by: GuoPhilipse <46367746+GuoPhilipse@users.noreply.github.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-06-16 08:35:35 +00:00
Max Gekk e9145d41f3 [SPARK-31986][SQL] Fix Julian-Gregorian micros rebasing of overlapping local timestamps
### What changes were proposed in this pull request?
It fixes microseconds rebasing from the hybrid calendar (Julian + Gregorian) to Proleptic Gregorian calendar in the function `RebaseDateTime`.`rebaseJulianToGregorianMicros(zoneId: ZoneId, micros: Long): Long` in the case of local timestamp overlapping.

In the case of overlapping, we look ahead by 1 day to determine which instant we should take - the earlier or the later zoned timestamp. If our current standard zone and DST offsets are equal to the zone offset of the next date, we choose the later timestamp, otherwise the earlier one. For example, the local timestamp **1945-11-18 01:30:00.0** can be mapped to two instants (microseconds since 1970-01-01 00:00:00Z): -761211000000000 or -761207400000000. If the first one is passed to `rebaseJulianToGregorianMicros()`, we take the earlier instant in the Proleptic Gregorian calendar while rebasing, **1945-11-18T01:30+09:00[Asia/Hong_Kong]**, otherwise the later one, **1945-11-18T01:30+08:00[Asia/Hong_Kong]**.

Note: The fix assumes that only one transition of standard or DST offsets can occur during a day.

### Why are the changes needed?
The current implementation of `rebaseJulianToGregorianMicros()` handles overlapping timestamps only during daylight saving time, but overlapping can also happen during a transition from one standard time zone offset to another. For example, in the case of `Asia/Hong_Kong`, the time zone switched from `Japan Standard Time` (UTC+9) to `Hong Kong Time` (UTC+8) on _Sunday, 18 November, 1945 01:59:59 AM_. The changes handle this special case as well.

### Does this PR introduce _any_ user-facing change?
There is no behaviour change for timestamps of CE after 0001-01-01. The PR might affect timestamps of BCE, for which the modified `rebaseJulianToGregorianMicros()` is called directly.

### How was this patch tested?

1. By existing tests in `DateTimeUtilsSuite`, `RebaseDateTimeSuite`, `DateFunctionsSuite`, `DateExpressionsSuite` and `TimestampFormatterSuite`.

2. Added new checks to `RebaseDateTimeSuite`.`SPARK-31959: JST -> HKT at Asia/Hong_Kong in 1945`:
```scala
      assert(rebaseJulianToGregorianMicros(hkZid, rebasedEarlierMicros) === earlierMicros)
      assert(rebaseJulianToGregorianMicros(hkZid, rebasedLaterMicros) === laterMicros)
```

3. Regenerated `julian-gregorian-rebase-micros.json` with the step of 30 minutes, and got the same JSON file. The JSON file isn't affected because previously it was generated with the step of 1 week. And the spike in diffs/switch points during 1 hour of timestamp overlapping wasn't detected.

Closes #28816 from MaxGekk/fix-overlap-julian-2-grep.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-06-16 06:00:05 +00:00
Takeshi Yamamuro 3698a14204 [SPARK-26905][SQL] Follow the SQL:2016 reserved keywords
### What changes were proposed in this pull request?

This PR intends to move keywords `ANTI`, `SEMI`, and `MINUS` from reserved to non-reserved.

### Why are the changes needed?

To comply with the ANSI/SQL standard.
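
As a hedged illustration, by analogy with the `TYPE` example further down this log, moving these keywords to non-reserved should allow them to be used as identifiers even with ANSI mode enabled:

```scala
// Assumed behavior after this change: ANTI, SEMI and MINUS accepted as column names.
spark.sql("SET spark.sql.ansi.enabled=true")
spark.sql("CREATE TABLE keyword_demo (anti INT, semi INT, minus INT) USING parquet")
```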

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added tests.

Closes #28807 from maropu/SPARK-26905-2.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-06-16 00:27:45 +09:00
Max Gekk eae1747b66 [SPARK-31959][SQL][TESTS][FOLLOWUP] Adopt the test "SPARK-31959: JST -> HKT at Asia/Hong_Kong in 1945" to outdated tzdb
### What changes were proposed in this pull request?
Old JDKs can have an outdated time zone database in which `Asia/Hong_Kong` doesn't have timestamp overlapping in 1946 at all. This PR changes the test "SPARK-31959: JST -> HKT at Asia/Hong_Kong in 1945" in `RebaseDateTimeSuite` and makes it tolerant of that case.

### Why are the changes needed?
To fix the test failures on old JDK w/ outdated tzdb like on Jenkins machine `research-jenkins-worker-09`.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running the test on old JDK

Closes #28832 from MaxGekk/HongKong-tz-1945-followup.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-06-15 08:09:07 -07:00
Liang-Chi Hsieh 8282bbf12d [SPARK-27633][SQL] Remove redundant aliases in NestedColumnAliasing
## What changes were proposed in this pull request?

In the NestedColumnAliasing rule, we create aliases for nested field accesses in the project list. We already consider the case where a top-level parent field and nested fields under it are both accessed; in that case, we don't create the aliases because they are redundant.

There is another case, where a nested parent field and nested fields under it are both accessed, which we don't consider yet. We don't need to create aliases in this case either.

## How was this patch tested?

Added test.

Closes #24525 from viirya/SPARK-27633.

Lead-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Co-authored-by: Liang-Chi Hsieh <liangchi@uber.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-06-15 11:01:56 +09:00
HyukjinKwon a620a2a7e5 [SPARK-31977][SQL] Returns the plan directly from NestedColumnAliasing
### What changes were proposed in this pull request?

This proposes a minor refactoring to match `NestedColumnAliasing` to `GeneratorNestedColumnAliasing` so it returns the pruned plan directly.

```scala
    case p @ NestedColumnAliasing(nestedFieldToAlias, attrToAliases) =>
      NestedColumnAliasing.replaceToAliases(p, nestedFieldToAlias, attrToAliases)
```

vs

```scala
    case GeneratorNestedColumnAliasing(p) => p
```

### Why are the changes needed?

Just for readability.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests should cover.

Closes #28812 from HyukjinKwon/SPARK-31977.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-06-13 07:26:37 +09:00
Takeshi Yamamuro 78d08a8c38 [SPARK-31950][SQL][TESTS] Extract SQL keywords from the SqlBase.g4 file
### What changes were proposed in this pull request?

This PR intends to extract SQL reserved/non-reserved keywords from the ANTLR grammar file (`SqlBase.g4`) directly.

This approach is based on the cloud-fan suggestion: https://github.com/apache/spark/pull/28779#issuecomment-642033217

### Why are the changes needed?

It is hard to maintain a full set of the keywords in `TableIdentifierParserSuite`, so it would be nice if we could extract them from the `SqlBase.g4` file directly.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #28802 from maropu/SPARK-31950-2.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-06-13 07:12:27 +09:00
Liang-Chi Hsieh ff89b11143 [SPARK-31736][SQL] Nested column aliasing for RepartitionByExpression/Join
### What changes were proposed in this pull request?

Currently we only push nested column pruning through a few operators such as LIMIT, SAMPLE, etc. This patch extends the feature to other operators including RepartitionByExpression, Join.

### Why are the changes needed?

Currently nested column pruning is only applied to a few operators, which limits its benefit. Extending the coverage of nested column pruning makes this feature more generally applicable across different queries.
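
As a hedged sketch (the path and schema below are made up, not from the PR), here is a query shape that should now benefit from nested column pruning:

```scala
// Only person.name should need to be read from the nested struct, even though a
// repartition-by-expression sits between the scan and the final projection.
import spark.implicits._

val df = spark.read.parquet("/tmp/people")   // assumed schema: person struct<name: string, age: int>
df.repartition($"person.name")
  .select($"person.name")
  .explain(true)                             // look for a ReadSchema containing only person.name
```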

### Does this PR introduce _any_ user-facing change?

Yes. More SQL operators are covered by nested column pruning.

### How was this patch tested?

Added unit test, end-to-end tests.

Closes #28556 from viirya/others-column-pruning.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-06-12 16:54:55 +09:00
Max Gekk c259844df8 [SPARK-31959][SQL][TEST-JAVA11] Fix Gregorian-Julian micros rebasing while switching standard time zone offset
### What changes were proposed in this pull request?
Fix the bug in microseconds rebasing during transitions from one standard time zone offset to another one. In the PR, I propose to change the implementation of `rebaseGregorianToJulianMicros` which performs rebasing via local timestamps. In the case of overlapping:
1. Check whether the original instant corresponds to the earlier or the later instant of the overlapped local timestamp.
2. If it is the earlier instant, take the zone and DST offsets from the previous day, otherwise
3. Set the time zone offsets of the Julian timestamp from the next day.

Note: The fix assumes that transitions cannot happen more often than once per 2 days.

### Why are the changes needed?
The current implementation handles overlapping timestamps only during daylight saving time, but overlapping can also happen during a transition from one standard time zone offset to another. For example, in the case of `Asia/Hong_Kong`, the time zone switched from `Japan Standard Time` (UTC+9) to `Hong Kong Time` (UTC+8) on _Sunday, 18 November, 1945 01:59:59 AM_. The changes handle this special case as well.

### Does this PR introduce _any_ user-facing change?
It might affect micros rebasing before the common era when the non-optimised version of `rebaseGregorianToJulianMicros()` is used directly.

### How was this patch tested?
1. By existing tests in `DateTimeUtilsSuite`, `RebaseDateTimeSuite`, `DateFunctionsSuite`, `DateExpressionsSuite` and `TimestampFormatterSuite`.
2. Added new test to `RebaseDateTimeSuite`
3. Regenerated `gregorian-julian-rebase-micros.json` with the step of 30 minutes, and got the same JSON file. The JSON file isn't affected because previously it was generated with the step of 1 week. And the spike in diffs/switch points during 1 hour of timestamp overlapping wasn't detected.

Closes #28787 from MaxGekk/HongKong-tz-1945.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-06-12 06:17:31 +00:00
Yuming Wang 78f9043862 [SPARK-31912][SQL][TESTS] Normalize all binary comparison expressions
### What changes were proposed in this pull request?

This PR normalizes all binary comparison expressions when comparing plans.

### Why are the changes needed?

Improve the test framework; otherwise this test will fail:
```scala
  test("SPARK-31912 Normalize all binary comparison expressions") {
    val original = testRelation
      .where('a === 'b && Literal(13) >= 'b).as("x")
    val optimized = testRelation
      .where(IsNotNull('a) && IsNotNull('b) && 'a === 'b && 'b <= 13 && 'a <= 13).as("x")
    comparePlans(Optimize.execute(original.analyze), optimized.analyze)
  }
```

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?

Manual test.

Closes #28734 from wangyum/SPARK-31912.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Yuming Wang <wgyumg@gmail.com>
2020-06-11 22:50:36 -07:00
Dilip Biswal b87a342c7d [SPARK-31916][SQL] StringConcat can lead to StringIndexOutOfBoundsException
### What changes were proposed in this pull request?
A minor fix to the append method of StringConcat that caps the tracked length at MAX_ROUNDED_ARRAY_LENGTH, to make sure it does not overflow and cause a StringIndexOutOfBoundsException.

Thanks to **Jeffrey Stokes** for reporting the issue and explaining the underlying problem in detail in the JIRA.

### Why are the changes needed?
This fixes StringIndexOutOfBoundsException on an overflow.
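
A minimal sketch of the capping idea (not the actual Spark implementation; the cap value below is an assumption):

```scala
// Sketch only: track the appended length so it can neither exceed the maximum
// safe array size nor overflow Int arithmetic.
val MAX_ROUNDED_ARRAY_LENGTH = Int.MaxValue - 15  // assumed value of the cap

def cappedLength(currentLength: Int, toAppend: String): Int = {
  // Do the addition in Long space, then clamp back to the capped Int range.
  math.min(currentLength.toLong + toAppend.length, MAX_ROUNDED_ARRAY_LENGTH.toLong).toInt
}
```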

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Added a test in StringsUtilSuite.

Closes #28750 from dilipbiswal/SPARK-31916.

Authored-by: Dilip Biswal <dkbiswal@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-06-12 09:19:29 +09:00
Takeshi Yamamuro b1adc3deee [SPARK-21117][SQL] Built-in SQL Function Support - WIDTH_BUCKET
### What changes were proposed in this pull request?

This PR intends to add a built-in SQL function - `WIDTH_BUCKET`.
It is the rework of #18323.

Closes #18323

The other RDBMS references for `WIDTH_BUCKET`:
 - Oracle: https://docs.oracle.com/cd/B28359_01/olap.111/b28126/dml_functions_2137.htm#OLADM717
 - PostgreSQL: https://www.postgresql.org/docs/current/functions-math.html
 - Snowflake: https://docs.snowflake.com/en/sql-reference/functions/width_bucket.html
 - Prestodb: https://prestodb.io/docs/current/functions/math.html
 - Teradata: https://docs.teradata.com/reader/kmuOwjp1zEYg98JsB8fu_A/Wa8vw69cGzoRyNULHZeudg
 - DB2: https://www.ibm.com/support/producthub/db2/docs/content/SSEPGG_11.5.0/com.ibm.db2.luw.sql.ref.doc/doc/r0061483.html?pos=2
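
As a hedged illustration (argument values borrowed from a commonly cited example in the references above), `WIDTH_BUCKET(value, min, max, numBuckets)` assigns a value to one of `numBuckets` equal-width buckets over `[min, max)`:

```scala
// 5.35 falls into the 3rd of 5 equal-width buckets over [0.024, 10.06), so this should return 3.
spark.sql("SELECT WIDTH_BUCKET(5.35, 0.024, 10.06, 5)").show()
```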

### Why are the changes needed?

For better usability.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added unit tests.

Closes #28764 from maropu/SPARK-21117.

Lead-authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Co-authored-by: Yuming Wang <wgyumg@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-06-11 14:15:28 -07:00
Gengliang Wang 11d3a744e2 [SPARK-31705][SQL] Push more possible predicates through Join via CNF conversion
### What changes were proposed in this pull request?

This PR adds a new rule to push more predicates through joins by rewriting the join condition to CNF (conjunctive normal form). The following example shows the steps of this rule:

1. Prepare Table:

```sql
CREATE TABLE x(a INT);
CREATE TABLE y(b INT);
...
SELECT * FROM x JOIN y ON ((a < 0 and a > b) or a > 10);
```

2. Convert the join condition to CNF:
```
(a < 0 or a > 10) and (a > b or a > 10)
```

3. Split conjunctive predicates

Predicates |
--- |
(a < 0 or a > 10) |
(a > b or a > 10) |

4. Push predicate

Table | Predicate
--- | ---
x | (a < 0 or a > 10)

### Why are the changes needed?
Improve query performance. PostgreSQL, [Impala](https://issues.apache.org/jira/browse/IMPALA-9183) and Hive support this feature.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Unit test and benchmark test.

SQL | Before this PR | After this PR
--- | --- | ---
TPCDS 5T Q13 | 84s | 21s
TPCDS 5T q85 | 66s | 34s
TPCH 1T q19 | 37s | 32s

Closes #28733 from gengliangwang/cnf.

Lead-authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Co-authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
2020-06-11 10:13:45 -07:00
Wenchen Fan 6fb9c80da1 [SPARK-31958][SQL] normalize special floating numbers in subquery
### What changes were proposed in this pull request?

This is a followup of https://github.com/apache/spark/pull/23388 .

https://github.com/apache/spark/pull/23388 has an issue: it doesn't handle subquery expressions and assumes they will be turned into joins. However, this is not true for non-correlated subquery expressions.

This PR fixes this issue. It now doesn't skip `Subquery`, and subquery expressions will be handled by `OptimizeSubqueries`, which runs the optimizer with the subquery.

Note that correlated subquery expressions will be handled twice: once in `OptimizeSubqueries`, and once later when they become joins. This is OK as `NormalizeFloatingNumbers` is idempotent now.

### Why are the changes needed?

fix a bug

### Does this PR introduce _any_ user-facing change?

yes, see the newly added test.

### How was this patch tested?

new test

Closes #28785 from cloud-fan/normalize.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-06-11 06:39:14 +00:00
Kent Yao 22dda6e18e [SPARK-31939][SQL][TEST-JAVA11] Fix Parsing day of year when year field pattern is missing
### What changes were proposed in this pull request?

If a datetime pattern contains no year field, the day-of-year field should not be ignored if it exists.

e.g.

```
spark-sql> select to_timestamp('31', 'DD');
1970-01-01 00:00:00
spark-sql> select to_timestamp('31 30', 'DD dd');
1970-01-30 00:00:00

spark.sql.legacy.timeParserPolicy legacy
spark-sql> select to_timestamp('31', 'DD');
1970-01-31 00:00:00
spark-sql> select to_timestamp('31 30', 'DD dd');
NULL
```

This PR only fixes some corner cases that use the 'D' pattern to parse datetimes when there is no 'y' field.

### Why are the changes needed?

fix some corner cases

### Does this PR introduce _any_ user-facing change?

yes, the day of year field will not be ignored

### How was this patch tested?

add unit tests.

Closes #28766 from yaooqinn/SPARK-31939.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-06-11 03:29:12 +00:00
Liang-Chi Hsieh 43063e2db2 [SPARK-27217][SQL] Nested column aliasing for more operators which can prune nested column
### What changes were proposed in this pull request?

Currently we only push nested column pruning from a Project through a few operators such as LIMIT, SAMPLE, etc. There are a few operators like Aggregate, Expand which can prune nested columns by themselves, without a Project on top.

This patch extends the feature to those operators.
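
As a hedged sketch (the path and schema are made up, not from the PR), here is an aggregate that references only nested fields, with no Project on top, which such operators can now prune by themselves:

```scala
// The scan should be pruned down to item.name and item.price only.
import org.apache.spark.sql.functions.sum
import spark.implicits._

spark.read.parquet("/tmp/orders")            // assumed schema: item struct<name: string, price: double>
  .groupBy($"item.name")
  .agg(sum($"item.price"))
  .explain(true)
```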

### Why are the changes needed?

Currently nested column pruning is only applied in a few cases, which limits its benefit. Extending the coverage of nested column pruning makes this feature more generally applicable across different queries.

### Does this PR introduce _any_ user-facing change?

Yes. More SQL operators are covered by nested column pruning.

### How was this patch tested?

Added unit test, end-to-end tests.

Closes #28560 from viirya/SPARK-27217-2.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-06-10 18:08:47 +09:00
Takeshi Yamamuro e14029b18d [SPARK-26905][SQL] Add TYPE in the ANSI non-reserved list
### What changes were proposed in this pull request?

This PR intends to add `TYPE` in the ANSI non-reserved list because it is not reserved in the standard. See SPARK-26905 for a full set of the reserved/non-reserved keywords of `SQL:2016`.

Note: The current master behaviour is as follows;
```
scala> sql("SET spark.sql.ansi.enabled=false")
scala> sql("create table t1 (type int)")
res4: org.apache.spark.sql.DataFrame = []

scala> sql("SET spark.sql.ansi.enabled=true")
scala> sql("create table t2 (type int)")
org.apache.spark.sql.catalyst.parser.ParseException:
no viable alternative at input 'type'(line 1, pos 17)

== SQL ==
create table t2 (type int)
-----------------^^^
```

### Why are the changes needed?

To follow the ANSI/SQL standard.

### Does this PR introduce _any_ user-facing change?

Allows users to use `TYPE` as an identifier.

### How was this patch tested?

Update the keyword lists in `TableIdentifierParserSuite`.

Closes #28773 from maropu/SPARK-26905.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-06-10 16:29:43 +09:00
Kent Yao 6a424b93e5 [SPARK-31830][SQL] Consistent error handling for datetime formatting and parsing functions
### What changes were proposed in this pull request?
Currently, `date_format` and `from_unixtime`, `unix_timestamp`,`to_unix_timestamp`, `to_timestamp`, `to_date`  have different exception handling behavior for formatting datetime values.

In this PR, we apply the exception handling behavior of `date_format` to `from_unixtime`, `unix_timestamp`,`to_unix_timestamp`, `to_timestamp` and `to_date`.

In the phase of creating the datetime formatter or formatting values, exceptions will be raised.

e.g.

```java
spark-sql> select date_format(make_timestamp(1, 1 ,1,1,1,1), 'yyyyyyyyyyy-MM-aaa');
20/05/28 15:25:38 ERROR SparkSQLDriver: Failed in [select date_format(make_timestamp(1, 1 ,1,1,1,1), 'yyyyyyyyyyy-MM-aaa')]
org.apache.spark.SparkUpgradeException: You may get a different result due to the upgrading of Spark 3.0: Fail to recognize 'yyyyyyyyyyy-MM-aaa' pattern in the DateTimeFormatter. 1) You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0. 2) You can form a valid datetime pattern with the guide from https://spark.apache.org/docs/latest/sql-ref-datetime-pattern.html
```

```java
spark-sql> select date_format(make_timestamp(1, 1 ,1,1,1,1), 'yyyyyyyyyyy-MM-AAA');
20/05/28 15:26:10 ERROR SparkSQLDriver: Failed in [select date_format(make_timestamp(1, 1 ,1,1,1,1), 'yyyyyyyyyyy-MM-AAA')]
java.lang.IllegalArgumentException: Illegal pattern character: A
```

```java
spark-sql> select date_format(make_timestamp(1,1,1,1,1,1), 'yyyyyyyyyyy-MM-dd');
20/05/28 15:23:23 ERROR SparkSQLDriver: Failed in [select date_format(make_timestamp(1,1,1,1,1,1), 'yyyyyyyyyyy-MM-dd')]
java.lang.ArrayIndexOutOfBoundsException: 11
	at java.time.format.DateTimeFormatterBuilder$NumberPrinterParser.format(DateTimeFormatterBuilder.java:2568)
```
In the phase of parsing, `DateTimeParseException | DateTimeException | ParseException` will be suppressed, but `SparkUpgradeException` will still be raised

e.g.

```java
spark-sql> set spark.sql.legacy.timeParserPolicy=exception;
spark.sql.legacy.timeParserPolicy	exception
spark-sql> select to_timestamp("2020-01-27T20:06:11.847-0800", "yyyy-MM-dd'T'HH:mm:ss.SSSz");
20/05/28 15:31:15 ERROR SparkSQLDriver: Failed in [select to_timestamp("2020-01-27T20:06:11.847-0800", "yyyy-MM-dd'T'HH:mm:ss.SSSz")]
org.apache.spark.SparkUpgradeException: You may get a different result due to the upgrading of Spark 3.0: Fail to parse '2020-01-27T20:06:11.847-0800' in the new parser. You can set spark.sql.legacy.timeParserPolicy to LEGACY to restore the behavior before Spark 3.0, or set to CORRECTED and treat it as an invalid datetime string.
```

```java
spark-sql> set spark.sql.legacy.timeParserPolicy=corrected;
spark.sql.legacy.timeParserPolicy	corrected
spark-sql> select to_timestamp("2020-01-27T20:06:11.847-0800", "yyyy-MM-dd'T'HH:mm:ss.SSSz");
NULL
spark-sql> set spark.sql.legacy.timeParserPolicy=legacy;
spark.sql.legacy.timeParserPolicy	legacy
spark-sql> select to_timestamp("2020-01-27T20:06:11.847-0800", "yyyy-MM-dd'T'HH:mm:ss.SSSz");
2020-01-28 12:06:11.847
```

### Why are the changes needed?
Consistency

### Does this PR introduce _any_ user-facing change?

Yes, invalid datetime patterns will fail `from_unixtime`, `unix_timestamp`, `to_unix_timestamp`, `to_timestamp` and `to_date` instead of returning `NULL`.

### How was this patch tested?

add more tests

Closes #28650 from yaooqinn/SPARK-31830.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-06-09 16:56:45 +00:00
Yuming Wang 1d1eacde9d [SPARK-31220][SQL] repartition obeys initialPartitionNum when adaptiveExecutionEnabled
### What changes were proposed in this pull request?
This PR makes `repartition`/`DISTRIBUTE BY` obeys [initialPartitionNum](af4248b2d6/sql/catalyst/src/main/scala/org/apache/spark/sql/internal/SQLConf.scala (L446-L455)) when adaptive execution enabled.

### Why are the changes needed?
To make `DISTRIBUTE BY`/`GROUP BY` use the same partition number.
How to reproduce:
```scala
spark.sql("CREATE TABLE spark_31220(id int)")
spark.sql("set spark.sql.adaptive.enabled=true")
spark.sql("set spark.sql.adaptive.coalescePartitions.initialPartitionNum=1000")
```

Before this PR:
```
scala> spark.sql("SELECT id from spark_31220 GROUP BY id").explain
== Physical Plan ==
AdaptiveSparkPlan(isFinalPlan=false)
+- HashAggregate(keys=[id#5], functions=[])
   +- Exchange hashpartitioning(id#5, 1000), true, [id=#171]
      +- HashAggregate(keys=[id#5], functions=[])
         +- FileScan parquet default.spark_31220[id#5]

scala> spark.sql("SELECT id from spark_31220 DISTRIBUTE BY id").explain
== Physical Plan ==
AdaptiveSparkPlan(isFinalPlan=false)
+- Exchange hashpartitioning(id#5, 200), false, [id=#179]
   +- FileScan parquet default.spark_31220[id#5]
```
After this PR:
```
scala> spark.sql("SELECT id from spark_31220 GROUP BY id").explain
== Physical Plan ==
AdaptiveSparkPlan(isFinalPlan=false)
+- HashAggregate(keys=[id#5], functions=[])
   +- Exchange hashpartitioning(id#5, 1000), true, [id=#171]
      +- HashAggregate(keys=[id#5], functions=[])
         +- FileScan parquet default.spark_31220[id#5]

scala> spark.sql("SELECT id from spark_31220 DISTRIBUTE BY id").explain
== Physical Plan ==
AdaptiveSparkPlan(isFinalPlan=false)
+- Exchange hashpartitioning(id#5, 1000), false, [id=#179]
   +- FileScan parquet default.spark_31220[id#5]
```

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Unit test.

Closes #27986 from wangyum/SPARK-31220.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-06-09 16:07:22 +00:00
Max Gekk de91915a24 [SPARK-31940][SQL][DOCS] Document the default JVM time zone in to/fromJavaDate and legacy date formatters
### What changes were proposed in this pull request?
Update comments for `DateTimeUtils`.`toJavaDate` and `fromJavaDate`, and for the legacy date formatters `LegacySimpleDateFormatter` and `LegacyFastDateFormatter`, regarding the default JVM time zone. The comments say that the default JVM time zone is used intentionally for backward compatibility with Spark 2.4 and earlier versions.

Closes #28709

### Why are the changes needed?
To document the current behaviour of the related methods in `DateTimeUtils` and the legacy date formatters. For example, the correctness of `HiveResult.hiveResultString` and `toHiveString` is directly related to the same time zone being used by `toJavaDate` and `LegacyFastDateFormatter`.
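
As a minimal illustration (plain JVM code, not Spark's internals), rendering the same instant as a `java.sql.Date` depends on the default JVM time zone:

```scala
import java.sql.Date
import java.util.TimeZone

// The printed local date shifts with the default JVM time zone.
TimeZone.setDefault(TimeZone.getTimeZone("America/Los_Angeles"))
println(new Date(0L)) // 1969-12-31 (a zone behind UTC)

TimeZone.setDefault(TimeZone.getTimeZone("UTC"))
println(new Date(0L)) // 1970-01-01
```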

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running the Scala style checker `./dev/scalastyle`

Closes #28767 from MaxGekk/doc-legacy-formatters.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-06-09 15:20:13 +00:00
lipzhu ca2cfd4185 [SPARK-31906][SQL][DOCS] Enhance comments in NamedExpression.qualifier
### Why are the changes needed?
The qualifier name should contain the catalog name.
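
For illustration only (the names below are made up), a fully qualified column reference with a catalog-aware qualifier would look like this:

```scala
// Hypothetical sketch: qualifier parts (catalog, database, table) plus the column name.
val qualifier = Seq("spark_catalog", "default", "spark_31906")
val column = "a"
println((qualifier :+ column).mkString(".")) // spark_catalog.default.spark_31906.a
```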

### Does this PR introduce _any_ user-facing change?
NO.

### How was this patch tested?
UT.

Closes #28726 from lipzhu/SPARK-31906.

Authored-by: lipzhu <lipzhu@ebay.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-06-09 13:59:00 +00:00
Kent Yao fc6af9d900 [SPARK-31867][SQL][FOLLOWUP] Check result differences for datetime formatting
### What changes were proposed in this pull request?

In this PR, we throw `SparkUpgradeException` when getting `DateTimeException` for datetime formatting in the `EXCEPTION` legacy Time Parser Policy.

### Why are the changes needed?
`DateTimeException` is also declared by `java.time.format.DateTimeFormatter#format`, but in Spark it can rarely occur. So far we have suspected only one case, which is due to a JDK bug; see https://bugs.openjdk.java.net/browse/JDK-8079628.

For the `from_unixtime` function, the `DateTimeException` caused by `DD` would otherwise be suppressed and `NULL` returned. That is a silent data change which should be avoided on Java 8.

### Does this PR introduce _any_ user-facing change?

Yes. When running on Java 8 and using the `from_unixtime` function with pattern `DD` to format datetimes, if the day-of-year is >= 100, a `SparkUpgradeException` will alert users instead of silently returning null. For `date_format`, `SparkUpgradeException` takes the place of `DateTimeException`.
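
A hedged sketch of the new behavior (assuming Java 8 and the default EXCEPTION time parser policy; the epoch second below falls on a day-of-year >= 100):

```scala
spark.sql("SET spark.sql.legacy.timeParserPolicy=exception")
// Previously the DateTimeException from the JDK formatter was swallowed and
// NULL was returned; with this change a SparkUpgradeException is raised instead.
spark.sql("SELECT from_unixtime(1590624000, 'DD')").show()
```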

### How was this patch tested?

add unit tests.

Closes #28736 from yaooqinn/SPARK-31867-F.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-06-05 16:44:16 +00:00
Kent Yao 9d5b5d0a58 [SPARK-31879][SQL][TEST-JAVA11] Make week-based pattern invalid for formatting too
### What changes were proposed in this pull request?

After all of these attempts (https://github.com/apache/spark/pull/28692, https://github.com/apache/spark/pull/28719 and https://github.com/apache/spark/pull/28727), each still has limitations, as mentioned in their discussions.

Maybe the only way is to forbid them all.

### Why are the changes needed?

These week-based fields need Locale to express their semantics, the first day of the week varies from country to country.

From the Java doc of WeekFields
```java
    /**
     * Gets the first day-of-week.
     * <p>
     * The first day-of-week varies by culture.
     * For example, the US uses Sunday, while France and the ISO-8601 standard use Monday.
     * This method returns the first day using the standard {@code DayOfWeek} enum.
     *
     * @return the first day-of-week, not null
     */
    public DayOfWeek getFirstDayOfWeek() {
        return firstDayOfWeek;
    }
```
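
A quick check with `java.time` (a minimal sketch) confirms the locale difference described in the doc above:

```scala
import java.time.temporal.WeekFields
import java.util.Locale

println(WeekFields.of(Locale.US).getFirstDayOfWeek) // SUNDAY
println(WeekFields.of(Locale.UK).getFirstDayOfWeek) // MONDAY
```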

But for the SimpleDateFormat, the day-of-week is not localized

```
u	Day number of week (1 = Monday, ..., 7 = Sunday)	Number	1
```

Currently, the default locale we use is the US, so the result may move a day, a week, or even a year backward.

e.g.

For the date `2019-12-29` (a Sunday): in a Sunday-start system (e.g. en-US) it belongs to week-based year 2020, while in a Monday-start system (en-GB) it belongs to 2019. The week-of-week-based-year (`w`) field is affected as well.

```sql
spark-sql> SELECT to_csv(named_struct('time', to_timestamp('2019-12-29', 'yyyy-MM-dd')), map('timestampFormat', 'YYYY', 'locale', 'en-US'));
2020
spark-sql> SELECT to_csv(named_struct('time', to_timestamp('2019-12-29', 'yyyy-MM-dd')), map('timestampFormat', 'YYYY', 'locale', 'en-GB'));
2019

spark-sql> SELECT to_csv(named_struct('time', to_timestamp('2019-12-29', 'yyyy-MM-dd')), map('timestampFormat', 'YYYY-ww-uu', 'locale', 'en-US'));
2020-01-01
spark-sql> SELECT to_csv(named_struct('time', to_timestamp('2019-12-29', 'yyyy-MM-dd')), map('timestampFormat', 'YYYY-ww-uu', 'locale', 'en-GB'));
2019-52-07

spark-sql> SELECT to_csv(named_struct('time', to_timestamp('2020-01-05', 'yyyy-MM-dd')), map('timestampFormat', 'YYYY-ww-uu', 'locale', 'en-US'));
2020-02-01
spark-sql> SELECT to_csv(named_struct('time', to_timestamp('2020-01-05', 'yyyy-MM-dd')), map('timestampFormat', 'YYYY-ww-uu', 'locale', 'en-GB'));
2020-01-07
```

For other countries, please refer to [First Day of the Week in Different Countries](http://chartsbin.com/view/41671)

### Does this PR introduce _any_ user-facing change?
With this change, users can no longer use 'Y', 'w', 'u' or 'W' for formatting, but can use 'e' instead of 'u'. This at least means the issue is no longer a silent data change.

### How was this patch tested?

add unit tests

Closes #28728 from yaooqinn/SPARK-31879-NEW2.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-06-05 08:14:01 +00:00
Wenchen Fan dc0709fa0c [SPARK-29947][SQL][FOLLOWUP] ResolveRelations should return relations with fresh attribute IDs
### What changes were proposed in this pull request?

This is a followup of https://github.com/apache/spark/pull/26589, which caches the table relations to speed up the table lookup. However, it brings some side effects: the rule `ResolveRelations` may return exactly the same relations, while before it always returns relations with fresh attribute IDs.

This PR is to eliminate this side effect.

### Why are the changes needed?

There is no bug report yet, but this side effect may impact things like self-join. It's better to restore the 2.4 behavior and always return fresh relations.
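
For example (a hedged sketch with a hypothetical table `t`), a plain self-join only resolves cleanly when each side of the join gets its own attribute IDs:

```scala
// Both sides scan the same table; fresh ExprIds keep a.id and b.id distinct.
spark.sql("""
  SELECT a.id, b.id
  FROM t AS a JOIN t AS b ON a.id = b.id
""").explain()
```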

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

N/A

Closes #28717 from cloud-fan/fix.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-06-03 19:08:36 +00:00
Wenchen Fan e61d0de11f Revert "[SPARK-31879][SQL] Using GB as default Locale for datetime formatters"
This reverts commit c59f51bcc2.
2020-06-04 01:54:22 +08:00
Kent Yao afcc14c6d2 [SPARK-31896][SQL] Handle am-pm timestamp parsing when hour is missing
### What changes were proposed in this pull request?

This PR sets the hour to 12 (PM) or 0 (AM) when the AMPM_OF_DAY field exists but the hour is missing.
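
A simplified sketch of that rule (not the actual parser code): when only the am/pm field is present, the hour defaults to 12 for PM and 0 for AM.

```scala
// Hypothetical helper illustrating the default-hour rule.
def hourFor(ampm: String, hourOfAmPm: Option[Int]): Int = {
  val base = if (ampm.equalsIgnoreCase("PM")) 12 else 0
  base + hourOfAmPm.getOrElse(0)
}

println(hourFor("PM", None)) // 12
println(hourFor("AM", None)) // 0
```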

### Why are the changes needed?

When the hour is absent but the am-pm is present, the time is incorrect for pm

### Does this PR introduce _any_ user-facing change?
Yes, the change is user-facing, but it switches back to the 2.4 behavior to keep backward compatibility.

e.g.
```sql
spark-sql> select to_timestamp('33:33 PM', 'mm:ss a');
1970-01-01 12:33:33
spark-sql> select to_timestamp('33:33 AM', 'mm:ss a');
1970-01-01 00:33:33

```

otherwise, the results are all `1970-01-01 00:33:33`

### How was this patch tested?

add unit tests

Closes #28713 from yaooqinn/SPARK-31896.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-06-03 13:30:22 +00:00
Kent Yao afe95bd9ad [SPARK-31892][SQL] Disable week-based date filed for parsing
### What changes were proposed in this pull request?

This PR disables week-based date fields for parsing.

closes #28674
### Why are the changes needed?

1. It's an unfixable behavior change to fill the gap between SimpleDateFormat and DateTimeFormatter while keeping backward compatibility across different JDKs. A lot of effort has been made to prove this at https://github.com/apache/spark/pull/28674

2. The existing behavior itself in 2.4 is confusing, e.g.

```sql
spark-sql> select to_timestamp('1', 'w');
1969-12-28 00:00:00
spark-sql> select to_timestamp('1', 'u');
1970-01-05 00:00:00
```
  The 'u' here seems to go neither to the Monday of the first week in the week-based form, nor to the first day of the year in the non-week-based form, but to the Monday of the second week in the week-based form.

And, e.g.
```sql
spark-sql> select to_timestamp('2020 2020', 'YYYY yyyy');
2020-01-01 00:00:00
spark-sql> select to_timestamp('2020 2020', 'yyyy YYYY');
2019-12-29 00:00:00
spark-sql> select to_timestamp('2020 2020 1', 'YYYY yyyy w');
NULL
spark-sql> select to_timestamp('2020 2020 1', 'yyyy YYYY w');
2019-12-29 00:00:00
```

  I think we don't need to introduce all the weird behavior from Java

3. The current test coverage for week-based date fields is almost 0%, which indicates that we've never imagined using it.

4. Avoiding JDK bugs

https://issues.apache.org/jira/browse/SPARK-31880

### Does this PR introduce _any_ user-facing change?

Yes, the 'Y/W/w/u/F/E' pattern letters can no longer be used in datetime parsing functions.

### How was this patch tested?

more tests added

Closes #28706 from yaooqinn/SPARK-31892.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-06-03 06:49:58 +00:00
Kent Yao c59f51bcc2 [SPARK-31879][SQL] Using GB as default Locale for datetime formatters
### What changes were proposed in this pull request?

This PR switches the default Locale from the `US` to `GB` to change the behavior of the first day of the week from Sunday-start to Monday-start, the same as in v2.4.

### Why are the changes needed?

#### cases
```sql
spark-sql> select to_timestamp('2020-1-1', 'YYYY-w-u');
2019-12-29 00:00:00
spark-sql> set spark.sql.legacy.timeParserPolicy=legacy;
spark.sql.legacy.timeParserPolicy	legacy
spark-sql> select to_timestamp('2020-1-1', 'YYYY-w-u');
2019-12-30 00:00:00
```

#### reasons

These week-based fields need Locale to express their semantics, the first day of the week varies from country to country.

From the Java doc of WeekFields
```java
    /**
     * Gets the first day-of-week.
     * <p>
     * The first day-of-week varies by culture.
     * For example, the US uses Sunday, while France and the ISO-8601 standard use Monday.
     * This method returns the first day using the standard {@code DayOfWeek} enum.
     *
     * @return the first day-of-week, not null
     */
    public DayOfWeek getFirstDayOfWeek() {
        return firstDayOfWeek;
    }
```

But for the SimpleDateFormat, the day-of-week is not localized

```
u	Day number of week (1 = Monday, ..., 7 = Sunday)	Number	1
```

Currently, the default locale we use is the US, so the result moved a day backward.

For other countries, please refer to [First Day of the Week in Different Countries](http://chartsbin.com/view/41671)

With this change, the first-day-of-week calculation for these functions is restored when using the default locale.

### Does this PR introduce _any_ user-facing change?

Yes, but the behavior change restores the old behavior of v2.4.

### How was this patch tested?

add unit tests

Closes #28692 from yaooqinn/SPARK-31879.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-06-03 06:07:53 +00:00
lipzhu d79a8a88b1 [SPARK-31834][SQL] Improve error message for incompatible data types
### What changes were proposed in this pull request?
We should use dataType.catalogString to unify the data type mismatch message.
Before:
```sql
spark-sql> create table SPARK_31834(a int) using parquet;
spark-sql> insert into SPARK_31834 select '1';
Error in query: Cannot write incompatible data to table '`default`.`spark_31834`':
- Cannot safely cast 'a': StringType to IntegerType;
```

After:
```sql
spark-sql> create table SPARK_31834(a int) using parquet;
spark-sql> insert into SPARK_31834 select '1';
Error in query: Cannot write incompatible data to table '`default`.`spark_31834`':
- Cannot safely cast 'a': string to int;
```
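
The new message relies on `DataType.catalogString`, which yields the user-facing SQL type names, e.g. (a quick check in spark-shell):

```scala
import org.apache.spark.sql.types.{IntegerType, StringType}

println(StringType.catalogString)  // string
println(IntegerType.catalogString) // int
```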

### How was this patch tested?
UT.

Closes #28654 from lipzhu/SPARK-31834.

Authored-by: lipzhu <lipzhu@ebay.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-06-02 21:07:10 +09:00
Sunitha Kambhampati 4161c62429 [SPARK-28067][SQL] Fix incorrect results for decimal aggregate sum by returning null on decimal overflow
### What changes were proposed in this pull request?

JIRA SPARK-28067:  Wrong results are returned for aggregate sum with decimals with whole stage codegen enabled

**Repro:**
WholeStage codegen enabled -> Wrong results
WholeStage disabled -> Returns exception Decimal precision 39 exceeds max precision 38

**Issues:**
1. Wrong results are returned which is bad
2. Inconsistency between whole stage enabled and disabled.

**Cause:**
Sum does not take care of the possibility of overflow in the intermediate steps, i.e. the updateExpressions and mergeExpressions.
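
A simplified, standalone sketch (plain `java.math.BigDecimal`, not the actual Sum implementation) of why the intermediate sum overflows `decimal(38,18)`:

```scala
import java.math.{BigDecimal => JBigDecimal}

val maxPrecision = 38
val row = new JBigDecimal("10000000000000000000.000000000000000000") // fits decimal(38,18)
val sum = row.multiply(new JBigDecimal(10)) // e.g. ten such rows summed
println(sum.precision)                  // 39, which no longer fits
println(sum.precision <= maxPrecision)  // false: return null (non-ANSI) or throw (ANSI)
```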

This PR makes the following changes:
- Add changes to check if overflow occurs for decimal in aggregate Sum and if there is an overflow,  it will return null for the Sum operation when spark.sql.ansi.enabled is false.
- When spark.sql.ansi.enabled is true, then the sum operation will return an exception if an overflow occurs for the decimal operation in Sum.
- This is keeping it consistent with the behavior defined in spark.sql.ansi.enabled property

**Before the fix:  Scenario 1:** - WRONG RESULTS
```
scala> val df = Seq(
     |  (BigDecimal("10000000000000000000"), 1),
     |  (BigDecimal("10000000000000000000"), 1),
     |  (BigDecimal("10000000000000000000"), 2),
     |  (BigDecimal("10000000000000000000"), 2),
     |  (BigDecimal("10000000000000000000"), 2),
     |  (BigDecimal("10000000000000000000"), 2),
     |  (BigDecimal("10000000000000000000"), 2),
     |  (BigDecimal("10000000000000000000"), 2),
     |  (BigDecimal("10000000000000000000"), 2),
     |  (BigDecimal("10000000000000000000"), 2),
     |  (BigDecimal("10000000000000000000"), 2),
     |  (BigDecimal("10000000000000000000"), 2)).toDF("decNum", "intNum")
df: org.apache.spark.sql.DataFrame = [decNum: decimal(38,18), intNum: int]

scala> val df2 = df.withColumnRenamed("decNum", "decNum2").join(df, "intNum").agg(sum("decNum"))
df2: org.apache.spark.sql.DataFrame = [sum(decNum): decimal(38,18)]

scala> df2.show(40,false)
+---------------------------------------+
|sum(decNum)                            |
+---------------------------------------+
|20000000000000000000.000000000000000000|
+---------------------------------------+
```

--
**Before fix: Scenario2:  Setting spark.sql.ansi.enabled to true** - WRONG RESULTS
```
scala> spark.conf.set("spark.sql.ansi.enabled", "true")

scala> val df = Seq(
     |  (BigDecimal("10000000000000000000"), 1),
     |  (BigDecimal("10000000000000000000"), 1),
     |  (BigDecimal("10000000000000000000"), 2),
     |  (BigDecimal("10000000000000000000"), 2),
     |  (BigDecimal("10000000000000000000"), 2),
     |  (BigDecimal("10000000000000000000"), 2),
     |  (BigDecimal("10000000000000000000"), 2),
     |  (BigDecimal("10000000000000000000"), 2),
     |  (BigDecimal("10000000000000000000"), 2),
     |  (BigDecimal("10000000000000000000"), 2),
     |  (BigDecimal("10000000000000000000"), 2),
     |  (BigDecimal("10000000000000000000"), 2)).toDF("decNum", "intNum")
df: org.apache.spark.sql.DataFrame = [decNum: decimal(38,18), intNum: int]

scala> val df2 = df.withColumnRenamed("decNum", "decNum2").join(df, "intNum").agg(sum("decNum"))
df2: org.apache.spark.sql.DataFrame = [sum(decNum): decimal(38,18)]

scala> df2.show(40,false)
+---------------------------------------+
|sum(decNum)                            |
+---------------------------------------+
|20000000000000000000.000000000000000000|
+---------------------------------------+

```

**After the fix: Scenario1:**
```
scala> val df = Seq(
     |  (BigDecimal("10000000000000000000"), 1),
     |  (BigDecimal("10000000000000000000"), 1),
     |  (BigDecimal("10000000000000000000"), 2),
     |  (BigDecimal("10000000000000000000"), 2),
     |  (BigDecimal("10000000000000000000"), 2),
     |  (BigDecimal("10000000000000000000"), 2),
     |  (BigDecimal("10000000000000000000"), 2),
     |  (BigDecimal("10000000000000000000"), 2),
     |  (BigDecimal("10000000000000000000"), 2),
     |  (BigDecimal("10000000000000000000"), 2),
     |  (BigDecimal("10000000000000000000"), 2),
     |  (BigDecimal("10000000000000000000"), 2)).toDF("decNum", "intNum")
df: org.apache.spark.sql.DataFrame = [decNum: decimal(38,18), intNum: int]

scala> val df2 = df.withColumnRenamed("decNum", "decNum2").join(df, "intNum").agg(sum("decNum"))
df2: org.apache.spark.sql.DataFrame = [sum(decNum): decimal(38,18)]

scala>  df2.show(40,false)
+-----------+
|sum(decNum)|
+-----------+
|null       |
+-----------+

```

**After fix:  Scenario2:  Setting the spark.sql.ansi.enabled to true:**
```
scala> spark.conf.set("spark.sql.ansi.enabled", "true")

scala> val df = Seq(
     |  (BigDecimal("10000000000000000000"), 1),
     |  (BigDecimal("10000000000000000000"), 1),
     |  (BigDecimal("10000000000000000000"), 2),
     |  (BigDecimal("10000000000000000000"), 2),
     |  (BigDecimal("10000000000000000000"), 2),
     |  (BigDecimal("10000000000000000000"), 2),
     |  (BigDecimal("10000000000000000000"), 2),
     |  (BigDecimal("10000000000000000000"), 2),
     |  (BigDecimal("10000000000000000000"), 2),
     |  (BigDecimal("10000000000000000000"), 2),
     |  (BigDecimal("10000000000000000000"), 2),
     |  (BigDecimal("10000000000000000000"), 2)).toDF("decNum", "intNum")
df: org.apache.spark.sql.DataFrame = [decNum: decimal(38,18), intNum: int]

scala> val df2 = df.withColumnRenamed("decNum", "decNum2").join(df, "intNum").agg(sum("decNum"))
df2: org.apache.spark.sql.DataFrame = [sum(decNum): decimal(38,18)]

scala>  df2.show(40,false)
20/02/18 10:57:43 ERROR Executor: Exception in task 5.0 in stage 4.0 (TID 30)
java.lang.ArithmeticException: Decimal(expanded,100000000000000000000.000000000000000000,39,18}) cannot be represented as Decimal(38, 18).

```

### Why are the changes needed?
The changes are needed in order to fix the wrong results that are returned for decimal aggregate sum.

### Does this PR introduce any user-facing change?
Prior to this change, users would see wrong results for an aggregate sum that involved decimal overflow; now they will see null. But if the user sets the spark.sql.ansi.enabled flag to true, they will see an exception rather than incorrect results.

### How was this patch tested?
New test has been added and existing tests for sql, catalyst and hive suites were run ok.

Closes #27627 from skambha/decaggfixwrongresults.

Lead-authored-by: Sunitha Kambhampati <skambha@us.ibm.com>
Co-authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-06-02 11:30:30 +00:00
HyukjinKwon e69466056f [SPARK-31849][PYTHON][SQL] Make PySpark SQL exceptions more Pythonic
### What changes were proposed in this pull request?

This PR proposes to make PySpark exception more Pythonic by hiding JVM stacktrace by default. It can be enabled by turning on `spark.sql.pyspark.jvmStacktrace.enabled` configuration.

```
Traceback (most recent call last):
  ...
pyspark.sql.utils.PythonException:
  An exception was thrown from Python worker in the executor. The below is the Python worker stacktrace.
Traceback (most recent call last):
  ...
```

If this `spark.sql.pyspark.jvmStacktrace.enabled` is enabled, it appends:

```
JVM stacktrace:
org.apache.spark.Exception: ...
  ...
```

For example, the code below:

```python
from pyspark.sql.functions import udf
@udf
def divide_by_zero(v):
    raise v / 0

spark.range(1).select(divide_by_zero("id")).show()
```

will show an error message that looks like a Python exception thrown locally.

<details>
<summary>Python exception message when <code>spark.sql.pyspark.jvmStacktrace.enabled</code> is off (default)</summary>

```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../spark/python/pyspark/sql/dataframe.py", line 427, in show
    print(self._jdf.showString(n, 20, vertical))
  File "/.../spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
  File "/.../spark/python/pyspark/sql/utils.py", line 131, in deco
    raise_from(converted)
  File "<string>", line 3, in raise_from
pyspark.sql.utils.PythonException:
  An exception was thrown from Python worker in the executor. The below is the Python worker stacktrace.
Traceback (most recent call last):
  File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 605, in main
    process()
  File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 597, in process
    serializer.dump_stream(out_iter, outfile)
  File "/.../spark/python/lib/pyspark.zip/pyspark/serializers.py", line 223, in dump_stream
    self.serializer.dump_stream(self._batched(iterator), stream)
  File "/.../spark/python/lib/pyspark.zip/pyspark/serializers.py", line 141, in dump_stream
    for obj in iterator:
  File "/.../spark/python/lib/pyspark.zip/pyspark/serializers.py", line 212, in _batched
    for item in iterator:
  File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 450, in mapper
    result = tuple(f(*[a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs)
  File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 450, in <genexpr>
    result = tuple(f(*[a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs)
  File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 90, in <lambda>
    return lambda *a: f(*a)
  File "/.../spark/python/lib/pyspark.zip/pyspark/util.py", line 107, in wrapper
    return f(*args, **kwargs)
  File "<stdin>", line 3, in divide_by_zero
ZeroDivisionError: division by zero
```

</details>

<details>
<summary>Python exception message when <code>spark.sql.pyspark.jvmStacktrace.enabled</code> is on</summary>

```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../spark/python/pyspark/sql/dataframe.py", line 427, in show
    print(self._jdf.showString(n, 20, vertical))
  File "/.../spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
  File "/.../spark/python/pyspark/sql/utils.py", line 137, in deco
    raise_from(converted)
  File "<string>", line 3, in raise_from
pyspark.sql.utils.PythonException:
  An exception was thrown from Python worker in the executor. The below is the Python worker stacktrace.
Traceback (most recent call last):
  File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 605, in main
    process()
  File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 597, in process
    serializer.dump_stream(out_iter, outfile)
  File "/.../spark/python/lib/pyspark.zip/pyspark/serializers.py", line 223, in dump_stream
    self.serializer.dump_stream(self._batched(iterator), stream)
  File "/.../spark/python/lib/pyspark.zip/pyspark/serializers.py", line 141, in dump_stream
    for obj in iterator:
  File "/.../spark/python/lib/pyspark.zip/pyspark/serializers.py", line 212, in _batched
    for item in iterator:
  File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 450, in mapper
    result = tuple(f(*[a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs)
  File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 450, in <genexpr>
    result = tuple(f(*[a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs)
  File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 90, in <lambda>
    return lambda *a: f(*a)
  File "/.../spark/python/lib/pyspark.zip/pyspark/util.py", line 107, in wrapper
    return f(*args, **kwargs)
  File "<stdin>", line 3, in divide_by_zero
ZeroDivisionError: division by zero

JVM stacktrace:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 4, 192.168.35.193, executor 0): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 605, in main
    process()
  File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 597, in process
    serializer.dump_stream(out_iter, outfile)
  File "/.../spark/python/lib/pyspark.zip/pyspark/serializers.py", line 223, in dump_stream
    self.serializer.dump_stream(self._batched(iterator), stream)
  File "/.../spark/python/lib/pyspark.zip/pyspark/serializers.py", line 141, in dump_stream
    for obj in iterator:
  File "/.../spark/python/lib/pyspark.zip/pyspark/serializers.py", line 212, in _batched
    for item in iterator:
  File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 450, in mapper
    result = tuple(f(*[a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs)
  File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 450, in <genexpr>
    result = tuple(f(*[a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs)
  File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 90, in <lambda>
    return lambda *a: f(*a)
  File "/.../spark/python/lib/pyspark.zip/pyspark/util.py", line 107, in wrapper
    return f(*args, **kwargs)
  File "<stdin>", line 3, in divide_by_zero
ZeroDivisionError: division by zero

	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:516)
	at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$2.read(PythonUDFRunner.scala:81)
	at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$2.read(PythonUDFRunner.scala:64)
	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:469)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:489)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:753)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:340)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:127)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:469)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:472)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2117)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2066)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2065)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2065)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1021)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1021)
	at scala.Option.foreach(Option.scala:407)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1021)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2297)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2246)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2235)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:823)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2108)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2129)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2148)
	at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:467)
	at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:420)
	at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:47)
	at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3653)
	at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2695)
	at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3644)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:763)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
	at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3642)
	at org.apache.spark.sql.Dataset.head(Dataset.scala:2695)
	at org.apache.spark.sql.Dataset.take(Dataset.scala:2902)
	at org.apache.spark.sql.Dataset.getRows(Dataset.scala:300)
	at org.apache.spark.sql.Dataset.showString(Dataset.scala:337)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 605, in main
    process()
  File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 597, in process
    serializer.dump_stream(out_iter, outfile)
  File "/.../spark/python/lib/pyspark.zip/pyspark/serializers.py", line 223, in dump_stream
    self.serializer.dump_stream(self._batched(iterator), stream)
  File "/.../spark/python/lib/pyspark.zip/pyspark/serializers.py", line 141, in dump_stream
    for obj in iterator:
  File "/.../spark/python/lib/pyspark.zip/pyspark/serializers.py", line 212, in _batched
    for item in iterator:
  File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 450, in mapper
    result = tuple(f(*[a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs)
  File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 450, in <genexpr>
    result = tuple(f(*[a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs)
  File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 90, in <lambda>
    return lambda *a: f(*a)
  File "/.../spark/python/lib/pyspark.zip/pyspark/util.py", line 107, in wrapper
    return f(*args, **kwargs)
  File "<stdin>", line 3, in divide_by_zero
ZeroDivisionError: division by zero

	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:516)
	at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$2.read(PythonUDFRunner.scala:81)
	at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$2.read(PythonUDFRunner.scala:64)
	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:469)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:489)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:753)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:340)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:127)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:469)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:472)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	... 1 more
```

</details>

<details>
<summary>Python exception message without this change</summary>

```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../spark/python/pyspark/sql/dataframe.py", line 427, in show
    print(self._jdf.showString(n, 20, vertical))
  File "/.../spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
  File "/.../spark/python/pyspark/sql/utils.py", line 98, in deco
    return f(*a, **kw)
  File "/.../spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o160.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 10 in stage 5.0 failed 4 times, most recent failure: Lost task 10.3 in stage 5.0 (TID 37, 192.168.35.193, executor 3): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 605, in main
    process()
  File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 597, in process
    serializer.dump_stream(out_iter, outfile)
  File "/.../spark/python/lib/pyspark.zip/pyspark/serializers.py", line 223, in dump_stream
    self.serializer.dump_stream(self._batched(iterator), stream)
  File "/.../spark/python/lib/pyspark.zip/pyspark/serializers.py", line 141, in dump_stream
    for obj in iterator:
  File "/.../spark/python/lib/pyspark.zip/pyspark/serializers.py", line 212, in _batched
    for item in iterator:
  File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 450, in mapper
    result = tuple(f(*[a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs)
  File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 450, in <genexpr>
    result = tuple(f(*[a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs)
  File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 90, in <lambda>
    return lambda *a: f(*a)
  File "/.../spark/python/lib/pyspark.zip/pyspark/util.py", line 107, in wrapper
    return f(*args, **kwargs)
  File "<stdin>", line 3, in divide_by_zero
ZeroDivisionError: division by zero

	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:516)
	at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$2.read(PythonUDFRunner.scala:81)
	at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$2.read(PythonUDFRunner.scala:64)
	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:469)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:489)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:753)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:340)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:127)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:469)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:472)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2117)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2066)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2065)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2065)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1021)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1021)
	at scala.Option.foreach(Option.scala:407)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1021)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2297)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2246)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2235)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:823)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2108)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2129)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2148)
	at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:467)
	at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:420)
	at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:47)
	at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3653)
	at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2695)
	at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3644)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:763)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
	at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3642)
	at org.apache.spark.sql.Dataset.head(Dataset.scala:2695)
	at org.apache.spark.sql.Dataset.take(Dataset.scala:2902)
	at org.apache.spark.sql.Dataset.getRows(Dataset.scala:300)
	at org.apache.spark.sql.Dataset.showString(Dataset.scala:337)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 605, in main
    process()
  File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 597, in process
    serializer.dump_stream(out_iter, outfile)
  File "/.../spark/python/lib/pyspark.zip/pyspark/serializers.py", line 223, in dump_stream
    self.serializer.dump_stream(self._batched(iterator), stream)
  File "/.../spark/python/lib/pyspark.zip/pyspark/serializers.py", line 141, in dump_stream
    for obj in iterator:
  File "/.../spark/python/lib/pyspark.zip/pyspark/serializers.py", line 212, in _batched
    for item in iterator:
  File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 450, in mapper
    result = tuple(f(*[a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs)
  File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 450, in <genexpr>
    result = tuple(f(*[a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs)
  File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 90, in <lambda>
    return lambda *a: f(*a)
  File "/.../spark/python/lib/pyspark.zip/pyspark/util.py", line 107, in wrapper
    return f(*args, **kwargs)
  File "<stdin>", line 3, in divide_by_zero
ZeroDivisionError: division by zero

	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:516)
	at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$2.read(PythonUDFRunner.scala:81)
	at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$2.read(PythonUDFRunner.scala:64)
	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:469)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:489)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:753)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:340)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:127)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:469)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:472)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	... 1 more
```

</details>

<br/>

Another example with Python 3.7:

```python
sql("a")
```

<details>
<summary>Python exception message when <code>spark.sql.pyspark.jvmStacktrace.enabled</code> is off (default)</summary>

```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../spark/python/pyspark/sql/session.py", line 646, in sql
    return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
  File "/.../spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
  File "/.../spark/python/pyspark/sql/utils.py", line 131, in deco
    raise_from(converted)
  File "<string>", line 3, in raise_from
pyspark.sql.utils.ParseException:
mismatched input 'a' expecting {'(', 'ADD', 'ALTER', 'ANALYZE', 'CACHE', 'CLEAR', 'COMMENT', 'COMMIT', 'CREATE', 'DELETE', 'DESC', 'DESCRIBE', 'DFS', 'DROP', 'EXPLAIN', 'EXPORT', 'FROM', 'GRANT', 'IMPORT', 'INSERT', 'LIST', 'LOAD', 'LOCK', 'MAP', 'MERGE', 'MSCK', 'REDUCE', 'REFRESH', 'REPLACE', 'RESET', 'REVOKE', 'ROLLBACK', 'SELECT', 'SET', 'SHOW', 'START', 'TABLE', 'TRUNCATE', 'UNCACHE', 'UNLOCK', 'UPDATE', 'USE', 'VALUES', 'WITH'}(line 1, pos 0)

== SQL ==
a
^^^
```

</details>

<details>
<summary>Python exception message when <code>spark.sql.pyspark.jvmStacktrace.enabled</code> is on</summary>

```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../spark/python/pyspark/sql/session.py", line 646, in sql
    return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
  File "/.../spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
  File "/.../spark/python/pyspark/sql/utils.py", line 131, in deco
    raise_from(converted)
  File "<string>", line 3, in raise_from
pyspark.sql.utils.ParseException:
mismatched input 'a' expecting {'(', 'ADD', 'ALTER', 'ANALYZE', 'CACHE', 'CLEAR', 'COMMENT', 'COMMIT', 'CREATE', 'DELETE', 'DESC', 'DESCRIBE', 'DFS', 'DROP', 'EXPLAIN', 'EXPORT', 'FROM', 'GRANT', 'IMPORT', 'INSERT', 'LIST', 'LOAD', 'LOCK', 'MAP', 'MERGE', 'MSCK', 'REDUCE', 'REFRESH', 'REPLACE', 'RESET', 'REVOKE', 'ROLLBACK', 'SELECT', 'SET', 'SHOW', 'START', 'TABLE', 'TRUNCATE', 'UNCACHE', 'UNLOCK', 'UPDATE', 'USE', 'VALUES', 'WITH'}(line 1, pos 0)

== SQL ==
a
^^^

JVM stacktrace:
org.apache.spark.sql.catalyst.parser.ParseException:
mismatched input 'a' expecting {'(', 'ADD', 'ALTER', 'ANALYZE', 'CACHE', 'CLEAR', 'COMMENT', 'COMMIT', 'CREATE', 'DELETE', 'DESC', 'DESCRIBE', 'DFS', 'DROP', 'EXPLAIN', 'EXPORT', 'FROM', 'GRANT', 'IMPORT', 'INSERT', 'LIST', 'LOAD', 'LOCK', 'MAP', 'MERGE', 'MSCK', 'REDUCE', 'REFRESH', 'REPLACE', 'RESET', 'REVOKE', 'ROLLBACK', 'SELECT', 'SET', 'SHOW', 'START', 'TABLE', 'TRUNCATE', 'UNCACHE', 'UNLOCK', 'UPDATE', 'USE', 'VALUES', 'WITH'}(line 1, pos 0)

== SQL ==
a
^^^

	at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:266)
	at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:133)
	at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:49)
	at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:81)
	at org.apache.spark.sql.SparkSession.$anonfun$sql$2(SparkSession.scala:604)
	at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111)
	at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:604)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:763)
	at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:601)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)
```

</details>

<details>
<summary>Python exception message without this change</summary>

```
Traceback (most recent call last):
  File "/.../spark/python/pyspark/sql/utils.py", line 98, in deco
    return f(*a, **kw)
  File "/.../spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o26.sql.
: org.apache.spark.sql.catalyst.parser.ParseException:
mismatched input 'a' expecting {'(', 'ADD', 'ALTER', 'ANALYZE', 'CACHE', 'CLEAR', 'COMMENT', 'COMMIT', 'CREATE', 'DELETE', 'DESC', 'DESCRIBE', 'DFS', 'DROP', 'EXPLAIN', 'EXPORT', 'FROM', 'GRANT', 'IMPORT', 'INSERT', 'LIST', 'LOAD', 'LOCK', 'MAP', 'MERGE', 'MSCK', 'REDUCE', 'REFRESH', 'REPLACE', 'RESET', 'REVOKE', 'ROLLBACK', 'SELECT', 'SET', 'SHOW', 'START', 'TABLE', 'TRUNCATE', 'UNCACHE', 'UNLOCK', 'UPDATE', 'USE', 'VALUES', 'WITH'}(line 1, pos 0)

== SQL ==
a
^^^

	at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:266)
	at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:133)
	at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:49)
	at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:81)
	at org.apache.spark.sql.SparkSession.$anonfun$sql$2(SparkSession.scala:604)
	at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111)
	at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:604)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:763)
	at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:601)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../spark/python/pyspark/sql/session.py", line 646, in sql
    return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
  File "/.../spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
  File "/.../spark/python/pyspark/sql/utils.py", line 102, in deco
    raise converted
pyspark.sql.utils.ParseException:
mismatched input 'a' expecting {'(', 'ADD', 'ALTER', 'ANALYZE', 'CACHE', 'CLEAR', 'COMMENT', 'COMMIT', 'CREATE', 'DELETE', 'DESC', 'DESCRIBE', 'DFS', 'DROP', 'EXPLAIN', 'EXPORT', 'FROM', 'GRANT', 'IMPORT', 'INSERT', 'LIST', 'LOAD', 'LOCK', 'MAP', 'MERGE', 'MSCK', 'REDUCE', 'REFRESH', 'REPLACE', 'RESET', 'REVOKE', 'ROLLBACK', 'SELECT', 'SET', 'SHOW', 'START', 'TABLE', 'TRUNCATE', 'UNCACHE', 'UNLOCK', 'UPDATE', 'USE', 'VALUES', 'WITH'}(line 1, pos 0)

== SQL ==
a
^^^
```

</details>

### Why are the changes needed?

Currently, PySpark exceptions are very unfriendly to Python users with causing a bunch of JVM stacktrace. See "Python exception message without this change" above.

### Does this PR introduce _any_ user-facing change?

Yes, it will change the exception message. See the examples above.

### How was this patch tested?

Manually tested by

```bash
./bin/pyspark --conf spark.sql.pyspark.jvmStacktrace.enabled=true
```

and running the examples above.

Closes #28661 from HyukjinKwon/python-debug.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-06-01 09:45:21 +09:00
Max Gekk 47dc332258 [SPARK-31874][SQL] Use FastDateFormat as the legacy fractional formatter
### What changes were proposed in this pull request?
1. Replace `SimpleDateFormat` by `FastDateFormat` as the legacy formatter of `FractionTimestampFormatter`.
2. Optimise `LegacyFastTimestampFormatter` for `java.sql.Timestamp` w/o fractional part.

### Why are the changes needed?
1. By default `HiveResult`.`hiveResultString` retrieves timestamp values as instances of `java.sql.Timestamp`, and uses the legacy parser `SimpleDateFormat` to convert the timestamps to strings. After the fix https://github.com/apache/spark/pull/28024, the fractional formatter and its companion, the legacy formatter `SimpleDateFormat`, are created for every value. By switching from `LegacySimpleTimestampFormatter` to `LegacyFastTimestampFormatter`, we can utilize the internal cache of `FastDateFormat` and avoid parsing the default pattern `yyyy-MM-dd HH:mm:ss` (see the sketch after this list).
2. The second change in the method `def format(ts: Timestamp): String` of `LegacyFastTimestampFormatter` is needed to optimize the formatter for patterns without the fractional part and avoid conversions to microseconds.
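
Regarding point 1, a minimal sketch of the caching that `FastDateFormat` provides (commons-lang3 API; the `eq` result reflects the expected cache behavior, not something Spark relies on explicitly):

```scala
import org.apache.commons.lang3.time.FastDateFormat

val f1 = FastDateFormat.getInstance("yyyy-MM-dd HH:mm:ss")
val f2 = FastDateFormat.getInstance("yyyy-MM-dd HH:mm:ss")
println(f1 eq f2) // expected true: getInstance returns a cached instance
println(f1.format(new java.util.Date(0L)))
```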

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By existing tests in `TimestampFormatter`.

Closes #28678 from MaxGekk/fastdateformat-as-legacy-frac-formatter.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-05-31 13:05:00 +00:00
Kent Yao 547c5bf552 [SPARK-31867][SQL] Disable year type datetime patterns which are longer than 10
### What changes were proposed in this pull request?

As mentioned in https://github.com/apache/spark/pull/28673 and suggested by cloud-fan at https://github.com/apache/spark/pull/28673#discussion_r432817075

In this PR, we disable datetime patterns of the form `y..y` and `Y..Y` whose lengths are greater than 10, to avoid the sort of JDK bug described below.

The new datetime formatter introduces a silent data change like:

```sql
spark-sql> select from_unixtime(1, 'yyyyyyyyyyy-MM-dd');
NULL
spark-sql> set spark.sql.legacy.timeParserPolicy=legacy;
spark.sql.legacy.timeParserPolicy	legacy
spark-sql> select from_unixtime(1, 'yyyyyyyyyyy-MM-dd');
00000001970-01-01
spark-sql>
```

For patterns that support `SignStyle.EXCEEDS_PAD`, e.g. `y..y` (len >= 4), when using the `NumberPrinterParser` to format them:

```java
switch (signStyle) {
  case EXCEEDS_PAD:
    if (minWidth < 19 && value >= EXCEED_POINTS[minWidth]) {
      buf.append(decimalStyle.getPositiveSign());
    }
    break;

           ....
```
Here `minWidth` equals `len(y..y)`, and the `EXCEED_POINTS` array is:

```java
/**
 * Array of 10 to the power of n.
 */
static final long[] EXCEED_POINTS = new long[] {
    0L,
    10L,
    100L,
    1000L,
    10000L,
    100000L,
    1000000L,
    10000000L,
    100000000L,
    1000000000L,
    10000000000L,
};
```

So when `len(y..y)` is greater than 10, an `ArrayIndexOutOfBoundsException` will be raised.

At the caller side, for `from_unixtime` the exception is suppressed and a silent data change occurs; for `date_format`, the `ArrayIndexOutOfBoundsException` propagates to the user.
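
The guard boils down to a pattern-length check; a minimal sketch (assumed shape, simplified to ignore quoted literals; not the actual Spark checker):

```scala
// Reject patterns with more than 10 consecutive 'y'/'Y' letters before they reach the
// JDK formatter and hit the ArrayIndexOutOfBoundsException described above.
def checkYearPatternLength(pattern: String): Unit = {
  if ("[yY]{11,}".r.findFirstIn(pattern).isDefined) {
    throw new IllegalArgumentException(
      s"Invalid pattern '$pattern': year fields longer than 10 letters are not supported")
  }
}

checkYearPatternLength("yyyy-MM-dd")            // ok
// checkYearPatternLength("yyyyyyyyyyy-MM-dd")  // throws IllegalArgumentException
```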

### Why are the changes needed?
fix silent data change

### Does this PR introduce _any_ user-facing change?

Yes, a `SparkUpgradeException` will take the place of the `null` result when the pattern contains 10 or more consecutive 'y' or 'Y' letters.

### How was this patch tested?

new tests

Closes #28684 from yaooqinn/SPARK-31867-2.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-05-31 12:34:39 +00:00
Maryann Xue b9737c3c22 [SPARK-31864][SQL] Adjust AQE skew join trigger condition
### What changes were proposed in this pull request?

This PR makes a minor change in deciding whether a partition is skewed by comparing the partition size to the median size of coalesced partitions instead of median size of raw partitions before coalescing.
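
As a rough illustration of the trigger condition (hypothetical helper and default thresholds, not the actual AQE rule):

```scala
// A partition is treated as skewed when its size exceeds both a factor of the median
// *coalesced* partition size and an absolute threshold.
def isSkewed(size: Long, coalescedSizes: Seq[Long],
             skewFactor: Double = 5.0, skewThresholdBytes: Long = 256L * 1024 * 1024): Boolean = {
  require(coalescedSizes.nonEmpty)
  val median = coalescedSizes.sorted.apply(coalescedSizes.length / 2)
  size > math.max(skewFactor * median, skewThresholdBytes.toDouble)
}
```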

### Why are the changes needed?

This change is in line with the target size criteria for splitting skew join partitions and can also cope with extra empty partitions caused by over-partitioning. This PR also improves the skew join tests in the AQE test suite.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Updated UTs.

Closes #28669 from maryannxue/spark-31864.

Authored-by: Maryann Xue <maryann.xue@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-05-30 07:47:29 +00:00
Jungtaek Lim (HeartSaVioR) fe1d1e24bc [SPARK-31214][BUILD] Upgrade Janino to 3.1.2
### What changes were proposed in this pull request?

This PR proposes to upgrade Janino to 3.1.2 which is released recently.

Major changes were done for refactoring, and there are also lots of commits titled "no commit message". Below are the pairs of (commit title, commit) which seem to deal with bugs or specific improvements (not pure refactoring) after 3.0.15.

* Issue #119: Guarantee executing popOperand() in popUninitializedVariableOperand() via moving popOperand() out of "assert"
* Issue #116: replace operand to final target type if boxing conversion and widening reference conversion happen together
* Merged pull request `#114` "Grow the code for relocatables, and do fixup, and relocate".
  * 367c58e73e
* issue `#107`: Janino requires "org.codehaus.commons.compiler.io", but commons-compiler does not export this package
  * f7d99596d4
* Throw an NYI CompileException when a static interface method is invoked.
  * efd3884983
* Fixed the promotion of the array access index expression (see JLS7 15.13 Array Access Expressions)
  * 32fdb5f5f1
* Issue `#104`: ClassLoaderIClassLoader 's ClassNotFoundException handle mechanism enhancement
  * 6e8a97d609

You can see the changelog from the link: http://janino-compiler.github.io/janino/changelog.html

### Why are the changes needed?

We got some reports of failures on users' queries where Janino throws an error while compiling the generated code. The issue is here: https://github.com/janino-compiler/janino/issues/113 It contains the generated code, the symptom (error), and an analysis of the bug, so please refer to the link for more details.
Janino 3.1.1 contains the PR https://github.com/janino-compiler/janino/pull/114 which enables Janino to compile the user's query properly. I've also fixed a couple more bugs, as 3.1.1 made Spark UTs fail - hence we need to upgrade to 3.1.2.

Furthermore, from my testing, https://github.com/janino-compiler/janino/issues/90 (which Josh Rosen filed before) seems to be also resolved in 3.1.2 as well.

It looks like Janino is maintained by one person and there are not even version branches or releases/tags, so we can't expect the Janino maintainer to release a new bugfix version - hence we have to try out the new minor version.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Existing UTs.

Closes #27860 from HeartSaVioR/SPARK-31101.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-05-29 07:42:57 -07:00
Yuming Wang 91148f428b [SPARK-28481][SQL] More expressions should extend NullIntolerant
### What changes were proposed in this pull request?

1. Make more expressions extend `NullIntolerant`.
2. Add a checker(in `ExpressionInfoSuite`) to identify whether the expression is `NullIntolerant`.

### Why are the changes needed?

This avoids a skewed join when the join column has many null values and can improve query performance. For example:
```sql
CREATE TABLE t1(c1 string, c2 string) USING parquet;
CREATE TABLE t2(c1 string, c2 string) USING parquet;
EXPLAIN SELECT t1.* FROM t1 JOIN t2 ON upper(t1.c1) = upper(t2.c1);
```

Before and after this PR:
```sql
== Physical Plan ==
*(2) Project [c1#5, c2#6]
+- *(2) BroadcastHashJoin [upper(c1#5)], [upper(c1#7)], Inner, BuildLeft
   :- BroadcastExchange HashedRelationBroadcastMode(List(upper(input[0, string, true]))), [id=#41]
   :  +- *(1) ColumnarToRow
   :     +- FileScan parquet default.t1[c1#5,c2#6]
   +- *(2) ColumnarToRow
      +- FileScan parquet default.t2[c1#7]

== Physical Plan ==
*(2) Project [c1#5, c2#6]
+- *(2) BroadcastHashJoin [upper(c1#5)], [upper(c1#7)], Inner, BuildRight
   :- *(2) Project [c1#5, c2#6]
   :  +- *(2) Filter isnotnull(c1#5)
   :     +- *(2) ColumnarToRow
   :        +- FileScan parquet default.t1[c1#5,c2#6]
   +- BroadcastExchange HashedRelationBroadcastMode(List(upper(input[0, string, true]))), [id=#59]
      +- *(1) Project [c1#7]
         +- *(1) Filter isnotnull(c1#7)
            +- *(1) ColumnarToRow
               +- FileScan parquet default.t2[c1#7]

```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test.

Closes #28626 from wangyum/SPARK-28481.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-05-29 07:28:57 +00:00
GuoPhilipse dfbc5edf20 [SPARK-31839][TESTS] Delete duplicate code in castsuit
### What changes were proposed in this pull request?
Delete duplicate code in CastSuite.

### Why are the changes needed?
keep spark code clean

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
no need

Closes #28655 from GuoPhilipse/delete-duplicate-code-castsuit.

Lead-authored-by: GuoPhilipse <46367746+GuoPhilipse@users.noreply.github.com>
Co-authored-by: GuoPhilipse <guofei_ok@126.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-05-28 09:57:11 +09:00
Wenchen Fan 1528fbced8 [SPARK-31827][SQL] fail datetime parsing/formatting if detect the Java 8 bug of stand-alone form
### What changes were proposed in this pull request?

If `LLL`/`qqq` is used in the datetime pattern string, and the current JDK in use has a bug for the stand-alone form (see https://bugs.openjdk.java.net/browse/JDK-8114833), throw an exception with a clear error message.
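
The bug can be observed with plain JDK APIs (this snippet is only an illustration, not the Spark detection code):

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import java.util.Locale

// On an affected Java 8, the stand-alone month pattern 'LLL' formats to a number
// instead of the abbreviated month name.
val out = DateTimeFormatter.ofPattern("LLL", Locale.US).format(LocalDate.of(1990, 1, 1))
println(out)  // "Jan" on a fixed JDK, "1" when JDK-8114833 is present
```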

### Why are the changes needed?

to keep backward compatibility with Spark 2.4

### Does this PR introduce _any_ user-facing change?

Yes

Spark 2.4
```
scala> sql("select date_format('1990-1-1', 'LLL')").show
+---------------------------------------------+
|date_format(CAST(1990-1-1 AS TIMESTAMP), LLL)|
+---------------------------------------------+
|                                          Jan|
+---------------------------------------------+
```

Spark 3.0 with Java 11
```
scala> sql("select date_format('1990-1-1', 'LLL')").show
+---------------------------------------------+
|date_format(CAST(1990-1-1 AS TIMESTAMP), LLL)|
+---------------------------------------------+
|                                          Jan|
+---------------------------------------------+
```

Spark 3.0 with Java 8
```
// before this PR
+---------------------------------------------+
|date_format(CAST(1990-1-1 AS TIMESTAMP), LLL)|
+---------------------------------------------+
|                                            1|
+---------------------------------------------+
// after this PR
scala> sql("select date_format('1990-1-1', 'LLL')").show
org.apache.spark.SparkUpgradeException
```

### How was this patch tested?

manual test with java 8 and 11

Closes #28646 from cloud-fan/format.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-05-27 18:53:19 +00:00
Max Gekk b5eb0933ac [SPARK-31762][SQL][FOLLOWUP] Avoid double formatting in legacy fractional formatter
### What changes were proposed in this pull request?
Currently, the legacy fractional formatter is based on the implementation from Spark 2.4 which formats the input timestamp twice:
```
    val timestampString = ts.toString
    val formatted = legacyFormatter.format(ts)
```
to strip trailing zeros. This PR proposes to avoid the first formatting by forming the second fraction directly.
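
A minimal sketch of the idea (assumed approach, not the actual patch): derive the fraction from the timestamp's nanoseconds and strip trailing zeros directly:

```scala
import java.sql.Timestamp

def fraction(ts: Timestamp): String = {
  val nanos = ts.getNanos                       // 0 .. 999,999,999
  if (nanos == 0) ""
  else "." + "%09d".format(nanos).reverse.dropWhile(_ == '0').reverse
}

println(fraction(Timestamp.valueOf("2020-05-27 10:00:00.123")))  // .123
```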

### Why are the changes needed?
It makes legacy fractional formatter faster.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By existing test "format fraction of second" in `TimestampFormatterSuite` + added test for timestamps before 1970-01-01 00:00:00Z

Closes #28643 from MaxGekk/optimize-legacy-fract-format.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-05-27 18:44:22 +00:00
Kent Yao 311fe6a880 [SPARK-31835][SQL][TESTS] Add zoneId to codegen related tests in DateExpressionsSuite
### What changes were proposed in this pull request?

This PR modifies some codegen-related tests to test escape characters for datetime functions which are time zone aware. If the time zone is absent, the formatter could result in `null` caused by `java.util.NoSuchElementException: None.get`, bypassing the real intention of those test cases.

### Why are the changes needed?

fix tests

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

passing the modified test cases.

Closes #28653 from yaooqinn/SPARK-31835.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-05-27 17:26:07 +00:00
Ali Afroozeh f6f1e51072 [SPARK-31719][SQL] Refactor JoinSelection
### What changes were proposed in this pull request?
This PR extracts the logic for selecting the planned join type out of the `JoinSelection` rule and moves it to `JoinSelectionHelper` in Catalyst.

### Why are the changes needed?
This change both cleans up the code in `JoinSelection` and allows the logic to be in one place and be used from other rules that need to make decision based on the join type before the planning time.

### Does this PR introduce _any_ user-facing change?
`BuildSide`, `BuildLeft`, and `BuildRight` are moved from `org.apache.spark.sql.execution` to Catalyst in `org.apache.spark.sql.catalyst.optimizer`.

### How was this patch tested?
This is a refactoring, passes existing tests.

Closes #28540 from dbaliafroozeh/RefactorJoinSelection.

Authored-by: Ali Afroozeh <ali.afroozeh@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-05-27 15:49:08 +00:00
beliefer 8f2b6f3a0b [SPARK-31393][SQL][FOLLOW-UP] Show the correct alias in schema for expression
### What changes were proposed in this pull request?
Some alias of expression can not display correctly in schema. This PR will fix them.
- `ln`
- `rint`
- `lcase`
- `position`

### Why are the changes needed?
Improve the implement of some expression.

### Does this PR introduce _any_ user-facing change?
Yes. This PR will let users see the correct alias in the schema.

### How was this patch tested?
Jenkins test.

Closes #28551 from beliefer/show-correct-alias-in-schema.

Lead-authored-by: beliefer <beliefer@163.com>
Co-authored-by: gengjiaan <gengjiaan@360.cn>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-05-27 15:05:06 +09:00
HyukjinKwon df2a1fe131 [SPARK-31808][SQL] Makes struct function's output name and class name pretty
### What changes were proposed in this pull request?

This PR proposes to set the alias, and class name in its `ExpressionInfo` for `struct`.
- Class name in `ExpressionInfo`
  - from: `org.apache.spark.sql.catalyst.expressions.NamedStruct`
  - to:`org.apache.spark.sql.catalyst.expressions.CreateNamedStruct`
- Alias name: `named_struct(col1, v, ...)` -> `struct(v, ...)`

This PR takes over https://github.com/apache/spark/pull/28631

### Why are the changes needed?

To show the correct output name and class names to users.

### Does this PR introduce _any_ user-facing change?

Yes.

**Before:**

```scala
scala> sql("DESC FUNCTION struct").show(false)
+------------------------------------------------------------------------------------+
|function_desc                                                                       |
+------------------------------------------------------------------------------------+
|Function: struct                                                                    |
|Class: org.apache.spark.sql.catalyst.expressions.NamedStruct                        |
|Usage: struct(col1, col2, col3, ...) - Creates a struct with the given field values.|
+------------------------------------------------------------------------------------+
```

```scala
scala> sql("SELECT struct(1, 2)").show(false)
+------------------------------+
|named_struct(col1, 1, col2, 2)|
+------------------------------+
|[1, 2]                        |
+------------------------------+
```

**After:**

```scala
scala> sql("DESC FUNCTION struct").show(false)
+------------------------------------------------------------------------------------+
|function_desc                                                                       |
+------------------------------------------------------------------------------------+
|Function: struct                                                                    |
|Class: org.apache.spark.sql.catalyst.expressions.CreateNamedStruct                  |
|Usage: struct(col1, col2, col3, ...) - Creates a struct with the given field values.|
+------------------------------------------------------------------------------------+
```

```scala
scala> sql("SELECT struct(1, 2)").show(false)
+------------+
|struct(1, 2)|
+------------+
|[1, 2]      |
+------------+
```

### How was this patch tested?

Manually tested, and Jenkins tests.

Closes #28633 from HyukjinKwon/SPARK-31808.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-05-25 20:36:00 -07:00
Max Gekk 6c80ebbccb [SPARK-31818][SQL] Fix pushing down filters with java.time.Instant values in ORC
### What changes were proposed in this pull request?
Convert `java.time.Instant` to `java.sql.Timestamp` in pushed down filters to ORC datasource when Java 8 time API enabled.

### Why are the changes needed?
The changes fix the exception raised while pushing date filters when `spark.sql.datetime.java8API.enabled` is set to `true`:
```
java.lang.IllegalArgumentException: Wrong value class java.time.Instant for TIMESTAMP.EQUALS leaf
 at org.apache.hadoop.hive.ql.io.sarg.SearchArgumentImpl$PredicateLeafImpl.checkLiteralType(SearchArgumentImpl.java:192)
 at org.apache.hadoop.hive.ql.io.sarg.SearchArgumentImpl$PredicateLeafImpl.<init>(SearchArgumentImpl.java:75)
```

### Does this PR introduce any user-facing change?
Yes

### How was this patch tested?
Added tests to `OrcFilterSuite`.

Closes #28636 from MaxGekk/orc-timestamp-filter-pushdown.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-05-25 18:36:02 -07:00
Kent Yao 695cb617d4 [SPARK-31771][SQL] Disable Narrow TextStyle for datetime pattern 'G/M/L/E/u/Q/q'
### What changes were proposed in this pull request?

Five consecutive pattern characters of 'G/M/L/E/u/Q/q' mean Narrow-Text Style since we switched to `java.time.DateTimeFormatterBuilder` in 3.0.0, which outputs only the leading letter of the value, e.g. `December` becomes `D`. In Spark 2.4 they mean Full-Text Style.

In this PR, we explicitly disable Narrow-Text Style for these pattern characters.
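
The JDK behavior being disabled can be seen with a plain `java.time` snippet (illustration only, not Spark code):

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import java.util.Locale

// Five pattern letters select the Narrow text style, which prints only the leading letter.
val d = LocalDate.of(2019, 12, 1)
println(DateTimeFormatter.ofPattern("MMMM", Locale.US).format(d))   // December (Full)
println(DateTimeFormatter.ofPattern("MMMMM", Locale.US).format(d))  // D        (Narrow)
```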

### Why are the changes needed?

Without this change, there will be a silent data change.

### Does this PR introduce _any_ user-facing change?

Yes, queries with datetime operations using datetime patterns such as `G/M/L/E/u` will fail if the pattern length is 5; other patterns, e.g. 'k', 'm', also accept only a certain number of letters.

1. Datetime patterns that are not supported by the new parser but are supported by the legacy one will get a SparkUpgradeException, e.g. "GGGGG", "MMMMM", "LLLLL", "EEEEE", "uuuuu", "aa", "aaa". Two options are given to end users: one is to use legacy mode, and the other is to follow the new online doc for correct datetime patterns.

2. Datetime patterns that are supported by neither the new parser nor the legacy one, e.g. "QQQQQ", "qqqqq", will get an IllegalArgumentException, which is caught by Spark internally and results in NULL for end users.

### How was this patch tested?

add unit tests

Closes #28592 from yaooqinn/SPARK-31771.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-05-25 15:07:41 +00:00
Kent Yao 0df8dd6073 [SPARK-30352][SQL] DataSourceV2: Add CURRENT_CATALOG function
### What changes were proposed in this pull request?

As we support multiple catalogs with DataSourceV2, we may need the `CURRENT_CATALOG` value expression from the SQL standard.

`CURRENT_CATALOG` is a general value specification in the SQL Standard, described as:

> The value specified by CURRENT_CATALOG is the character string that represents the current default catalog name.

### Why are the changes needed?
improve catalog v2 with ANSI SQL standard.

### Does this PR introduce any user-facing change?
Yes, it adds a new function `current_catalog()` that returns the currently active catalog.

### How was this patch tested?

add ut

Closes #27006 from yaooqinn/SPARK-30352.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-05-25 14:27:47 +00:00
Max Gekk 7f36310500 [SPARK-31802][SQL] Format Java date-time types in Row.jsonValue directly
### What changes were proposed in this pull request?
Use `format()` methods for Java date-time types in `Row.jsonValue`. The PR https://github.com/apache/spark/pull/28582 added the methods to avoid conversions to days and microseconds.

### Why are the changes needed?
To avoid unnecessary overhead of converting Java date-time types to micros/days before formatting. Also formatters have to convert input micros/days back to Java types to pass instances to standard library API.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By existing tests in `RowJsonSuite`.

Closes #28620 from MaxGekk/toJson-format-Java-datetime-types.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-05-25 12:50:38 +09:00
sandeep katta cf7463f309 [SPARK-31761][SQL] cast integer to Long to avoid IntegerOverflow for IntegralDivide operator
### What changes were proposed in this pull request?
The `IntegralDivide` operator returns the Long data type, so the integer overflow case should be handled.
If the operands are of type Int, they will be cast to Long.
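
A minimal illustration of the overflow being avoided (plain Scala, not the Spark change):

```scala
// In 32-bit arithmetic Int.MinValue / -1 silently wraps around, while the widened Long
// division gives the value a Long-typed `div` result should return.
val overflowed = Int.MinValue / -1            // -2147483648 (wraps, still an Int)
val widened    = Int.MinValue.toLong / -1L    //  2147483648 (correct as Long)
println((overflowed, widened))
```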

### Why are the changes needed?
As `IntegralDivide` returns Long datatype, integer overflow should not happen

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added UT and also tested in the local cluster

After fix

![image](https://user-images.githubusercontent.com/35216143/82603361-25eccc00-9bd0-11ea-9ca7-001c539e628b.png)

SQL Test

After fix
![image](https://user-images.githubusercontent.com/35216143/82637689-f0250300-9c22-11ea-85c3-886ab2c23471.png)

Before Fix
![image](https://user-images.githubusercontent.com/35216143/82637984-878a5600-9c23-11ea-9e47-5ce2fb923c01.png)

Closes #28600 from sandeep-katta/integerOverFlow.

Authored-by: sandeep katta <sandeep.katta2007@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-05-24 14:50:11 +09:00
TJX2014 2115c55efe [SPARK-31710][SQL] Adds TIMESTAMP_SECONDS, TIMESTAMP_MILLIS and TIMESTAMP_MICROS functions
### What changes were proposed in this pull request?
Add and register three new functions: `TIMESTAMP_SECONDS`, `TIMESTAMP_MILLIS` and `TIMESTAMP_MICROS`
A test is added.

Reference: [BigQuery](https://cloud.google.com/bigquery/docs/reference/standard-sql/timestamp_functions?hl=en#timestamp_seconds)

### Why are the changes needed?
People will have a convenient way to get timestamps from seconds, milliseconds and microseconds.

### Does this PR introduce _any_ user-facing change?
Yes, people will have the following ways to get timestamp:

```scala
sql("select TIMESTAMP_SECONDS(t.a) as timestamp from values(1230219000),(-1230219000) as t(a)").show(false)
```
```
+-------------------+
|timestamp          |
+-------------------+
|2008-12-25 23:30:00|
|1931-01-07 16:30:00|
+-------------------+
```
```scala
sql("select TIMESTAMP_MILLIS(t.a) as timestamp from values(1230219000123),(-1230219000123) as t(a)").show(false)
```
```
+-----------------------+
|timestamp              |
+-----------------------+
|2008-12-25 23:30:00.123|
|1931-01-07 16:29:59.877|
+-----------------------+
```
```scala
sql("select TIMESTAMP_MICROS(t.a) as timestamp from values(1230219000123123),(-1230219000123123) as t(a)").show(false)
```
```
+--------------------------+
|timestamp                 |
+--------------------------+
|2008-12-25 23:30:00.123123|
|1931-01-07 16:29:59.876877|
+--------------------------+
```
### How was this patch tested?
Unit test.

Closes #28534 from TJX2014/master-SPARK-31710.

Authored-by: TJX2014 <xiaoxingstack@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-05-22 14:16:30 +00:00
Wenchen Fan ce4da29ec3 [SPARK-31755][SQL] allow missing year/hour when parsing date/timestamp string
### What changes were proposed in this pull request?

This PR allows missing hour fields when parsing date/timestamp string, with 0 as the default value.

If the year field is missing, this PR still fails the query by default, but provides a new legacy config to allow it and use 1970 as the default value. It's not a good default value, as 1970 is not a leap year, which means Feb 29 would never be parsed. We just pick it for backward compatibility.

### Why are the changes needed?

To keep backward compatibility with Spark 2.4.

### Does this PR introduce _any_ user-facing change?

Yes.

Spark 2.4:
```
scala> sql("select to_timestamp('16', 'dd')").show
+------------------------+
|to_timestamp('16', 'dd')|
+------------------------+
|     1970-01-16 00:00:00|
+------------------------+

scala> sql("select to_date('16', 'dd')").show
+-------------------+
|to_date('16', 'dd')|
+-------------------+
|         1970-01-16|
+-------------------+

scala> sql("select to_timestamp('2019 40', 'yyyy mm')").show
+----------------------------------+
|to_timestamp('2019 40', 'yyyy mm')|
+----------------------------------+
|               2019-01-01 00:40:00|
+----------------------------------+

scala> sql("select to_timestamp('2019 10:10:10', 'yyyy hh:mm:ss')").show
+----------------------------------------------+
|to_timestamp('2019 10:10:10', 'yyyy hh:mm:ss')|
+----------------------------------------------+
|                           2019-01-01 10:10:10|
+----------------------------------------------+
```

in branch 3.0
```
scala> sql("select to_timestamp('16', 'dd')").show
+--------------------+
|to_timestamp(16, dd)|
+--------------------+
|                null|
+--------------------+

scala> sql("select to_date('16', 'dd')").show
+---------------+
|to_date(16, dd)|
+---------------+
|           null|
+---------------+

scala> sql("select to_timestamp('2019 40', 'yyyy mm')").show
+------------------------------+
|to_timestamp(2019 40, yyyy mm)|
+------------------------------+
|           2019-01-01 00:00:00|
+------------------------------+

scala> sql("select to_timestamp('2019 10:10:10', 'yyyy hh:mm:ss')").show
+------------------------------------------+
|to_timestamp(2019 10:10:10, yyyy hh:mm:ss)|
+------------------------------------------+
|                       2019-01-01 00:00:00|
+------------------------------------------+
```

After this PR, the behavior becomes the same as 2.4, if the legacy config is enabled.

### How was this patch tested?

new tests

Closes #28576 from cloud-fan/bug.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-05-22 16:10:08 +09:00
Max Gekk 5d673319af [SPARK-31762][SQL] Fix perf regression of date/timestamp formatting in toHiveString
### What changes were proposed in this pull request?
1. Add new methods that accept date-time Java types to the DateFormatter and TimestampFormatter traits. The methods format input date-time instances to strings:
    - TimestampFormatter:
      - `def format(ts: Timestamp): String`
      - `def format(instant: Instant): String`
    - DateFormatter:
      - `def format(date: Date): String`
      - `def format(localDate: LocalDate): String`
2. Re-use the added methods from `HiveResult.toHiveString`
3. Borrow the code for formatting of `java.sql.Timestamp` from Spark 2.4 `DateTimeUtils.timestampToString` to `FractionTimestampFormatter` because legacy formatters don't support variable length patterns for seconds fractions.

### Why are the changes needed?
To avoid unnecessary overhead of converting Java date-time types to micros/days before formatting. Also formatters have to convert input micros/days back to Java types to pass instances to standard library API.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By existing tests for toHiveString and new tests in `TimestampFormatterSuite`.

Closes #28582 from MaxGekk/opt-format-old-types.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-05-21 04:01:19 +00:00
Wenchen Fan 34414acfa3 [SPARK-31706][SQL] add back the support of streaming update mode
### What changes were proposed in this pull request?

This PR adds a private `WriteBuilder` mixin trait: `SupportsStreamingUpdate`, so that the builtin v2 streaming sinks can still support the update mode.

Note: it's private because we don't have a proper design yet. I didn't take the proposal in https://github.com/apache/spark/pull/23702#discussion_r258593059 because we may want something more general, like updating by an expression `key1 = key2 + 10`.

### Why are the changes needed?

In Spark 2.4, all builtin v2 streaming sinks support all streaming output modes, and v2 sinks are enabled by default, see https://issues.apache.org/jira/browse/SPARK-22911

It's too risky for 3.0 to go back to v1 sinks, so I propose to add a private trait to fix builtin v2 sinks, to keep backward compatibility.

### Does this PR introduce _any_ user-facing change?

Yes, now all the builtin v2 streaming sinks support all streaming output modes, which is the same as 2.4

### How was this patch tested?

existing tests.

Closes #28523 from cloud-fan/update.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-05-20 03:45:13 +00:00
yi.wu 0fd98abd85 [SPARK-31750][SQL] Eliminate UpCast if child's dataType is DecimalType
### What changes were proposed in this pull request?

Eliminate the `UpCast` if its child's data type is already a decimal type.

### Why are the changes needed?

While deserializing an internal `Decimal` value to an external `BigDecimal` (Java/Scala) value, Spark should also respect `Decimal`'s precision and scale; otherwise it will cause precision loss and look weird in some cases, e.g.:

```
sql("select cast(11111111111111111111111111111111111111 as decimal(38, 0)) as d")
  .write.mode("overwrite")
  .parquet(f.getAbsolutePath)

// can fail
spark.read.parquet(f.getAbsolutePath).as[BigDecimal]
```
```
[info]   org.apache.spark.sql.AnalysisException: Cannot up cast `d` from decimal(38,0) to decimal(38,18).
[info] The type path of the target object is:
[info] - root class: "scala.math.BigDecimal"
[info] You can either add an explicit cast to the input data or choose a higher precision type of the field in the target object;
[info]   at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$.org$apache$spark$sql$catalyst$analysis$Analyzer$ResolveUpCast$$fail(Analyzer.scala:3060)
[info]   at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$$anonfun$apply$33$$anonfun$applyOrElse$174.applyOrElse(Analyzer.scala:3087)
[info]   at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast$$anonfun$apply$33$$anonfun$applyOrElse$174.applyOrElse(Analyzer.scala:3071)
[info]   at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:309)
[info]   at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72)
[info]   at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:309)
[info]   at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$3(TreeNode.scala:314)
```

### Does this PR introduce _any_ user-facing change?

Yes. Cases like the one above (which cause precision loss) fail before this change but run successfully after it.

### How was this patch tested?

Added tests.

Closes #28572 from Ngone51/fix_encoder.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-05-20 11:00:58 +09:00
Max Gekk fc5b90243c [SPARK-31727][SQL] Fix error message of casting timestamp to int in ANSI non-codegen mode
### What changes were proposed in this pull request?
Change timestamp casting to int in ANSI and non-codegen mode, and make the error message consistent to the error messages in the codegen mode. In particular, casting to int is implemented in the same way as casting to short and byte.

### Why are the changes needed?
1. The error message in the non-codegen mode is diversed from the error message in the codegen mode.
2. The error message contains intermediate results that could confuse users.

### Does this PR introduce _any_ user-facing change?
Yes. Before the changes, the error message of casting a timestamp to int contains an intermediate result, but after the changes it contains the input value which causes the arithmetic overflow.

### How was this patch tested?
By running the modified test suite `AnsiCastSuite`.

Closes #28549 from MaxGekk/fix-error-msg-cast-timestamp.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-05-18 05:00:50 +00:00
Jungtaek Lim (HeartSaVioR) d2bec5e265 [SPARK-31707][SQL] Revert SPARK-30098 Use default datasource as provider for CREATE TABLE syntax
### What changes were proposed in this pull request?

This patch effectively reverts SPARK-30098 via below changes:

* Removed the config
* Removed the changes done in parser rule
* Removed the usage of config in tests
  * Removed tests which depend on the config
  * Rolled back some tests to before SPARK-30098 which were affected by SPARK-30098
* Reflect the change into docs (migration doc, create table syntax)

### Why are the changes needed?

SPARK-30098 brought confusion and frustration around using the CREATE TABLE DDL query, and we agreed about the bad effect of the change.

Please go through the [discussion thread](http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-Resolve-ambiguous-parser-rule-between-two-quot-create-table-quot-s-td29051i20.html) to see the details.

### Does this PR introduce _any_ user-facing change?

No, compared to Spark 2.4.x. End users who experimented with the Spark 3.0.0 previews will see the behavior going back to Spark 2.4.x, but I believe we don't guarantee compatibility in preview releases.

### How was this patch tested?

Existing UTs.

Closes #28517 from HeartSaVioR/revert-SPARK-30098.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-05-17 02:27:23 +00:00
Max Gekk 5539ecfdac [SPARK-31725][CORE][SQL][TESTS] Set America/Los_Angeles time zone and Locale.US in tests by default
### What changes were proposed in this pull request?
Set the default time zone and locale in the default constructor of `SparkFunSuite` (a minimal sketch follows the list):
- Default time zone to `America/Los_Angeles`
- Default locale to `Locale.US`
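
A minimal sketch of the defaults being pinned (plain JDK calls; the exact placement in `SparkFunSuite` is assumed):

```scala
import java.util.{Locale, TimeZone}

// Pin the JVM defaults so every test observes the same time zone and locale.
TimeZone.setDefault(TimeZone.getTimeZone("America/Los_Angeles"))
Locale.setDefault(Locale.US)
```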

### Why are the changes needed?
1. To deduplicate code by moving common time zone and locale settings to one place SparkFunSuite
2. To have the same default time zone and locale in all tests. This should prevent errors like https://github.com/apache/spark/pull/28538

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
by running all affected test suites

Closes #28548 from MaxGekk/timezone-settings-SparkFunSuite.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-05-17 02:26:00 +00:00
Yuanjian Li 86bd37f37e [SPARK-31663][SQL] Grouping sets with having clause returns the wrong result
### What changes were proposed in this pull request?
- Resolve the having condition together with expanding the GROUPING SETS/CUBE/ROLLUP expressions in `ResolveGroupingAnalytics`:
    - Change the operations resolving directions to top-down.
    - Try resolving the condition of the filter as though it is in the aggregate clause by reusing the function in `ResolveAggregateFunctions`
    - Push the aggregate expressions into the aggregate which contains the expanded operations.
- Use UnresolvedHaving for all having clauses.

### Why are the changes needed?
Correctness bug fix. See the demo and analysis in SPARK-31663.

### Does this PR introduce _any_ user-facing change?
Yes, correctness bug fix for HAVING with GROUPING SETS.

### How was this patch tested?
New UTs added.

Closes #28501 from xuanyuanking/SPARK-31663.

Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-05-16 04:37:18 +00:00
Max Gekk c7ce37dfa7 [SPARK-31712][SQL][TESTS] Check casting timestamps before the epoch to Byte/Short/Int/Long types
### What changes were proposed in this pull request?
Added tests to check casting timestamps before 1970-01-01 00:00:00Z to ByteType, ShortType, IntegerType and LongType in ansi and non-ansi modes.

### Why are the changes needed?
To improve test coverage and prevent errors while modifying the CAST expression code.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running the modified test suites:
```
$ ./build/sbt "test:testOnly *CastSuite"
```

Closes #28531 from MaxGekk/test-cast-timestamp-to-byte.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-05-15 04:24:58 +00:00
sunke.03 ddbce4edee [SPARK-30973][SQL] ScriptTransformationExec should wait for the termination …
### What changes were proposed in this pull request?

This PR tries to fix a bug in `org.apache.spark.sql.hive.execution.ScriptTransformationExec`. This bug appeared in our online cluster. `ScriptTransformationExec` should throw an exception when a user uses a Python script which contains a parse error, but the current implementation may miss this case of failure.

### Why are the changes needed?

When a user uses a Python script which contains a parse error, there will be no output, so `scriptOutputReader.next(scriptOutputWritable) <= 0` matches and we use `checkFailureAndPropagate()` to check the `proc`. But the `proc` may still be alive and `writerThread.exception` is not defined, so `checkFailureAndPropagate` cannot detect this case of failure. In the end, the Spark SQL job runs successfully and returns no result. In fact, the Spark SQL job should fail and show the exception properly.

For example, the erroneous Python script is below.
``` python
# encoding: utf8
import unknow_module
import sys

for line in sys.stdin:
    print line
```
The bug can be reproduced by running the following code in our cluster.
```
spark.range(100*100).toDF("index").createOrReplaceTempView("test")
spark.sql("select TRANSFORM(index) USING 'python error_python.py' as new_index from test").collect.foreach(println)
```

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Existing UT

Closes #27724 from slamke/transformation.

Authored-by: sunke.03 <sunke.03@bytedance.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-05-14 13:55:24 +00:00
Wenchen Fan fd2d55c991 [SPARK-31405][SQL] Fail by default when reading/writing legacy datetime values from/to Parquet/Avro files
### What changes were proposed in this pull request?

When reading/writing datetime values that before the rebase switch day, from/to Avro/Parquet files, fail by default and ask users to set a config to explicitly do rebase or not.

### Why are the changes needed?

Rebase or not rebase have different behaviors and we should let users decide it explicitly. In most cases, users won't hit this exception as it only affects ancient datetime values.

### Does this PR introduce _any_ user-facing change?

Yes, now users will see an error when reading/writing dates before 1582-10-15 or timestamps before 1900-01-01 from/to Parquet/Avro files, with an error message to ask setting a config.

### How was this patch tested?

updated tests

Closes #28477 from cloud-fan/rebase.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-05-14 12:32:40 +09:00
Max Gekk a3fafddf39 [SPARK-31680][SQL][TESTS] Support Java 8 datetime types by Random data generator
### What changes were proposed in this pull request?
Generates java.time.Instant/java.time.LocalDate for DateType/TimestampType by `RandomDataGenerator.forType` when the SQL config `spark.sql.datetime.java8API.enabled` is set to `true`.

### Why are the changes needed?
To improve test coverage, and check java.time.Instant/java.time.LocalDate types in round trip tests.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running modified test suites `RowEncoderSuite`, `RandomDataGeneratorSuite` and `HadoopFsRelationTest`.

Closes #28502 from MaxGekk/random-java8-datetime.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-05-12 14:05:31 +00:00
Javier Fuentes 178ca961fe [SPARK-31102][SQL] Spark-sql fails to parse when contains comment
### What changes were proposed in this pull request?

This PR changes the insideComment flag to false on a newline, fixing the issue introduced by SPARK-30049.

### Why are the changes needed?

Previously on SPARK-30049 a comment containing an unclosed quote produced the following issue:
```
spark-sql> SELECT 1 -- someone's comment here
         > ;
Error in query:
extraneous input ';' expecting <EOF>(line 2, pos 0)

== SQL ==
SELECT 1 -- someone's comment here
;
^^^
```

This was caused because there was no flag inside the splitSemiColon method to make comment sections ignore quotes. SPARK-30049 added that flag and fixed the issue, but introduced the following problem:
```
spark-sql> select
         >   1,
         >   -- two
         >   2;
Error in query:
mismatched input '<EOF>' expecting {'(', 'ADD', 'AFTER', 'ALL', 'ALTER', ...}(line 3, pos 2)
== SQL ==
select
  1,
--^^^
```
This issue is caused by not turning off the insideComment flag on a newline.
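
A rough sketch of the scanner-state fix (simplified: quote handling omitted; this is not the actual `splitSemiColon` code):

```scala
// A single-line comment must end at the newline, otherwise everything after
// "-- comment\n" keeps being swallowed as comment text.
def stripLineComments(sql: String): String = {
  val out = new StringBuilder
  var insideComment = false
  var i = 0
  while (i < sql.length) {
    val c = sql.charAt(i)
    if (insideComment) {
      if (c == '\n') { insideComment = false; out.append(c) }   // the missing turn-off
    } else if (c == '-' && i + 1 < sql.length && sql.charAt(i + 1) == '-') {
      insideComment = true
      i += 1
    } else {
      out.append(c)
    }
    i += 1
  }
  out.toString
}

println(stripLineComments("select\n  1,\n  -- two\n  2;"))
```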

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

- For previous tests using line continuation (`\`), a line-continuation rule was added to the SqlBase.g4 file to bring that functionality to the SQL context.
- A new test for inline comments was added.

Closes #27920 from javierivanov/SPARK-31102.

Authored-by: Javier Fuentes <j.fuentes.m@icloud.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-05-12 13:46:24 +00:00
beliefer a89006aba0 [SPARK-31393][SQL] Show the correct alias in schema for expression
### What changes were proposed in this pull request?
Some alias of expression can not display correctly in schema. This PR will fix them.
- `TimeWindow`
- `MaxBy`
- `MinBy`
- `UnaryMinus`
- `BitwiseCount`

This PR also fix a typo issue, please look at b7cde42b04/sql/core/src/test/scala/org/apache/spark/sql/ExpressionsSchemaSuite.scala (L142)

Note:

1. `MaxBy` and `MinBy` extend `MaxMinBy`, and the latter adds a method `funcName` that is not needed. We can reuse `prettyName` to replace `funcName`.
2. Some Spark SQL functions have inelegant implementations. For example, `BitwiseCount` overrides the sql method as shown below:
`override def sql: String = s"bit_count(${child.sql})"`
I don't think it's elegant enough, because `Expression` gives the following definition:
```
  def sql: String = {
    val childrenSQL = children.map(_.sql).mkString(", ")
    s"$prettyName($childrenSQL)"
  }
```
By this definition, `BitwiseCount` should override the `prettyName` method.

### Why are the changes needed?
Improve the implement of some expression.

### Does this PR introduce any user-facing change?
Yes. This PR will let users see the correct alias in the schema.

### How was this patch tested?
Jenkins test.

Closes #28164 from beliefer/elegant-pretty-name-for-function.

Lead-authored-by: beliefer <beliefer@163.com>
Co-authored-by: gengjiaan <gengjiaan@360.cn>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-05-12 10:25:04 +09:00
Max Gekk 32a5398b65 [SPARK-31665][SQL][TESTS] Check parquet dictionary encoding of random dates/timestamps
### What changes were proposed in this pull request?
Modified `RandomDataGenerator.forType` for DateType and TimestampType to generate special date/timestamp values with 0.5 probability. This will trigger dictionary encoding in the Parquet datasource test HadoopFsRelationTest "test all data types". Currently, dictionary encoding is tested only for numeric types like ShortType.
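
A minimal sketch of the generation strategy (hypothetical special values, not the actual `RandomDataGenerator` change):

```scala
import java.sql.Date
import scala.util.Random

// Return one of a few special dates with probability 0.5 so that repeated values occur
// often enough for Parquet to build a dictionary for the column.
val specialDates =
  Seq(Date.valueOf("0001-01-01"), Date.valueOf("1582-10-15"), Date.valueOf("1970-01-01"))

def nextDate(rand: Random): Date =
  if (rand.nextBoolean()) specialDates(rand.nextInt(specialDates.length))
  else new Date(rand.nextInt(200 * 365) * 86400000L)  // arbitrary "uniform" date, illustration only
```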

### Why are the changes needed?
To extend test coverage. Currently, the probability of testing dictionary encoding in the test HadoopFsRelationTest "test all data types" for DateType and TimestampType is close to zero because dates/timestamps are uniformly distributed in a wide range, and the chance of generating the same values is pretty low. As a result, the Parquet datasource cannot apply dictionary encoding for such column types.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running `ParquetHadoopFsRelationSuite` and `JsonHadoopFsRelationSuite`.

Closes #28481 from MaxGekk/test-random-parquet-dict-enc.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-05-11 12:59:41 +00:00
Max Gekk 9f768fa991 [SPARK-31669][SQL][TESTS] Fix RowEncoderSuite failures on non-existing dates/timestamps
### What changes were proposed in this pull request?
Shift non-existing dates in the Proleptic Gregorian calendar by 1 day. The reason for that is `RowEncoderSuite` generates random dates/timestamps in the hybrid calendar, and some dates/timestamps don't exist in the Proleptic Gregorian calendar, like 1000-02-29, because 1000 is not a leap year in the Proleptic Gregorian calendar.

### Why are the changes needed?
This makes RowEncoderSuite much more stable.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running RowEncoderSuite after setting a non-existing date manually:
```scala
val date = new java.sql.Date(1000 - 1900, 1, 29)
Try { date.toLocalDate; date }.getOrElse(new Date(date.getTime + MILLIS_PER_DAY))
```

Closes #28486 from MaxGekk/fix-RowEncoderSuite.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-05-10 14:22:12 -05:00
Kent Yao b31ae7bb0b [SPARK-31615][SQL] Pretty string output for sql method of RuntimeReplaceable expressions
### What changes were proposed in this pull request?

RuntimeReplaceable expressions are replaced at runtime; thus, their original parameters are not resolved to PrettyAttribute and remain debug-style strings if we implement their `sql` methods directly from their parameters' `sql` methods.

This PR is raised following suggestions by maropu and cloud-fan at https://github.com/apache/spark/pull/28402/files#r417656589. In this PR, we re-implement the `sql` methods of the RuntimeReplaceable ones with toPrettySQL.

### Why are the changes needed?

Consistency of schema output between RuntimeReplaceable expressions and normal ones.

For example, `date_format` vs `to_timestamp`: before this PR, they output differently.

#### Before
```sql
select date_format(timestamp '2019-10-06', 'yyyy-MM-dd uuuu')
struct<date_format(TIMESTAMP '2019-10-06 00:00:00', yyyy-MM-dd uuuu):string>

select to_timestamp("2019-10-06S10:11:12.12345", "yyyy-MM-dd'S'HH:mm:ss.SSSSSS")
struct<to_timestamp('2019-10-06S10:11:12.12345', 'yyyy-MM-dd\'S\'HH:mm:ss.SSSSSS'):timestamp>
```
#### After

```sql
select date_format(timestamp '2019-10-06', 'yyyy-MM-dd uuuu')
struct<date_format(TIMESTAMP '2019-10-06 00:00:00', yyyy-MM-dd uuuu):string>

select to_timestamp("2019-10-06T10:11:12'12", "yyyy-MM-dd'T'HH:mm:ss''SSSS")

struct<to_timestamp(2019-10-06T10:11:12'12, yyyy-MM-dd'T'HH:mm:ss''SSSS):timestamp>

```

### Does this PR introduce _any_ user-facing change?

Yes, the schema output style changed for the runtime replaceable expressions as shown in the above example

### How was this patch tested?
regenerate all related tests

Closes #28420 from yaooqinn/SPARK-31615.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-05-07 14:40:26 +09:00
Liang-Chi Hsieh 9bf738724a [SPARK-31365][SQL][FOLLOWUP] Refine config document for nested predicate pushdown
### What changes were proposed in this pull request?

This is a followup to address the https://github.com/apache/spark/pull/28366#discussion_r420611872 by refining the SQL config document.

### Why are the changes needed?

Make developers less confusing.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Only doc change.

Closes #28468 from viirya/SPARK-31365-followup.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-05-07 09:57:08 +09:00
HyukjinKwon 5c5dd77d6a [SPARK-31647][SQL] Deprecate 'spark.sql.optimizer.metadataOnly' configuration
### What changes were proposed in this pull request?

This PR proposes to deprecate 'spark.sql.optimizer.metadataOnly' configuration and remove it in the future release.

### Why are the changes needed?

This optimization can cause a potential correctness issue, see also SPARK-26709.
Also, it seems difficult to extend the optimization. Basically you should whitelist all available functions. It costs some maintenance overhead, see also SPARK-31590.

Looks we should just better let users use `SparkSessionExtensions` instead if they must use, and remove it in Spark side.

### Does this PR introduce _any_ user-facing change?

Yes, setting `spark.sql.optimizer.metadataOnly` will show a deprecation warning:

```scala
scala> spark.conf.unset("spark.sql.optimizer.metadataOnly")
```
```
20/05/06 12:57:23 WARN SQLConf: The SQL config 'spark.sql.optimizer.metadataOnly' has been
 deprecated in Spark v3.0 and may be removed in the future. Avoid to depend on this optimization
 to prevent a potential correctness issue. If you must use, use 'SparkSessionExtensions' instead to
inject it as a custom rule.
```
```scala
scala> spark.conf.set("spark.sql.optimizer.metadataOnly", "true")
```
```
20/05/06 12:57:44 WARN SQLConf: The SQL config 'spark.sql.optimizer.metadataOnly' has been
deprecated in Spark v3.0 and may be removed in the future. Avoid to depend on this optimization
 to prevent a potential correctness issue. If you must use, use 'SparkSessionExtensions' instead to
inject it as a custom rule.
```

### How was this patch tested?

Manually tested.

Closes #28459 from HyukjinKwon/SPARK-31647.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-05-07 09:00:59 +09:00
Liang-Chi Hsieh 4952f1a03c [SPARK-31365][SQL] Enable nested predicate pushdown per data sources
### What changes were proposed in this pull request?

This patch proposes to replace `NESTED_PREDICATE_PUSHDOWN_ENABLED` with `NESTED_PREDICATE_PUSHDOWN_V1_SOURCE_LIST` which can configure which v1 data sources are enabled with nested predicate pushdown.

### Why are the changes needed?

We added the nested predicate pushdown feature, configured by `NESTED_PREDICATE_PUSHDOWN_ENABLED`. However, this config is an all-or-nothing config and applies to all data sources.

In order to not introduce API breaking change after enabling nested predicate pushdown, we'd like to set nested predicate pushdown per data sources. Please also refer to the comments https://github.com/apache/spark/pull/27728#discussion_r410829720.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Added/Modified unit tests.

Closes #28366 from viirya/SPARK-31365.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-05-06 04:50:06 +00:00
Max Gekk bd26429931 [SPARK-31641][SQL] Fix days conversions by JSON legacy parser
### What changes were proposed in this pull request?
Perform days rebasing while converting days from JSON string field. In Spark 2.4 and earlier versions, the days are interpreted as days since the epoch in the hybrid calendar (Julian + Gregorian since 1582-10-15). Since Spark 3.0, the base calendar was switched to Proleptic Gregorian calendar, so, the days should be rebased to represent the same local date.

### Why are the changes needed?
The changes fix a bug and restore compatibility with Spark 2.4 in which:
```scala
scala> spark.read.schema("d date").json(Seq("{'d': '-141704'}").toDS).show
+----------+
|         d|
+----------+
|1582-01-01|
+----------+
```

### Does this PR introduce _any_ user-facing change?
Yes.

Before:
```scala
scala> spark.read.schema("d date").json(Seq("{'d': '-141704'}").toDS).show
+----------+
|         d|
+----------+
|1582-01-11|
+----------+
```

After:
```scala
scala> spark.read.schema("d date").json(Seq("{'d': '-141704'}").toDS).show
+----------+
|         d|
+----------+
|1582-01-01|
+----------+
```

### How was this patch tested?
Add a test to `JsonSuite`.

Closes #28453 from MaxGekk/json-rebase-legacy-days.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-05-05 14:15:31 +00:00
Max Gekk bef5828e12 [SPARK-31630][SQL] Fix perf regression by skipping timestamps rebasing after some threshold
### What changes were proposed in this pull request?
Skip timestamp rebasing after a global threshold when there is no difference between the Julian and Gregorian calendars. This avoids checking the hash maps of switch points and fixes perf regressions in `toJavaTimestamp()` and `fromJavaTimestamp()`.
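
A toy sketch of the fast-path idea (all switch points and offsets are invented for illustration; not the actual `RebaseDateTime` data):

```scala
object RebaseSketch {
  // Past the last switch point the two calendars agree, so rebasing is a no-op.
  private val switchMicros = Array(-3600000000L, 0L)  // toy thresholds, ascending
  private val diffMicros   = Array( 3600000000L, 0L)  // toy offsets per segment
  val lastSwitchMicros: Long = switchMicros.last

  def rebase(micros: Long): Long =
    if (micros >= lastSwitchMicros) micros             // common case: no lookup at all
    else {
      var i = switchMicros.length - 1
      while (i > 0 && micros < switchMicros(i)) i -= 1 // find the applicable segment
      micros + diffMicros(i)
    }
}
```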

### Why are the changes needed?
The changes fix perf regressions of conversions to/from external type `java.sql.Timestamp`.

Before (see the PR's results https://github.com/apache/spark/pull/28440):
```
================================================================================================
Conversion from/to external types
================================================================================================

OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws
Intel(R) Xeon(R) CPU E5-2670 v2  2.50GHz
To/from Java's date-time:                 Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
From java.sql.Timestamp                             376            388          10         13.3          75.2       1.1X
Collect java.sql.Timestamp                         1878           1937          64          2.7         375.6       0.2X
```

After:
```
================================================================================================
Conversion from/to external types
================================================================================================

OpenJDK 64-Bit Server VM 1.8.0_252-8u252-b09-1~18.04-b09 on Linux 4.15.0-1063-aws
Intel(R) Xeon(R) CPU E5-2670 v2  2.50GHz
To/from Java's date-time:                 Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
From java.sql.Timestamp                             249            264          24         20.1          49.8       1.7X
Collect java.sql.Timestamp                         1503           1523          24          3.3         300.5       0.3X
```

Perf improvements in average of:

1. From java.sql.Timestamp is ~ 34%
2. To java.sql.Timestamps is ~16%

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By existing test suites `DateTimeUtilsSuite` and `RebaseDateTimeSuite`.

Closes #28441 from MaxGekk/opt-rebase-common-threshold.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-05-05 14:11:53 +00:00
Max Gekk 372ccba063 [SPARK-31639] Revert SPARK-27528 Use Parquet logical type TIMESTAMP_MICROS by default
### What changes were proposed in this pull request?
This reverts commit 43a73e387c. It sets `INT96` as the timestamp type while saving timestamps to parquet files.

### Why are the changes needed?
To be compatible with Hive and Presto that don't support the `TIMESTAMP_MICROS` type in current stable releases.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By existing test suites.

Closes #28450 from MaxGekk/parquet-int96.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-05-04 17:27:02 -07:00
Wenchen Fan f72220b8ab [SPARK-31606][SQL] Reduce the perf regression of vectorized parquet reader caused by datetime rebase
### What changes were proposed in this pull request?

Push the rebase logic to the lower level of the parquet vectorized reader, to make the final code more vectorization-friendly.

### Why are the changes needed?

The Parquet vectorized reader is carefully implemented to make it more likely to be vectorized by the JVM. However, the newly added datetime rebase degrades the performance a lot, as it breaks vectorization, even if the datetime values don't need to be rebased (this is very likely, as dates before 1582 are rare).

### Does this PR introduce any user-facing change?

no

### How was this patch tested?

Run part of the `DateTimeRebaseBenchmark` locally. The results:
before this patch
```
[info] Load dates from parquet:                  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] after 1582, vec on, rebase off                     2677           2838         142         37.4          26.8       1.0X
[info] after 1582, vec on, rebase on                      3828           4331         805         26.1          38.3       0.7X
[info] before 1582, vec on, rebase off                    2903           2926          34         34.4          29.0       0.9X
[info] before 1582, vec on, rebase on                     4163           4197          38         24.0          41.6       0.6X

[info] Load timestamps from parquet:             Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] after 1900, vec on, rebase off                     3537           3627         104         28.3          35.4       1.0X
[info] after 1900, vec on, rebase on                      6891           7010         105         14.5          68.9       0.5X
[info] before 1900, vec on, rebase off                    3692           3770          72         27.1          36.9       1.0X
[info] before 1900, vec on, rebase on                     7588           7610          30         13.2          75.9       0.5X
```

After this patch
```
[info] Load dates from parquet:                  Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] after 1582, vec on, rebase off                     2758           2944         197         36.3          27.6       1.0X
[info] after 1582, vec on, rebase on                      2908           2966          51         34.4          29.1       0.9X
[info] before 1582, vec on, rebase off                    2840           2878          37         35.2          28.4       1.0X
[info] before 1582, vec on, rebase on                     3407           3433          24         29.4          34.1       0.8X

[info] Load timestamps from parquet:             Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] after 1900, vec on, rebase off                     3861           4003         139         25.9          38.6       1.0X
[info] after 1900, vec on, rebase on                      4194           4283          77         23.8          41.9       0.9X
[info] before 1900, vec on, rebase off                    3849           3937          79         26.0          38.5       1.0X
[info] before 1900, vec on, rebase on                     7512           7546          55         13.3          75.1       0.5X
```

The date type is ~30% faster if the values don't need to be rebased, and ~20% faster if they do.
The timestamp type is ~60% faster if the values don't need to be rebased, with no difference if they do.

Closes #28406 from cloud-fan/perf.

Lead-authored-by: Wenchen Fan <wenchen@databricks.com>
Co-authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-05-04 15:30:10 +09:00
Pablo Langa 4fecc20f6e [SPARK-31500][SQL] collect_set() of BinaryType returns duplicate elements
### What changes were proposed in this pull request?

The collect_set() aggregate function should produce a set of distinct elements. When the column argument's type is BinaryType, this is not the case.

Example:
```scala
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window

case class R(id: String, value: String, bytes: Array[Byte])
def makeR(id: String, value: String) = R(id, value, value.getBytes)
val df = Seq(makeR("a", "dog"), makeR("a", "cat"), makeR("a", "cat"), makeR("b", "fish")).toDF()
// In the example below "bytesSet" erroneously has duplicates but "stringSet" does not (as expected).
df.agg(collect_set('value) as "stringSet", collect_set('bytes) as "bytesSet").show(truncate=false)
// The same problem is displayed when using window functions.
val win = Window.partitionBy('id).rowsBetween(Window.unboundedPreceding, Window.unboundedFollowing)
val result = df.select(
  collect_set('value).over(win) as "stringSet",
  collect_set('bytes).over(win) as "bytesSet"
)
.select('stringSet, 'bytesSet, size('stringSet) as "stringSetSize", size('bytesSet) as "bytesSetSize")
.show()
```

We use a HashSet buffer to accumulate the results. The problem is that array equality in Scala doesn't behave as expected: arrays are just plain Java arrays, and their equality doesn't compare the contents of the arrays
Array(1, 2, 3) == Array(1, 2, 3)  => false
The result is that duplicates are not removed in the HashSet.

The proposed solution is that in the last stage, when we have all the data in the HashSet buffer, we remove duplicates by changing the type of the elements and then transforming them back to the original type.
This transformation is only applied when we have a BinaryType (see the sketch below).
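A standalone sketch of the idea (plain Scala, not the actual Spark implementation):
```scala
// Plain Array[Byte] relies on reference equality, so a HashSet keeps duplicates.
// Convert to a structural-equality container, deduplicate, then convert back.
val buffer = Seq(Array[Byte](1, 2), Array[Byte](1, 2), Array[Byte](3))
val deduped = buffer.map(_.toSeq).distinct.map(_.toArray)
assert(deduped.length == 2)
```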

### Why are the changes needed?
Fix the bug explained

### Does this PR introduce any user-facing change?
Yes. Now `collect_set()` correctly deduplicates arrays of bytes.

### How was this patch tested?
Unit testing

Closes #28351 from planga82/feature/SPARK-31500_COLLECT_SET_bug.

Authored-by: Pablo Langa <soypab@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-05-01 22:09:04 +09:00
Yuanjian Li aec8b69435 [SPARK-28424][TESTS][FOLLOW-UP] Add test cases for all interval units
### What changes were proposed in this pull request?
Add test cases covering all interval units:  MICROSECOND MILLISECOND SECOND MINUTE HOUR DAY WEEK MONTH YEAR

### Why are the changes needed?
For test coverage.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Test only.

Closes #28418 from xuanyuanking/SPARK-28424.

Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-05-01 10:32:37 +09:00
Max Gekk c09cfb9808 [SPARK-31557][SQL] Fix timestamps rebasing in legacy parsers
### What changes were proposed in this pull request?
In the PR, I propose to fix two legacy timestamp formatters, `LegacySimpleTimestampFormatter` and `LegacyFastTimestampFormatter`, to perform micros rebasing when parsing/formatting from/to strings.

### Why are the changes needed?
Legacy timestamp formatters operate on the hybrid calendar (Julian + Gregorian), so the input micros should be rebased to have the same date-time fields as in the Proleptic Gregorian calendar used by Spark SQL; see SPARK-26651.

### Does this PR introduce any user-facing change?
Yes

### How was this patch tested?
Added tests to `TimestampFormatterSuite`

Closes #28408 from MaxGekk/fix-rebasing-in-legacy-timestamp-formatter.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-30 12:45:32 +00:00
Wenchen Fan 636119c54b [SPARK-31607][SQL] Improve the perf of CTESubstitution
### What changes were proposed in this pull request?

In `CTESubstitution`, resolve CTE relations first, then traverse the main plan only once to substitute CTE relations.

### Why are the changes needed?

Currently we will traverse the main query many times (if there are many CTE relations), which can be pretty slow if the main query is large.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

local perf test
```
scala> :pa
// Entering paste mode (ctrl-D to finish)

def test(i: Int): Unit = 1.to(i).foreach { _ =>
  spark.sql("""
    with
    t1 as (select 1),
    t2 as (select 1),
    t3 as (select 1),
    t4 as (select 1),
    t5 as (select 1),
    t6 as (select 1),
    t7 as (select 1),
    t8 as (select 1),
    t9 as (select 1)
    select * from t1, t2, t3, t4, t5, t6, t7, t8, t9""").queryExecution.assertAnalyzed()
}

// Exiting paste mode, now interpreting.

test: (i: Int)Unit

scala> test(10000)

scala> println(org.apache.spark.sql.catalyst.rules.RuleExecutor.dumpTimeSpent)
```

The result before this patch
```
Rule                                       Effective Time / Total Time                     Effective Runs / Total Runs
CTESubstitution                            3328796344 / 3924576425                         10000 / 20000
```
The result after this patch
```
Rule                                       Effective Time / Total Time                     Effective Runs / Total Runs
CTESubstitution                            1503085936 / 2091992092                         10000 / 20000
```
About 2 times faster.

Closes #28407 from cloud-fan/cte.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-30 12:11:16 +00:00
Kent Yao 9241f8282f [SPARK-31586][SQL][FOLLOWUP] Restore SQL string for datetime - interval operations
### What changes were proposed in this pull request?

Because of ebc8fa50d0 and beec8d535f, the SQL output strings for date/timestamp - interval operations have a malformed format, such as `struct<dateval:date,dateval + (- INTERVAL '2 years 2 months').....`

This PR restores the previous behavior by adding a `RuntimeReplaceable` implementation for both of the operations to get their pretty SQL strings back.

### Why are the changes needed?

Restore the pretty SQL strings for datetime operations.

### Does this PR introduce any user-facing change?

No, we are restoring the previous behavior here.

### How was this patch tested?

added unit tests

Closes #28402 from yaooqinn/SPARK-31586-F.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-30 03:31:29 +00:00
Max Gekk 73eac7565d [SPARK-31557][SQL][TESTS][FOLLOWUP] Check rebasing in all legacy formatters
### What changes were proposed in this pull request?
- Check all available legacy formats in the tests added by https://github.com/apache/spark/pull/28345
- Check date rebasing in legacy parsers for only one direction: either days -> string or string -> days.

### Why are the changes needed?
Round-trip tests can hide issues in date rebasing. For example, if we remove rebasing from the legacy parsers (from `parse()` and `format()`), the tests will still pass.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
By running `DateFormatterSuite`.

Closes #28398 from MaxGekk/test-rebasing-in-legacy-date-formatter.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-29 07:19:34 +00:00
Takeshi Yamamuro 97f2c03d3b
[SPARK-31594][SQL] Do not display the seed of rand/randn with no argument in output schema
### What changes were proposed in this pull request?

This PR intends to update `sql` in `Rand`/`Randn` with no argument to make a column name deterministic.

Before this PR (a column name changes run-by-run):
```
scala> sql("select rand()").show()
+-------------------------+
|rand(7986133828002692830)|
+-------------------------+
|       0.9524061403696937|
+-------------------------+
```
After this PR (a column name fixed):
```
scala> sql("select rand()").show()
+------------------+
|            rand()|
+------------------+
|0.7137935639522275|
+------------------+

// If a seed given, it is still shown in a column name
// (the same with the current behaviour)
scala> sql("select rand(1)").show()
+------------------+
|           rand(1)|
+------------------+
|0.6363787615254752|
+------------------+

// We can still check a seed in explain output:
scala> sql("select rand()").explain()
== Physical Plan ==
*(1) Project [rand(-2282124938778456838) AS rand()#0]
+- *(1) Scan OneRowRelation[]
```

Note: This fix comes from #28194; the ongoing PR tests the output schema of expressions, so their schemas must be deterministic for the tests.

### Why are the changes needed?

To make output schema deterministic.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Added unit tests.

Closes #28392 from maropu/SPARK-31594.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-04-29 00:14:50 -07:00
Terry Kim 36803031e8 [SPARK-30282][SQL][FOLLOWUP] SHOW TBLPROPERTIES should support views
### What changes were proposed in this pull request?

This PR addresses two things:
- `SHOW TBLPROPERTIES` should support views (a regression introduced by #26921).
- `SHOW TBLPROPERTIES` on a temporary view should return an empty result (the 2.4 behavior) instead of throwing `AnalysisException`.

### Why are the changes needed?

It's a bug.

### Does this PR introduce any user-facing change?

Yes, now `SHOW TBLPROPERTIES` works on views:
```
scala> sql("CREATE VIEW view TBLPROPERTIES('p1'='v1', 'p2'='v2') AS SELECT 1 AS c1")
scala> sql("SHOW TBLPROPERTIES view").show(truncate=false)
+---------------------------------+-------------+
|key                              |value        |
+---------------------------------+-------------+
|view.catalogAndNamespace.numParts|2            |
|view.query.out.col.0             |c1           |
|view.query.out.numCols           |1            |
|p2                               |v2           |
|view.catalogAndNamespace.part.0  |spark_catalog|
|p1                               |v1           |
|view.catalogAndNamespace.part.1  |default      |
+---------------------------------+-------------+
```
And for a temporary view:
```
scala> sql("CREATE TEMPORARY VIEW tview TBLPROPERTIES('p1'='v1', 'p2'='v2') AS SELECT 1 AS c1")
scala> sql("SHOW TBLPROPERTIES tview").show(truncate=false)
+---+-----+
|key|value|
+---+-----+
+---+-----+
```

### How was this patch tested?

Added tests.

Closes #28375 from imback82/show_tblproperties_followup.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-29 07:06:45 +00:00
Kent Yao ea525fe8c0 [SPARK-31597][SQL] extracting day from intervals should be interval.days + days in interval.microsecond
### What changes were proposed in this pull request?

With suggestion from cloud-fan https://github.com/apache/spark/pull/28222#issuecomment-620586933

I checked with both Presto and PostgreSQL: one implements intervals with ANSI-style year-month/day-time, and the other is mixed and non-ANSI. They both add the full days carried in the interval's time part to the total days when extracting the day from interval values.

```sql

presto> SELECT EXTRACT(DAY FROM (cast('2020-01-15 00:00:00' as timestamp) - cast('2020-01-01 00:00:00' as timestamp)));
_col0
-------
14
(1 row)

Query 20200428_135239_00000_ahn7x, FINISHED, 1 node
Splits: 17 total, 17 done (100.00%)
0:01 [0 rows, 0B] [0 rows/s, 0B/s]

presto> SELECT EXTRACT(DAY FROM (cast('2020-01-15 00:00:00' as timestamp) - cast('2020-01-01 00:00:01' as timestamp)));
_col0
-------
13
(1 row)

Query 20200428_135246_00001_ahn7x, FINISHED, 1 node
Splits: 17 total, 17 done (100.00%)
0:00 [0 rows, 0B] [0 rows/s, 0B/s]

presto>

```

```sql

postgres=# SELECT EXTRACT(DAY FROM (cast('2020-01-15 00:00:00' as timestamp) - cast('2020-01-01 00:00:00' as timestamp)));
date_part
-----------
14
(1 row)

postgres=# SELECT EXTRACT(DAY FROM (cast('2020-01-15 00:00:00' as timestamp) - cast('2020-01-01 00:00:01' as timestamp)));
date_part
-----------
13

```

```
spark-sql> SELECT EXTRACT(DAY FROM (cast('2020-01-15 00:00:00' as timestamp) - cast('2020-01-01 00:00:01' as timestamp)));
0
spark-sql> SELECT EXTRACT(DAY FROM (cast('2020-01-15 00:00:00' as timestamp) - cast('2020-01-01 00:00:00' as timestamp)));
0
```

In the ANSI standard, a day is exactly 24 hours, so we don't need to worry about the conceptual day for interval extraction. The conceptual day only takes effect when we add the interval to a zoned timestamp value.

### Why are the changes needed?

This satisfies both the ANSI standard and common use cases in modern SQL platforms.

### Does this PR introduce any user-facing change?

No, this is new in 3.0.

### How was this patch tested?

Added more unit tests.

Closes #28396 from yaooqinn/SPARK-31597.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-29 06:56:33 +00:00
Max Gekk 86761861c2 [SPARK-31563][SQL][FOLLOWUP] Create literals directly from Catalyst's internal value in InSet.sql
### What changes were proposed in this pull request?
In the PR, I propose to simplify the code of `InSet.sql` and create `Literal` instances directly from Catalyst's internal values by using the default `Literal` constructor.

### Why are the changes needed?
This simplifies code and avoids unnecessary conversions to external types.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
By existing test `SPARK-31563: sql of InSet for UTF8String collection` in `ColumnExpressionSuite`.

Closes #28399 from MaxGekk/fix-InSet-sql-followup.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-29 06:44:22 +00:00
Kent Yao beec8d535f [SPARK-31586][SQL] Replace expression TimeSub(l, r) with TimeAdd(l -r)
### What changes were proposed in this pull request?

The implementation of TimeSub for subtracting an interval from a timestamp is almost a duplicate of TimeAdd. We can replace it with TimeAdd(l, -r) since they are equivalent (see the sketch below).
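A quick SQL-level illustration of the equivalence (values are illustrative):
```scala
// Both queries yield the same timestamp; the second is the rewritten form.
spark.sql("SELECT timestamp'2020-04-28 00:00:00' - interval 1 day").show()
spark.sql("SELECT timestamp'2020-04-28 00:00:00' + (- interval 1 day)").show()
```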

Suggestion from https://github.com/apache/spark/pull/28310#discussion_r414259239

Besides, the coercion rules for TimeAdd/TimeSub(date, interval) are no longer useful, so they are removed in this PR since the related code is touched here anyway.

### Why are the changes needed?

remove redundant and useless code for easy maintenance

### Does this PR introduce any user-facing change?

Yes, the SQL string of `datetime - interval` becomes `datetime + (- interval)`.

### How was this patch tested?

modified existing unit tests.

Closes #28381 from yaooqinn/SPARK-31586.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-28 14:01:07 +00:00
Yuanjian Li 6ed2dfbba1 [SPARK-31519][SQL] Cast in having aggregate expressions returns the wrong result
### What changes were proposed in this pull request?
Add a new logical node AggregateWithHaving, and the parser should create this plan for HAVING. The analyzer resolves it to Filter(..., Aggregate(...)).

### Why are the changes needed?
The SQL parser in Spark creates Filter(..., Aggregate(...)) for the HAVING query, and Spark has a special analyzer rule ResolveAggregateFunctions to resolve the aggregate functions and grouping columns in the Filter operator.

It works for simple cases in a very tricky way as it relies on rule execution order:
1. Rule ResolveReferences hits the Aggregate operator and resolves attributes inside aggregate functions, but the function itself is still unresolved as it's an UnresolvedFunction. This stops resolving the Filter operator as the child Aggregate operator is still unresolved.
2. Rule ResolveFunctions resolves UnresolvedFunction. This makes the Aggregate operator resolved.
3. Rule ResolveAggregateFunctions resolves the Filter operator if its child is a resolved Aggregate. This rule can correctly resolve the grouping columns.

In the example query, I put a CAST, which needs to be resolved by rule ResolveTimeZone, which runs after ResolveAggregateFunctions. This breaks step 3 as the Aggregate operator is unresolved at that time. Then the analyzer starts the next round and the Filter operator is resolved by ResolveReferences, which wrongly resolves the grouping columns.

See the demo below:
```
SELECT SUM(a) AS b, '2020-01-01' AS fake FROM VALUES (1, 10), (2, 20) AS T(a, b) GROUP BY b HAVING b > 10
```
The query's result is
```
+---+----------+
|  b|      fake|
+---+----------+
|  2|2020-01-01|
+---+----------+
```
But if we add CAST, it will return an empty result.
```
SELECT SUM(a) AS b, CAST('2020-01-01' AS DATE) AS fake FROM VALUES (1, 10), (2, 20) AS T(a, b) GROUP BY b HAVING b > 10
```

### Does this PR introduce any user-facing change?
Yes, bug fix for cast in having aggregate expressions.

### How was this patch tested?
New UT added.

Closes #28294 from xuanyuanking/SPARK-31519.

Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-28 08:11:41 +00:00
Wenchen Fan 2f4f38b6f1
[SPARK-31577][SQL] Fix case-sensitivity and forward name conflict problems when check name conflicts of CTE relations
### What changes were proposed in this pull request?

This is a followup of https://github.com/apache/spark/pull/28318, to make the code more readable, by adding some comments to explain the trick and simplify the code to use a boolean flag instead of 2 string sets.

This PR also fixes various problems:
1. the name check should consider case sensitivity
2. a forward name conflict like `with t as (with t2 as ...), t2 as ...` is not a real conflict and we shouldn't fail.

### Why are the changes needed?

correct the behavior

### Does this PR introduce any user-facing change?

Yes, it fixes the aforementioned behaviors.

### How was this patch tested?

new tests

Closes #28371 from cloud-fan/followup.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-04-27 16:47:39 -07:00
yi.wu 7df658414b [SPARK-31529][SQL] Remove extra whitespaces in formatted explain
### What changes were proposed in this pull request?

Remove all the extra whitespaces in the formatted explain.

### Why are the changes needed?

The number of extra whitespaces in the formatted explain differs between master and branch-3.0. This causes a problem: whenever we backport formatted-explain-related tests from master to branch-3.0, they fail on branch-3.0. Besides, extra whitespaces are always disallowed in Spark. Thus, we should remove them wherever we can.

### Does this PR introduce any user-facing change?

No, formatted explain is newly added in Spark 3.0.

### How was this patch tested?

Updated sql query tests.

Closes #28315 from Ngone51/fix_extra_spaces.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-27 07:39:24 +00:00
Kent Yao ebc8fa50d0 [SPARK-31527][SQL] date add/subtract interval only allow those day precision in ansi mode
### What changes were proposed in this pull request?

To follow ANSI, the expressions `date + interval`, `interval + date` and `date - interval` should only accept intervals whose `microseconds` part is 0 (see the sketch below).
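A sketch of the intended behavior, assuming ANSI mode is switched on via `spark.sql.ansi.enabled`:
```scala
// With ANSI mode on, only day-precision intervals are accepted for date +/- interval.
spark.conf.set("spark.sql.ansi.enabled", "true")
spark.sql("SELECT date'2020-04-27' + interval 2 days").show()   // accepted: day precision
spark.sql("SELECT date'2020-04-27' + interval 2 hours").show()  // rejected: sub-day precision
```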

### Why are the changes needed?

Better ANSI compliance

### Does this PR introduce any user-facing change?

No, this PR should target 3.0.0 in which this feature is newly added.

### How was this patch tested?

add more unit tests

Closes #28310 from yaooqinn/SPARK-31527.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-27 05:28:46 +00:00
Bruce Robbins a911287244 [SPARK-31557][SQL] Legacy time parser should return Gregorian days rather than Julian days
### What changes were proposed in this pull request?

This PR modifies LegacyDateFormatter#parse to return proleptic Gregorian days rather than hybrid Julian days.

### Why are the changes needed?

The legacy time parser currently returns epoch days in the hybrid Julian calendar. However, the callers to the legacy parser (e.g., UnivocityParser, JacksonParser) expect epoch days in the proleptic Gregorian calendar. As a result, pre-Gregorian dates like '1000-01-01' get interpreted as '1000-01-06'.
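A standalone JDK-only sketch of the calendar mismatch (not Spark code); the 5-day offset around the year 1000 is exactly what shifts '1000-01-01' to '1000-01-06':
```scala
// Epoch day of 1000-01-01 in the proleptic Gregorian vs. the hybrid Julian/Gregorian calendar.
import java.time.LocalDate
import java.util.{GregorianCalendar, TimeZone}

val prolepticDays = LocalDate.of(1000, 1, 1).toEpochDay        // proleptic Gregorian calendar

val cal = new GregorianCalendar(TimeZone.getTimeZone("UTC"))
cal.clear()
cal.set(1000, 0, 1)                                            // hybrid calendar (Julian before 1582-10-15)
val hybridDays = Math.floorDiv(cal.getTimeInMillis, 86400000L)

println(hybridDays - prolepticDays)                            // 5: same civil date, 5 days later on the timeline
```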

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Manual testing and modified existing unit tests.

Closes #28345 from bersprockets/SPARK-31557.

Authored-by: Bruce Robbins <bersprockets@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-27 05:00:36 +00:00
Peter Toth 4f53bfbbd5
[SPARK-31535][SQL] Fix nested CTE substitution
### What changes were proposed in this pull request?

This PR fixes a CTE substitution issue so as to the following SQL return the correct empty result:
```
WITH t(c) AS (SELECT 1)
SELECT * FROM t
WHERE c IN (
  WITH t(c) AS (SELECT 2)
  SELECT * FROM t
)
```
Before this PR the result was `1`.

### Why are the changes needed?
To fix a correctness issue.

### Does this PR introduce any user-facing change?
Yes, fixes a correctness issue.

### How was this patch tested?
Added new test case.

Closes #28318 from peter-toth/SPARK-31535-fix-nested-cte-substitution.

Authored-by: Peter Toth <peter.toth@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-04-26 15:31:32 -07:00
Takeshi Yamamuro e01125db0d
[SPARK-31562][SQL] Update ExpressionDescription for substring, current_date, and current_timestamp
### What changes were proposed in this pull request?

This PR intends to add entries for substring, current_date, and current_timestamp in the SQL built-in function documents. Specifically, the entries are as follows;

 - SELECT current_date;
 - SELECT current_timestamp;
 - SELECT substring('abcd' FROM 1);
 - SELECT substring('abcd' FROM 1 FOR 2);

### Why are the changes needed?

To make the SQL (built-in functions) references complete.

### Does this PR introduce any user-facing change?

<img width="1040" alt="Screen Shot 2020-04-25 at 16 51 07" src="https://user-images.githubusercontent.com/692303/80274851-6ca5ee00-8718-11ea-9a35-9ae82008cb4b.png">

<img width="974" alt="Screen Shot 2020-04-25 at 17 24 24" src="https://user-images.githubusercontent.com/692303/80275032-a88d8300-8719-11ea-92ec-95b80169ae28.png">

<img width="862" alt="Screen Shot 2020-04-25 at 17 27 48" src="https://user-images.githubusercontent.com/692303/80275114-36696e00-871a-11ea-8e39-02e93eabb92f.png">

### How was this patch tested?

Added test examples.

Closes #28342 from maropu/SPARK-31562.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-04-26 11:46:52 -07:00
Max Gekk 7d8216a664
[SPARK-31563][SQL] Fix failure of InSet.sql for collections of Catalyst's internal types
### What changes were proposed in this pull request?
In the PR, I propose to fix the `InSet.sql` method for the cases when the input collection contains values of Catalyst's internal types, for instance `UTF8String`. Elements of the input set `hset` are converted to Scala types and wrapped by `Literal` to properly form the SQL view of the input collection.

### Why are the changes needed?
The changes fix the bug in `InSet.sql` that makes a wrong assumption about the types of the collection elements. See more details in SPARK-31563.

### Does this PR introduce any user-facing change?
Highly likely, not.

### How was this patch tested?
Added a test to `ColumnExpressionSuite`

Closes #28343 from MaxGekk/fix-InSet-sql.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-04-25 09:29:51 -07:00
Kent Yao f92652d0b5
[SPARK-31528][SQL] Remove millennium, century, decade from trunc/date_trunc fucntions
### What changes were proposed in this pull request?

Similar to https://jira.apache.org/jira/browse/SPARK-31507, millennium, century, and decade are not commonly used in most modern platforms.

For example
Negative:
https://docs.snowflake.com/en/sql-reference/functions-date-time.html#supported-date-and-time-parts
https://prestodb.io/docs/current/functions/datetime.html#date_trunc
https://teradata.github.io/presto/docs/148t/functions/datetime.html#date_trunc
https://www.oracletutorial.com/oracle-date-functions/oracle-trunc/

Positive:
https://docs.aws.amazon.com/redshift/latest/dg/r_Dateparts_for_datetime_functions.html
https://www.postgresql.org/docs/9.1/functions-datetime.html#FUNCTIONS-DATETIME-TRUNC

This PR removes these `fmt`s support for trunc and date_trunc functions.

### Why are the changes needed?

Clean up uncommon datetime units for easier maintenance; we can add them back if they are found to be very useful later.

### Does this PR introduce any user-facing change?
No, targeting 3.0.0; these are newly added in 3.0.0.

### How was this patch tested?

Remove and modify existing unit tests.

Closes #28313 from yaooqinn/SPARK-31528.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-04-24 18:28:41 -07:00
Kent Yao caf3ab8411
[SPARK-31552][SQL] Fix ClassCastException in ScalaReflection arrayClassFor
### What changes were proposed in this pull request?

The two methods `arrayClassFor` and `dataTypeFor` in `ScalaReflection` call each other circularly, but the cases handled in `dataTypeFor` are not fully handled in `arrayClassFor`.

For example:
```scala
scala> implicit def newArrayEncoder[T <: Array[_] : TypeTag]: Encoder[T] = ExpressionEncoder()
newArrayEncoder: [T <: Array[_]](implicit evidence$1: reflect.runtime.universe.TypeTag[T])org.apache.spark.sql.Encoder[T]

scala> val decOne = Decimal(1, 38, 18)
decOne: org.apache.spark.sql.types.Decimal = 1E-18

scala> val decTwo = Decimal(2, 38, 18)
decTwo: org.apache.spark.sql.types.Decimal = 2E-18

scala> val decSpark = Array(decOne, decTwo)
decSpark: Array[org.apache.spark.sql.types.Decimal] = Array(1E-18, 2E-18)

scala> Seq(decSpark).toDF()
java.lang.ClassCastException: org.apache.spark.sql.types.DecimalType cannot be cast to org.apache.spark.sql.types.ObjectType
  at org.apache.spark.sql.catalyst.ScalaReflection$.$anonfun$arrayClassFor$1(ScalaReflection.scala:131)
  at scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:69)
  at org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects(ScalaReflection.scala:879)
  at org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects$(ScalaReflection.scala:878)
  at org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:49)
  at org.apache.spark.sql.catalyst.ScalaReflection$.arrayClassFor(ScalaReflection.scala:120)
  at org.apache.spark.sql.catalyst.ScalaReflection$.$anonfun$dataTypeFor$1(ScalaReflection.scala:105)
  at scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:69)
  at org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects(ScalaReflection.scala:879)
  at org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects$(ScalaReflection.scala:878)
  at org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:49)
  at org.apache.spark.sql.catalyst.ScalaReflection$.dataTypeFor(ScalaReflection.scala:88)
  at org.apache.spark.sql.catalyst.ScalaReflection$.$anonfun$serializerForType$1(ScalaReflection.scala:399)
  at scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:69)
  at org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects(ScalaReflection.scala:879)
  at org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects$(ScalaReflection.scala:878)
  at org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:49)
  at org.apache.spark.sql.catalyst.ScalaReflection$.serializerForType(ScalaReflection.scala:393)
  at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:57)
  at newArrayEncoder(<console>:57)
  ... 53 elided

scala>
```

In this PR, we add the missing cases to `arrayClassFor`

### Why are the changes needed?

bugfix as described above

### Does this PR introduce any user-facing change?

no

### How was this patch tested?

add a test for array encoders

Closes #28324 from yaooqinn/SPARK-31552.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-04-24 18:04:26 -07:00
Yuming Wang b10263b8e5 [SPARK-30724][SQL] Support 'LIKE ANY' and 'LIKE ALL' operators
### What changes were proposed in this pull request?

The `LIKE ANY/SOME` and `LIKE ALL` operators are mostly used when matching a text field against a number of patterns. For example:

Teradata / Hive 3.0 / Snowflake:
```sql
--like any
select 'foo' LIKE ANY ('%foo%','%bar%');

--like all
select 'foo' LIKE ALL ('%foo%','%bar%');
```
PostgreSQL:
```sql
-- like any
select 'foo' LIKE ANY (array['%foo%','%bar%']);

-- like all
select 'foo' LIKE ALL (array['%foo%','%bar%']);
```

This PR adds support for these two operators (see the usage sketch below).
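A quick usage sketch in Spark SQL after this change:
```scala
// LIKE ANY matches if at least one pattern matches; LIKE ALL requires every pattern to match.
spark.sql("SELECT 'foo' LIKE ANY ('%foo%', '%bar%')").show()  // true
spark.sql("SELECT 'foo' LIKE ALL ('%foo%', '%bar%')").show()  // false
```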

More details:
https://docs.teradata.com/reader/756LNiPSFdY~4JcCCcR5Cw/4~AyrPNmDN0Xk4SALLo6aQ
https://issues.apache.org/jira/browse/HIVE-15229
https://docs.snowflake.net/manuals/sql-reference/functions/like_any.html

### Why are the changes needed?

To smoothly migrate SQLs to Spark SQL.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Unit test.

Closes #27477 from wangyum/SPARK-30724.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-04-24 22:20:32 +09:00
Max Gekk 26165427c7 [SPARK-31488][SQL] Support java.time.LocalDate in Parquet filter pushdown
### What changes were proposed in this pull request?
1. Modified `ParquetFilters.valueCanMakeFilterOn()` to accept filters with `java.time.LocalDate` attributes.
2. Modified `ParquetFilters.dateToDays()` to support both types `java.sql.Date` and `java.time.LocalDate` in conversions to days.
3. Add implicit conversion from `LocalDate` to `Expression` (`Literal`).

### Why are the changes needed?
To support pushed-down filters with `java.time.LocalDate` attributes. Before the changes, date filters are not pushed down to the Parquet datasource when `spark.sql.datetime.java8API.enabled` is `true`.
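A usage sketch of the scenario this enables (the path and column name are hypothetical):
```scala
// With the Java 8 datetime API enabled, a LocalDate predicate can now be pushed down to Parquet.
import java.time.LocalDate

spark.conf.set("spark.sql.datetime.java8API.enabled", "true")
val df = spark.read.parquet("/tmp/events")                   // hypothetical dataset with a DATE column "d"
df.filter(df("d") === LocalDate.of(2020, 4, 24)).explain()   // PushedFilters should now include the date predicate
```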

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Added a test to `ParquetFilterSuite`

Closes #28259 from MaxGekk/parquet-filter-java8-date-time.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-24 02:21:53 +00:00
Takeshi Yamamuro 42f496f6ac [SPARK-31526][SQL][TESTS] Add a new test suite for ExpressionInfo
### What changes were proposed in this pull request?

This PR intends to add a new test suite for `ExpressionInfo`. Major changes are as follows;

 - Added a new test suite named `ExpressionInfoSuite`
 - To improve test coverage, added a test for error handling in `ExpressionInfoSuite`
 - Moved the `ExpressionInfo`-related tests from `UDFSuite` to `ExpressionInfoSuite`
 - Moved the related tests from `SQLQuerySuite` to `ExpressionInfoSuite`
 - Added a comment in `ExpressionInfoSuite` (followup of https://github.com/apache/spark/pull/28224)

### Why are the changes needed?

To improve test suites/coverage.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Added tests.

Closes #28308 from maropu/SPARK-31526.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-04-24 11:19:20 +09:00
Yuanjian Li ca90e1932d [SPARK-31515][SQL] Canonicalize Cast should consider the value of needTimeZone
### What changes were proposed in this pull request?
Override the canonicalized fields with respect to the result of `needsTimeZone`.

### Why are the changes needed?
The current approach breaks semantic equality of two cast expressions that are not related to datetime types. If we don't need `timeZone` information to cast the `from` type to the `to` type, then the timeZoneId should not influence the canonicalized result.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
New UT added.

Closes #28288 from xuanyuanking/SPARK-31515.

Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-04-23 14:32:10 +09:00
Kent Yao 3b5792114a [SPARK-31474][SQL][FOLLOWUP] Replace _FUNC_ placeholder with functionname in the note field of expression info
### What changes were proposed in this pull request?

\_FUNC\_ has been used in note() of `ExpressionDescription` since https://github.com/apache/spark/pull/28248, and there can be more cases later, so we should replace it with the function name in the documentation.

### Why are the changes needed?

doc fix

### Does this PR introduce any user-facing change?

no

### How was this patch tested?

pass Jenkins, and verify locally with Jekyll serve

Closes #28305 from yaooqinn/SPARK-31474-F.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-04-23 13:33:04 +09:00
Max Gekk e7856a7902 [MINOR][SQL] Add comments for filters values and return values of Row.get()/apply()
### What changes were proposed in this pull request?
- Document row field values of `DATE` and `TIMESTAMP` type returned by `Row.get()` and `Row.apply`.
- Refer to `Row.get()` from the description of filter values

### Why are the changes needed?
Reflect the current behaviour of Row's methods `apply()` and `get()` in comments to inform users about the different return types, which depend on the SQL config setting `spark.sql.datetime.java8API.enabled`.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Run `$ ./dev/scalastyle`

Closes #28300 from MaxGekk/doc-filter-date-time.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-23 04:23:33 +00:00
Kent Yao 37d2e037ed [SPARK-31507][SQL] Remove uncommon fields support and update some fields with meaningful names for extract function
### What changes were proposed in this pull request?

Extracting millennium, century, decade, millisecond, microsecond and epoch from datetime is neither ANSI standard nor quite common in modern SQL platforms. Most of the systems listed below do not support these, except PostgreSQL and Redshift.

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF

https://docs.oracle.com/cd/B19306_01/server.102/b14200/functions050.htm

https://prestodb.io/docs/current/functions/datetime.html

https://docs.cloudera.com/documentation/enterprise/5-8-x/topics/impala_datetime_functions.html

https://docs.snowflake.com/en/sql-reference/functions-date-time.html#label-supported-date-time-parts

https://www.postgresql.org/docs/9.1/functions-datetime.html#FUNCTIONS-DATETIME-EXTRACT

This PR removes support for these extract fields from the extract function for date and timestamp values.

`isoyear` is PostgreSQL specific but `yearofweek` is more commonly used across platforms
`isodow` is PostgreSQL specific, but `iso` as a suffix is more commonly used across platforms, so `dow_iso` and `dayofweek_iso` are used to replace it.

For historical reasons, we have [`dayofweek`, `dow`] implemented for representing a non-ISO day-of-week and a newly added `isodow` from PostgreSQL for the ISO day-of-week. Many other systems only support one week-numbering system and use either full names or abbreviations. Things in Spark become a little bit complicated:
1. because of the existence of `isodow`, we would need to add an iso-prefix to `dayofweek` to make a pair for it too: [`dayofweek`, `isodayofweek`, `dow` and `isodow`]
2. because `iso`-prefixed systems are rare and more systems choose the `iso`-suffixed way, we may end up with [`dayofweek`, `dayofweekiso`, `dow`, `dowiso`]
3. `dayofweekiso` looks nice and has use cases in the platforms listed above, e.g. Snowflake, but `dowiso` looks weird and no use cases were found.
4. after a discussion in the community, we agreed that an underscore before `iso` looks much better, because `isodow` is new and there is no standard for `iso`-style naming; keeping it simple and clear for end users, with good documentation, is the better option.

Thus, we finally end up with [`dayofweek`, `dow`] for the non-ISO day-of-week system and [`dayofweek_iso`, `dow_iso`] for the ISO system (see the queries below).
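Illustrative queries for the two resulting numbering systems:
```scala
// Non-ISO and ISO day-of-week fields after this change (2009-07-26 is a Sunday).
spark.sql("SELECT extract(dayofweek FROM date'2009-07-26')").show()     // non-ISO: Sunday=1 .. Saturday=7
spark.sql("SELECT extract(dayofweek_iso FROM date'2009-07-26')").show() // ISO: Monday=1 .. Sunday=7
```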

### Why are the changes needed?

Remove some nonstandard and uncommon features as we can add them back if necessary

### Does this PR introduce any user-facing change?

No, we should target this at 3.0.0, and these were added during 3.0.0.

### How was this patch tested?

Remove unused tests

Closes #28284 from yaooqinn/SPARK-31507.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-22 10:24:49 +00:00
Wenchen Fan b209b5f406
[SPARK-31503][SQL] fix the SQL string of the TRIM functions
### What changes were proposed in this pull request?

override the `sql` method of `StringTrim`, `StringTrimLeft` and `StringTrimRight`, to use the standard SQL syntax.
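A quick illustration of the standard TRIM syntax that the regenerated SQL strings follow (runnable in spark-shell):
```scala
// ANSI TRIM syntax: trim both ends, the leading end, or the trailing end.
spark.sql("SELECT trim(BOTH 'x' FROM 'xxSparkxx')").show()     // Spark
spark.sql("SELECT trim(LEADING 'x' FROM 'xxSparkxx')").show()  // Sparkxx
spark.sql("SELECT trim(TRAILING 'x' FROM 'xxSparkxx')").show() // xxSpark
```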

### Why are the changes needed?

The current implementation is wrong: it gives you a SQL string that returns a different result.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

new tests

Closes #28281 from cloud-fan/sql.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-04-21 11:22:18 -07:00
Kent Yao 1985437110 [SPARK-31474][SQL] Consistency between dayofweek/dow in extract exprsession and dayofweek function
### What changes were proposed in this pull request?
```sql
spark-sql> SELECT extract(dayofweek from '2009-07-26');
1
spark-sql> SELECT extract(dow from '2009-07-26');
0
spark-sql> SELECT extract(isodow from '2009-07-26');
7
spark-sql> SELECT dayofweek('2009-07-26');
1
spark-sql> SELECT weekday('2009-07-26');
6
```
Currently, there are 4 types of day-of-week range:
1. the function `dayofweek` (2.3.0) and extracting `dayofweek` (2.4.0) range from Sunday (1) to Saturday (7)
2. extracting `dow` (3.0.0) ranges from Sunday (0) to Saturday (6)
3. extracting `isodow` (3.0.0) ranges from Monday (1) to Sunday (7)
4. the function `weekday` (2.4.0) ranges from Monday (0) to Sunday (6)

Actually, extracting `dayofweek` and `dow` are both derived from PostgreSQL but have different meanings.
https://issues.apache.org/jira/browse/SPARK-23903
https://issues.apache.org/jira/browse/SPARK-28623

In this PR, we make extracting `dow` the same as extracting `dayofweek` and the `dayofweek` function, for historical reasons and to avoid breaking anything.

Also, add more documentation to the extract function to make the extract fields clearer to understand.

### Why are the changes needed?

To ensure consistency.

### Does this PR introduce any user-facing change?

Yes, the docs are updated and extracting `dow` is now the same as `dayofweek`.

### How was this patch tested?

1. modified ut
2. local SQL doc verification
#### before
![image](https://user-images.githubusercontent.com/8326978/79601949-3535b100-811c-11ea-957b-a33d68641181.png)

#### after
![image](https://user-images.githubusercontent.com/8326978/79601847-12a39800-811c-11ea-8ff6-aa329255d099.png)

Closes #28248 from yaooqinn/SPARK-31474.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-21 11:55:33 +00:00
Takeshi Yamamuro e42dbe7cd4 [SPARK-31429][SQL][DOC] Automatically generates a SQL document for built-in functions
### What changes were proposed in this pull request?

This PR intends to add a Python script that generates a SQL document for built-in functions, and the corresponding document in the SQL references.

### Why are the changes needed?

To make SQL references complete.

### Does this PR introduce any user-facing change?

Yes;

![a](https://user-images.githubusercontent.com/692303/79406712-c39e1b80-7fd2-11ea-8b85-9f9cbb6efed3.png)
![b](https://user-images.githubusercontent.com/692303/79320526-eb46a280-7f44-11ea-8639-90b1fb2b8848.png)
![c](https://user-images.githubusercontent.com/692303/79320707-3365c500-7f45-11ea-9984-69ffe800fb87.png)

### How was this patch tested?

Manually checked and added tests.

Closes #28224 from maropu/SPARK-31429.

Lead-authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Co-authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-04-21 10:55:13 +09:00
Wenchen Fan 69f9ee18b6
[SPARK-31452][SQL] Do not create partition spec for 0-size partitions in AQE
### What changes were proposed in this pull request?

This PR skips creating the partition specs in `ShufflePartitionsUtil` for 0-size partitions, which avoids launching unnecessary tasks that do nothing.

### Why are the changes needed?

launching tasks that do nothing is a waste.

### Does this PR introduce any user-facing change?

no

### How was this patch tested?

updated tests

Closes #28226 from cloud-fan/aqe.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-04-20 13:50:07 -07:00
gatorsmile 6c792a79c1 [SPARK-31234][SQL][FOLLOW-UP] ResetCommand should not affect static SQL Configuration
### What changes were proposed in this pull request?
This PR is the follow-up PR of https://github.com/apache/spark/pull/28003

- add a migration guide
- add an end-to-end test case.

### Why are the changes needed?
The original PR made a major behavior change to the user-facing RESET command.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Added a new end-to-end test

Closes #28265 from gatorsmile/spark-31234followup.

Authored-by: gatorsmile <gatorsmile@gmail.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
2020-04-20 13:08:55 -07:00
Max Gekk 88d39e5a89 [SPARK-31385][SQL] Restrict micros rebasing via switch arrays up to 2037 year
### What changes were proposed in this pull request?
1. Generate rebasing arrays for micros up to 2037 in `RebaseDateTimeSuite.generateRebaseJson()`.
2. Exclude 4 time zones from the black list in `generateRebaseJson()`.
3. Re-generate JSON files with rebasing info - `gregorian-julian-rebase-micros.json` and `julian-gregorian-rebase-micros.json`.

### Why are the changes needed?
1. `sun.util.calendar.ZoneInfo` resolves DST after the year 2037 incorrectly. See aa318070b2/jdk/src/share/classes/sun/util/calendar/ZoneInfo.java (L55-L62) . By restricting the rebase arrays to the year 2037, we follow the behaviour of `ZoneInfo`, which uses the DST rules of 2037 for all years beyond 2037.
2. To enable optimization of micros rebasing via switch arrays for the time zones:
    - Asia/Tehran
    - Iran
    - Africa/Casablanca
    - Africa/El_Aaiun

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
By existing test suites `RebaseDateTimeUtils`, `DateTimeUtilsSuite` and `DateFunctionsSuite`.

Closes #28253 from MaxGekk/fix-4-time-zones-rebasing.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-20 06:35:16 +00:00
Takeshi Yamamuro 74aed8cc8b
[SPARK-31476][SQL] Add an ExpressionInfo entry for EXTRACT
### What changes were proposed in this pull request?

This PR intends to add an ExpressionInfo entry for EXTRACT for better documentation.
This PR comes from the comment in https://github.com/apache/spark/pull/21479#discussion_r409900080

### Why are the changes needed?

To make SQL documentations complete.

### Does this PR introduce any user-facing change?

Yes, this PR updates the `Spark SQL, Built-in Functions` page.

### How was this patch tested?

Run the example tests.

Closes #28251 from maropu/AddExtractExpr.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-04-18 13:37:12 -07:00
gatorsmile 6bf5f01a4a [SPARK-31477][SQL] Dump codegen and compile time in BenchmarkQueryTest
### What changes were proposed in this pull request?
This PR is to dump the codegen and compilation time for benchmark query tests.

### Why are the changes needed?
Measure the codegen and compilation time costs in TPC-DS queries

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Manual test in my local laptop:
```
23:13:12.845 WARN org.apache.spark.sql.TPCDSQuerySuite:
=== Metrics of Whole-stage Codegen ===
Total code generation time: 21.275102261 seconds
Total compilation time: 12.223771828 seconds
```

Closes #28252 from gatorsmile/testMastercode.

Authored-by: gatorsmile <gatorsmile@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-04-18 20:59:45 +09:00
Takeshi Yamamuro a7fb330ed3 [SPARK-31468][SQL] Null types should be implicitly casted to Decimal types
### What changes were proposed in this pull request?

This PR intends to fix a bug that occurs when comparing null types to decimal types in master/branch-3.0;
```
scala> Seq(BigDecimal(10)).toDF("v1").selectExpr("v1 = NULL").explain(true)
org.apache.spark.sql.AnalysisException: cannot resolve '(`v1` = NULL)' due to data type mismatch: differing types in '(`v1` = NULL)' (decimal(38,18) and null).; line 1 pos 0;
'Project [(v1#5 = null) AS (v1 = NULL)#7]
+- Project [value#2 AS v1#5]
   +- LocalRelation [value#2]
...
```
The query above passed in v2.4.5.

### Why are the changes needed?

bugfix

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Added tests.

Closes #28241 from maropu/SPARK-31468.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-17 14:11:17 +00:00
Kent Yao 697083c051 [SPARK-31469][SQL] Make extract interval field ANSI compliance
### What changes were proposed in this pull request?

Currently, we can extract `millennium/century/decade/year/quarter/month/week/day/hour/minute/second (with fractions)/millisecond/microseconds` and `epoch` from interval values.

While getting `millennium/century/decade/year`, the result means how much of the interval's `months` part can be converted to that unit; the content of `millennium/century/decade` overlaps with `year` and with each other.

While getting `month/day` and so on, the result means the integral remainder of the previous unit. Here all the units, including `year`, are individual.

So while extracting `year`, `month`, `day`, `hour`, `minute`, `second`, which are ANSI primary datetime units, the semantics are `extracting`, but the others are closer to `transforming`.

While getting epoch, we have to treat a month as 30 days, which deviates from the natural calendar rules we use.

To avoid ambiguity, I suggest we should only support the extract fields defined in ANSI, along with their abbreviations (see the example below).
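For example, the ANSI fields that remain supported for intervals (illustrative queries):
```scala
// Only ANSI extract fields (and their abbreviations) remain supported for interval values.
spark.sql("SELECT extract(YEAR FROM interval '10-2' YEAR TO MONTH)").show()   // 10
spark.sql("SELECT extract(MONTH FROM interval '10-2' YEAR TO MONTH)").show()  // 2
spark.sql("SELECT extract(DAY FROM interval 3 days 4 hours)").show()          // 3
```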

### Why are the changes needed?

Extracting `millennium`, `century`, etc. does not follow the meaning of extracting, and those fields are not useful enough to be worth maintaining.

The `extract` is the ANSI standard expression and `date_part` is its pg-specific alias function. The currently supported extract fields are fully borrowed from PostgreSQL.

Looking at other systems like Presto/Hive, they don't support those ambiguous fields either.

e.g. Hive 2.2.x also takes it from PostgreSQL but without introducing those ambiguous fields https://issues.apache.org/jira/secure/attachment/12828349/HIVE-14579

e.g. presto

```sql
presto> select extract(quater from interval '10-0' year to month);
Query 20200417_094723_00020_m8xq4 failed: line 1:8: Invalid EXTRACT field: quater
select extract(quater from interval '10-0' year to month)

presto> select extract(decade from interval '10-0' year to month);
Query 20200417_094737_00021_m8xq4 failed: line 1:8: Invalid EXTRACT field: decade
select extract(decade from interval '10-0' year to month)

```

### Does this PR introduce any user-facing change?

Yes. As we already have preview releases, this PR will remove support for extracting `millennium/century/decade/quarter/week/millisecond/microseconds` and `epoch` from intervals with the `date_part` function.

### How was this patch tested?

Removed some unused tests.

Closes #28242 from yaooqinn/SPARK-31469.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-17 13:59:02 +00:00
beliefer 1513673f83 [SPARK-30913][SPARK-30841][CORE][SQL][FOLLOWUP] Supplement version information to the configuration of Tests.scala and SQL
### What changes were proposed in this pull request?
I checked all the configs of Spark again and found some new commits that did not add version information.

**Test.scala**
Item name | Since version | JIRA ID | Commit ID | Note
-- | -- | -- | -- | --
spark.testing.skipValidateCores | 3.1.0 | SPARK-29154 | 474b1bb5c2bce2f83c4dd8e19b9b7c5b3aebd6c4#diff-8b4ea8f3b0cc1e7ce7e943de1abbb165 |  

**SQL**
Item name | Since version | JIRA ID | Commit ID | Note
-- | -- | -- | -- | --
spark.sql.legacy.integerGroupingId | 3.1.0 | SPARK-30279 | 71c73d58f6e88d2558ed2e696897767d93bac60f#diff-9a6b543db706f1a90f790783d6930a13 |  

The two configs only exist in the master branch.

### Why are the changes needed?
Supplement version information.

### Does this PR introduce any user-facing change?
'No'.

### How was this patch tested?
Jenkins test.

Closes #28233 from beliefer/sql-conf-version-legacy-integerGroupingId.

Authored-by: beliefer <beliefer@163.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-04-17 17:10:48 +09:00
yi.wu 40f9dbb628 [SPARK-31425][SQL][CORE] UnsafeKVExternalSorter/VariableLengthRowBasedKeyValueBatch should also respect UnsafeAlignedOffset
### What changes were proposed in this pull request?

Make `UnsafeKVExternalSorter`/`VariableLengthRowBasedKeyValueBatch` also respect `UnsafeAlignedOffset` when reading records, and update some out-of-date comments.

### Why are the changes needed?

Since `BytesToBytesMap` respects `UnsafeAlignedOffset` when writing the record, `UnsafeKVExternalSorter` should also respect `UnsafeAlignedOffset` when reading the record from `BytesToBytesMap`; otherwise it will cause data correctness issues.

Unlike `UnsafeKVExternalSorter`, which may read records from `BytesToBytesMap`, `VariableLengthRowBasedKeyValueBatch` writes and reads records by itself. Thus, similar to #22053 and the [comment](https://github.com/apache/spark/pull/22053#issuecomment-411975239) there, the fix for `VariableLengthRowBasedKeyValueBatch` is more likely an improvement for SPARC platform support.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Manually tested `HashAggregationQueryWithControlledFallbackSuite` with `UAO_SIZE=8` to simulate the SPARC platform. The tests only pass with this fix.

Closes #28195 from Ngone51/fix_uao.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-17 04:48:27 +00:00
herman fab4ca5156
[SPARK-31450][SQL] Make ExpressionEncoder thread-safe
### What changes were proposed in this pull request?
This PR moves the `ExpressionEncoder.toRow` and `ExpressionEncoder.fromRow` functions into their own function objects (`ExpressionEncoder.Serializer` & `ExpressionEncoder.Deserializer`). This effectively makes the `ExpressionEncoder` stateless, thread-safe and (more) reusable. The function objects are not thread-safe; however, they are documented as such and should be used in a more limited scope (making it easier to reason about thread safety).
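A usage sketch of the resulting API, assuming the `createSerializer`/`createDeserializer` accessors on a resolved encoder:
```scala
// One shared, stateless encoder; per-thread serializer/deserializer function objects.
import org.apache.spark.sql.catalyst.encoders.ExpressionEncoder

val enc = ExpressionEncoder[(Int, String)]().resolveAndBind()
val toRow = enc.createSerializer()      // not thread-safe: create one per thread
val fromRow = enc.createDeserializer()  // not thread-safe: create one per thread

val row = toRow((1, "a"))
assert(fromRow(row) == (1, "a"))
```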

### Why are the changes needed?
ExpressionEncoders are not thread-safe. We had various (nasty) bugs because of this.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Existing tests.

Closes #28223 from hvanhovell/SPARK-31450.

Authored-by: herman <herman@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-04-16 18:47:46 -07:00
Peter Toth 7ad6ba36f2 [SPARK-30564][SQL] Revert Block.length and use comment placeholders in HashAggregateExec
### What changes were proposed in this pull request?
SPARK-21870 (cb0cddf#diff-06dc5de6163687b7810aa76e7e152a76R146-R149) caused a significant performance regression in cases where the generated source code is fairly large, as `HashAggregateExec` uses `Block.length` to decide on splitting the code. The change in `length` makes sense, as comments and extra new lines shouldn't be taken into account when deciding on splitting, but the regular-expression-based approach is very slow and adds a big relative overhead to cases where the execution is quick (small number of rows).
This PR:
- restores `Block.length` to its original form
- places comments in `HashAggregateExec` with `CodegenContext.registerComment` so as to appear only when comments are enabled (`spark.sql.codegen.comments=true`)

Before this PR:
```
deeply nested struct field r/w:           Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
250 deep x 400 rows (read in-mem)                  1137           1143           8          0.1       11368.3       0.0X
```

After this PR:
```
deeply nested struct field r/w:           Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
250 deep x 400 rows (read in-mem)                   167            180           7          0.6        1674.3       0.1X
```
### Why are the changes needed?
To fix performance regression.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Existing UTs.

Closes #28083 from peter-toth/SPARK-30564-use-comment-placeholders.

Authored-by: Peter Toth <peter.toth@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-04-16 17:52:22 +09:00
Max Gekk c76c31e2c6 [SPARK-31455][SQL] Fix rebasing of not-existed timestamps
### What changes were proposed in this pull request?
In the PR, I propose to change rebasing of not-existed timestamps in the hybrid calendar (Julian + Gregorian since 1582-10-15) in the range [1582-10-05, 1582-10-15). Not existed timestamps from the range are shifted to the first valid date in the hybrid calendar - 1582-10-15. The changes affect only `rebaseGregorianToJulianMicros()` because reverse rebasing from the hybrid timestamps to Proleptic Gregorian timestamps does not have such problem.

The shifting affects only the date part of timestamps while keeping the time part as is. For example:
```
1582-10-10 00:11:22.334455 -> 1582-10-15 00:11:22.334455
```
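A minimal sketch of this shifting rule (illustrative only, not the actual rebasing code):
```scala
import java.time.{LocalDate, LocalDateTime}

// Shift any local timestamp in the non-existent range [1582-10-05, 1582-10-15)
// to the first valid hybrid-calendar date, keeping the time part as is.
def shiftToFirstValidDate(ldt: LocalDateTime): LocalDateTime = {
  val start = LocalDate.of(1582, 10, 5)
  val end   = LocalDate.of(1582, 10, 15)
  val d = ldt.toLocalDate
  if (!d.isBefore(start) && d.isBefore(end)) ldt.`with`(end) else ldt
}

shiftToFirstValidDate(LocalDateTime.of(1582, 10, 10, 0, 11, 22))  // -> 1582-10-15T00:11:22
```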

### Why are the changes needed?
Currently, non-existent timestamps are shifted by the standard difference between the Julian and Gregorian calendars on 1582-10-04, for example 1582-10-14 00:00:00 -> 1582-10-24 00:00:00. That contradicts the shifting of non-existent dates in other cases, for example:
```
scala> sql("select timestamp'1990-9-31 12:12:12'").show
+----------------------------------+
|TIMESTAMP('1990-10-01 12:12:12.0')|
+----------------------------------+
|               1990-10-01 12:12:12|
+----------------------------------+
```

### Does this PR introduce any user-facing change?
Yes, this impacts the conversion of Spark SQL `TIMESTAMP` values to external timestamps based on the non-Proleptic Gregorian calendar. For example, while saving the timestamp 1582-10-14 12:13:14 to ORC files, it will be shifted to the next valid date, 1582-10-15 12:13:14.

### How was this patch tested?
- Added tests to `RebaseDateTimeSuite` and to `OrcSourceSuite`
- By existing test suites `DateTimeUtilsSuite`, `DateFunctionsSuite`, `DateExpressionsSuite`, `CollectionExpressionsSuite`, `ParquetIOSuite`.

Closes #28227 from MaxGekk/fix-not-exist-timestamps.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-16 02:54:38 +00:00
Max Gekk 2b10d70bad [SPARK-31423][SQL] Fix rebasing of not-existed dates
### What changes were proposed in this pull request?
In the PR, I propose to change the rebasing of non-existent dates in the hybrid calendar (Julian + Gregorian since 1582-10-15) in the range (1582-10-04, 1582-10-15). Non-existent dates from that range are shifted to the first valid date in the hybrid calendar - 1582-10-15. The changes affect only `rebaseGregorianToJulianDays()` because the reverse rebasing from hybrid dates to Proleptic Gregorian dates does not have this problem.

### Why are the changes needed?
Currently, non-existent dates are shifted by the standard difference between the Julian and Gregorian calendars on 1582-10-04, for example 1582-10-14 -> 1582-10-24. That contradicts the shifting of non-existent dates in other cases, for example:
```
scala> sql("select date'1990-9-31'").show
+-----------------+
|DATE '1990-10-01'|
+-----------------+
|       1990-10-01|
+-----------------+
```

### Does this PR introduce any user-facing change?
Yes, this impacts the conversion of Spark SQL `DATE` values to external dates based on the non-Proleptic Gregorian calendar. For example, while saving the 1582-10-14 date to ORC files, it will be shifted to the next valid date, 1582-10-15.

### How was this patch tested?
- Added tests to `RebaseDateTimeSuite` and to `OrcSourceSuite`
- By existing test suites `DateTimeUtilsSuite`, `DateFunctionsSuite`, `DateExpressionsSuite`, `CollectionExpressionsSuite`, `ParquetIOSuite`.

Closes #28225 from MaxGekk/fix-not-exist-dates.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-15 16:33:56 +00:00
Max Gekk 744c2480b5 [SPARK-31443][SQL] Fix perf regression of toJavaDate
### What changes were proposed in this pull request?
Optimise the `toJavaDate()` method of `DateTimeUtils` by:
1. Re-using `rebaseGregorianToJulianDays` optimised by #28067
2. Creating `java.sql.Date` instances from milliseconds since the epoch in UTC instead of from date-time fields. This avoids the "normalization" inside `java.sql.Date`.

Also, a new benchmark for collecting dates is added to `DateTimeBenchmark`.

### Why are the changes needed?
The changes fix the performance regression of collecting `DATE` values compared to Spark 2.4 (see `DateTimeBenchmark` in https://github.com/MaxGekk/spark/pull/27):

Spark 2.4.6-SNAPSHOT:
```
To/from Java's date-time:                 Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
From java.sql.Date                                  559            603          38          8.9         111.8       1.0X
Collect dates                                      2306           3221        1558          2.2         461.1       0.2X
```
Before the changes:
```
To/from Java's date-time:                 Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
From java.sql.Date                                 1052           1130          73          4.8         210.3       1.0X
Collect dates                                      3251           4943        1624          1.5         650.2       0.3X
```
After:
```
To/from Java's date-time:                 Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
From java.sql.Date                                  416            419           3         12.0          83.2       1.0X
Collect dates                                      1928           2759        1180          2.6         385.6       0.2X
```

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
- By existing tests suites, in particular, `DateTimeUtilsSuite`, `RebaseDateTimeSuite`, `DateFunctionsSuite`, `DateExpressionsSuite`.
- Re-run `DateTimeBenchmark` in the environment:

| Item | Description |
| ---- | ----|
| Region | us-west-2 (Oregon) |
| Instance | r3.xlarge |
| AMI | ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1 (ami-06f2f779464715dc5) |
| Java | OpenJDK 64-Bit Server VM 1.8.0_242 and OpenJDK 64-Bit Server VM 11.0.6+10 |

Closes #28212 from MaxGekk/optimize-toJavaDate.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-15 06:19:12 +00:00
Max Gekk 2c5d489679 [SPARK-31439][SQL] Fix perf regression of fromJavaDate
### What changes were proposed in this pull request?
In the PR, I propose to re-use the optimized implementation of the days rebase function `rebaseJulianToGregorianDays()`, introduced by PR #28067, in the conversion of `java.sql.Date` values to Catalyst's `DATE` values. The function `fromJavaDate` in `DateTimeUtils` was re-written by taking the implementation from Spark 2.4 and rebasing the final results via `rebaseJulianToGregorianDays()`.

Also I updated `DateTimeBenchmark`, and added a benchmark for conversion from `java.sql.Date`.

### Why are the changes needed?
The PR fixes the regression of parallelizing a collection of `java.sql.Date` values, and improves performance of converting external values to Catalyst's `DATE` values:
- x4 on the master branch
- 30% against Spark 2.4.6-SNAPSHOT

Spark 2.4.6-SNAPSHOT:
```
To/from java.sql.Timestamp:               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
From java.sql.Date                                  614            655          43          8.1         122.8       1.0X
```

Before the changes:
```
To/from java.sql.Timestamp:               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
From java.sql.Date                                 1154           1206          46          4.3         230.9       1.0X
```

After:
```
To/from java.sql.Timestamp:               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
From java.sql.Date                                  427            434           7         11.7          85.3       1.0X
```

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
- By existing tests suites, in particular, `DateTimeUtilsSuite`, `RebaseDateTimeSuite`, `DateFunctionsSuite`, `DateExpressionsSuite`.
- Re-run `DateTimeBenchmark` in the environment:

| Item | Description |
| ---- | ----|
| Region | us-west-2 (Oregon) |
| Instance | r3.xlarge |
| AMI | ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1 (ami-06f2f779464715dc5) |
| Java | OpenJDK 64-Bit Server VM 1.8.0_242 and OpenJDK 64-Bit Server VM 11.0.6+10 |

Closes #28205 from MaxGekk/optimize-fromJavaDate.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-14 14:44:00 +00:00
Wenchen Fan 6b88d136de [SPARK-31402][SQL][FOLLOWUP] Refine code comments in RebaseDateTime
### What changes were proposed in this pull request?

Refine the code comments of days rebasing to be consistent with the micros rebasing, i.e., one method is the actual implementation and the other variant is the optimized version.

### Why are the changes needed?

improve code comments

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

N/A

Closes #28199 from cloud-fan/comment.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-14 08:06:55 +00:00
Max Gekk a0f8cc08a3 [SPARK-31426][SQL] Fix perf regressions of toJavaTimestamp/fromJavaTimestamp
### What changes were proposed in this pull request?
Reuse the `rebaseGregorianToJulianMicros()` and `rebaseJulianToGregorianMicros()` functions introduced by PR #28119 in `DateTimeUtils`.`toJavaTimestamp()` and `fromJavaTimestamp()`. The new implementation is essentially the Spark 2.4 one plus rebasing via pre-calculated rebasing maps.

### Why are the changes needed?
The changes speed up conversions to/from java.sql.Timestamp and, as a consequence, the PR improves the performance of the ORC datasource in loading/saving timestamps:
- Saving ~ **x2.8 faster** in master, and -11% against Spark 2.4.6
- Loading - **x3.2-4.5 faster** in master, -5% against Spark 2.4.6

Before:
```
Save timestamps to ORC:                   Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
after 1582                                        59877          59877           0          1.7         598.8       0.0X
before 1582                                       61361          61361           0          1.6         613.6       0.0X

Load timestamps from ORC:                 Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
after 1582, vec off                               48197          48288         118          2.1         482.0       1.0X
after 1582, vec on                                38247          38351         128          2.6         382.5       1.3X
before 1582, vec off                              53179          53359         249          1.9         531.8       0.9X
before 1582, vec on                               44076          44268         269          2.3         440.8       1.1X
```

After:
```
Save timestamps to ORC:                   Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
after 1582                                        21250          21250           0          4.7         212.5       0.1X
before 1582                                       22105          22105           0          4.5         221.0       0.1X

Load timestamps from ORC:                 Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
after 1582, vec off                               14903          14933          40          6.7         149.0       1.0X
after 1582, vec on                                 8342           8426          73         12.0          83.4       1.8X
before 1582, vec off                              15528          15575          76          6.4         155.3       1.0X
before 1582, vec on                                9025           9075          61         11.1          90.2       1.7X
```

Spark 2.4.6-SNAPSHOT:
```
Save timestamps to ORC:                   Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
after 1582                                        18858          18858           0          5.3         188.6       1.0X
before 1582                                       18508          18508           0          5.4         185.1       1.0X

Load timestamps from ORC:                 Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
after 1582, vec off                               14063          14177         143          7.1         140.6       1.0X
after 1582, vec on                                 5955           6029         100         16.8          59.5       2.4X
before 1582, vec off                              14119          14126           7          7.1         141.2       1.0X
before 1582, vec on                                5991           6007          25         16.7          59.9       2.3X
```

### Does this PR introduce any user-facing change?
Yes, the `to_utc_timestamp` function returns the later local timestamp in the case of overlapping local timestamps at daylight saving time. It is changed back to the 2.4 behavior.

### How was this patch tested?
- By existing test suite `DateTimeUtilsSuite`, `RebaseDateTimeSuite`, `DateFunctionsSuite`, `DateExpressionsSuites`, `ParquetIOSuite`, `OrcHadoopFsRelationSuite`.
- Re-generating results of the benchmarks `DateTimeBenchmark` and `DateTimeRebaseBenchmark` in the environment:

| Item | Description |
| ---- | ----|
| Region | us-west-2 (Oregon) |
| Instance | r3.xlarge |
| AMI | ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1 (ami-06f2f779464715dc5) |
| Java | OpenJDK 64-Bit Server VM 1.8.0_242 and OpenJDK 64-Bit Server VM 11.0.6+10 |

Closes #28189 from MaxGekk/optimize-to-from-java-timestamp.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-14 04:50:20 +00:00
Max Gekk cf63ad61f5 [SPARK-31402][SQL] Fix rebasing of BCE dates/timestamps
### What changes were proposed in this pull request?
In the PR, I propose to fall back to rebasing via local dates/timestamps for days/micros before the common era (BCE).

### Why are the changes needed?
It fixes the bug of rebasing BCE dates/timestamps.

### Does this PR introduce any user-facing change?
Yes

### How was this patch tested?
- By existing tests in `RebaseDateTimeSuite` and `DateTimeUtilsSuite`
- Added tests for negative years to `RebaseDateTimeSuite`

Closes #28172 from MaxGekk/fix-era-in-date-micros-rebasing.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-13 06:07:31 +00:00
Kent Yao d65f534c5a [SPARK-31414][SQL] Fix performance regression with new TimestampFormatter for json and csv time parsing
### What changes were proposed in this pull request?

With the original benchmark, where the timestamp values are valid for the new parser,

the result is
```scala
[info] Running benchmark: Read dates and timestamps
[info]   Running case: timestamp strings
[info]   Stopped after 3 iterations, 5781 ms
[info]   Running case: parse timestamps from Dataset[String]
[info]   Stopped after 3 iterations, 44764 ms
[info]   Running case: infer timestamps from Dataset[String]
[info]   Stopped after 3 iterations, 93764 ms
[info]   Running case: from_json(timestamp)
[info]   Stopped after 3 iterations, 59021 ms
```
When we modify the benchmark to

```scala
     def timestampStr: Dataset[String] = {
        spark.range(0, rowsNum, 1, 1).mapPartitions { iter =>
          iter.map(i => s"""{"timestamp":"1970-01-01T01:02:03.${i % 100}"}""")
        }.select($"value".as("timestamp")).as[String]
      }

      readBench.addCase("timestamp strings", numIters) { _ =>
        timestampStr.noop()
      }

      readBench.addCase("parse timestamps from Dataset[String]", numIters) { _ =>
        spark.read.schema(tsSchema).json(timestampStr).noop()
      }

      readBench.addCase("infer timestamps from Dataset[String]", numIters) { _ =>
        spark.read.json(timestampStr).noop()
      }
```
where the timestamp values are invalid for the new parser, which causes a fallback to the legacy parser (2.4).
the result is

```scala
[info] Running benchmark: Read dates and timestamps
[info]   Running case: timestamp strings
[info]   Stopped after 3 iterations, 5623 ms
[info]   Running case: parse timestamps from Dataset[String]
[info]   Stopped after 3 iterations, 506637 ms
[info]   Running case: infer timestamps from Dataset[String]
[info]   Stopped after 3 iterations, 509076 ms
```
This is about a 10x perf regression.

BUT if we modify the timestamp pattern to `....HH:mm:ss[.SSS][XXX]`, which makes all timestamp values valid for the new parser and prohibits the fallback, the result is

```scala
[info] Running benchmark: Read dates and timestamps
[info]   Running case: timestamp strings
[info]   Stopped after 3 iterations, 5623 ms
[info]   Running case: parse timestamps from Dataset[String]
[info]   Stopped after 3 iterations, 506637 ms
[info]   Running case: infer timestamps from Dataset[String]
[info]   Stopped after 3 iterations, 509076 ms
```

### Why are the changes needed?

 Fix performance regression.

### Does this PR introduce any user-facing change?

NO
### How was this patch tested?

new tests added.

Closes #28181 from yaooqinn/SPARK-31414.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-13 03:11:28 +00:00
Kousuke Saruta 6cd0bef7fe [SPARK-31416][SQL] Check more strictly that a field name can be used as a valid Java identifier for codegen
### What changes were proposed in this pull request?

Check more strictly that a field name can be used as a valid Java identifier in `ScalaReflection.serializerFor`.
To check that, `SourceVersion` is used, so that we need not manually maintain a list of reserved keywords for future Java versions (e.g., underscore, var, yield).
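For illustration, the kind of check `SourceVersion` provides (this is not the Spark code itself):
```scala
import javax.lang.model.SourceVersion

// isName rejects keywords, literals and anything that is not a syntactically valid Java name.
Seq("enum", "0", "value", "va lue").foreach { n =>
  println(s"$n -> ${SourceVersion.isName(n)}")
}
// enum -> false, 0 -> false, value -> true, va lue -> false
```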

### Why are the changes needed?

In the current implementation, `enum` is not checked even though it's a reserved keyword.
Also, there are lots of characters and character sequences, including numeric literals, that are not checked.
So we can't get a better error message with the following code.
```
case class  Data(`0`: Int)
Seq(Data(1)).toDF.show

20/04/11 03:24:24 ERROR CodeGenerator: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 43, Column 1: Expression "value_0 = value_3" is not a type
org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 43, Column 1: Expression "value_0 = value_3" is not a type

...

```

### Does this PR introduce any user-facing change?

Yes. With this change and the code example above, we can get the following error message.
```
java.lang.UnsupportedOperationException: `0` is not a valid identifier of Java and cannot be used as field name
- root class: "Data"

...
```

### How was this patch tested?

Add another assertion to existing test case.

Closes #28184 from sarutak/improve-identifier-check.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-04-12 13:14:41 -07:00
Dilip Biswal f0e2fc37d1 [SPARK-25154][SQL] Support NOT IN sub-queries inside nested OR conditions
### What changes were proposed in this pull request?

Currently, NOT IN subqueries (predicated null-aware subqueries) are not allowed inside OR expressions. We currently catch this condition in checkAnalysis and throw an error.

This PR enhances the subquery rewrite to support this type of query.

Query
```SQL
SELECT * FROM s1 WHERE a > 5 or b NOT IN (SELECT c FROM s2);
```
Optimized Plan
```SQL
== Optimized Logical Plan ==
Project [a#3, b#4]
+- Filter ((a#3 > 5) || NOT exists#7)
   +- Join ExistenceJoin(exists#7), ((b#4 = c#5) || isnull((b#4 = c#5)))
      :- HiveTableRelation `default`.`s1`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [a#3, b#4]
      +- Project [c#5]
         +- HiveTableRelation `default`.`s2`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [c#5, d#6]
```
This is rework from #22141.
The original author of this PR is dilipbiswal.

Closes #22141

### Why are the changes needed?

For better usability.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Added new tests in SQLQueryTestSuite, RewriteSubquerySuite and SubquerySuite.
Output from DB2 as a reference:
[nested-not-db2.txt](https://github.com/apache/spark/files/2299945/nested-not-db2.txt)

Closes #28158 from maropu/pr22141.

Lead-authored-by: Dilip Biswal <dkbiswal@gmail.com>
Co-authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Co-authored-by: Dilip Biswal <dbiswal@us.ibm.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-04-11 08:28:11 +09:00
Kent Yao a454510917 [SPARK-31392][SQL] Support CalendarInterval to be reflected to CalendarIntervalType
### What changes were proposed in this pull request?

Since 3.0.0, we have made CalendarInterval public for input, so it's better for it to be inferred as CalendarIntervalType.
In the PR, we add a rule for CalendarInterval to be mapped to CalendarIntervalType in ScalaReflection, so that records (e.g. case classes, tuples, ...) containing interval fields can be converted to a DataFrame.

### Why are the changes needed?

CalendarInterval is public but cannot be used as input for a DataFrame.

```scala
scala> import org.apache.spark.unsafe.types.CalendarInterval
import org.apache.spark.unsafe.types.CalendarInterval

scala> Seq((1, new CalendarInterval(1, 2, 3))).toDF("a", "b")
java.lang.UnsupportedOperationException: Schema for type org.apache.spark.unsafe.types.CalendarInterval is not supported
  at org.apache.spark.sql.catalyst.ScalaReflection$.$anonfun$schemaFor$1(ScalaReflection.scala:735)
```

This should be supported, just as the following already works:
```scala
scala> sql("select interval 2 month 1 day a")
res2: org.apache.spark.sql.DataFrame = [a: interval]
```
### Does this PR introduce any user-facing change?

Yes, records (e.g. case classes, tuples, ...) containing interval fields can be converted to a DataFrame.
### How was this patch tested?

Add UTs.

Closes #28165 from yaooqinn/SPARK-31392.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-10 07:34:01 +00:00
Wenchen Fan 148950fa2b [SPARK-31359][DOC][FOLLOWUP] improve code comments in RebaseDateTime
### What changes were proposed in this pull request?

improve the code comments and make them consistent between `rebaseJulianToGregorian*` and `rebaseGregorianToJulian*`

### Why are the changes needed?

improve readability.

### Does this PR introduce any user-facing change?

no

### How was this patch tested?

N/A

Closes #28166 from cloud-fan/comment.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-10 03:43:32 +00:00
Max Gekk e2d9399602 [SPARK-31359][SQL] Speed up timestamps rebasing
### What changes were proposed in this pull request?
In the PR, I propose to optimise the `DateTimeUtils`.`rebaseJulianToGregorianMicros()` and `rebaseGregorianToJulianMicros()` functions and make them faster by using pre-calculated rebasing tables. This approach avoids expensive conversions via local timestamps. For example, the `America/Los_Angeles` time zone has just a few time points at which the difference between the Proleptic Gregorian calendar and the hybrid calendar (Julian + Gregorian since 1582-10-15) changes in the time interval 0001-01-01 .. 2100-01-01:

| i | local  timestamp | Proleptic Greg. seconds | Hybrid (Julian+Greg) seconds | difference in minutes|
| -- | ------- |----|----| ---- |
|0|0001-01-01 00:00|-62135568422|-62135740800|-2872|
|1|0100-03-01 00:00|-59006333222|-59006419200|-1432|
|...|...|...|...|...|
|13|1582-10-15 00:00|-12219264422|-12219264000|7|
|14|1883-11-18 12:00|-2717640000|-2717640000|0|

The difference in microseconds between Proleptic and hybrid calendars for any local timestamp in time intervals `[local timestamp(i), local timestamp(i+1))`, and for any microseconds in the time interval `[Gregorian micros(i), Gregorian micros(i+1))` is the same. In this way, we can rebase an input micros by following the steps:
1. Look at the table, and find the time interval where the micros falls to
2. Take the difference between 2 calendars for this time interval
3. Add the difference to the input micros. The result is the rebased microseconds with the same local timestamp representation. (A minimal sketch of this lookup is shown right after this list.)
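A simplified, hypothetical sketch of this lookup (not the Spark implementation):
```scala
// `switches` holds the switch time points ordered from old to recent, and `diffs` the
// calendar differences for the corresponding intervals; both are in microseconds and
// have the same length.
def rebaseMicros(switches: Array[Long], diffs: Array[Long], micros: Long): Long = {
  var i = switches.length - 1
  while (i > 0 && micros < switches(i)) i -= 1  // find the interval the input falls into
  micros + diffs(i)                             // apply the diff for that interval
}
```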

Here are details of the implementation:
- Pre-calculated tables are stored to JSON files `gregorian-julian-rebase-micros.json` and `julian-gregorian-rebase-micros.json` in the resource folder of `sql/catalyst`. The diffs and switch time points are stored as seconds, for example:
```json
[
  {
    "tz" : "America/Los_Angeles",
    "switches" : [ -62135740800, -59006419200, ... , -2717640000 ],
    "diffs" : [ 172378, 85978, ..., 0 ]
  }
]
```
  The JSON files are generated by 2 tests in `RebaseDateTimeSuite` - `generate 'gregorian-julian-rebase-micros.json'` and `generate 'julian-gregorian-rebase-micros.json'`. Both tests are disabled by default.
  The `switches` time points are ordered from old to recent timestamps. This condition is checked by the test `validate rebase records in JSON files` in `RebaseDateTimeSuite`. Also sizes of the `switches` and `diffs` arrays are the same (this is checked by the same test).

- The **_Asia/Tehran, Iran, Africa/Casablanca and Africa/El_Aaiun_** time zones weren't added to the JSON files, see [SPARK-31385](https://issues.apache.org/jira/browse/SPARK-31385)
- The rebase info from the JSON files is placed into hash tables - `gregJulianRebaseMap` and `julianGregRebaseMap`. I use `AnyRefMap` because it is almost 2 times faster than Scala's immutable Map. Also, I tried `java.util.HashMap` but it has worse lookup time than `AnyRefMap` in our case.
The hash maps store the switch time points and diffs in microseconds precision to avoid conversions from microseconds to seconds in the runtime.

- I moved the code related to days and microseconds rebasing to the separate object `RebaseDateTime` so as not to pollute `DateTimeUtils`. Tests related to date-time rebasing are moved to `RebaseDateTimeSuite` for the same reason.

- I placed rebasing via local timestamps in separate methods that require a zone id as the first parameter, assuming that the caller already has the zone id. This avoids unnecessarily retrieving the default time zone. The methods are marked as `private[sql]` because they are used in `RebaseDateTimeSuite` as the reference implementation.

- Modified the `rebaseGregorianToJulianMicros()` and `rebaseJulianToGregorianMicros()` methods in `RebaseDateTime` to look up the rebase tables first of all. If the hash maps don't contain rebasing info for the given time zone id, the methods fall back to the implementation via local timestamps. This allows supporting time zones specified as zone offsets like '-08:00'.

### Why are the changes needed?
To make timestamps rebasing faster:
- Saving timestamps to parquet files is ~ **x3.8 faster**
- Loading timestamps from parquet files is ~**x2.8 faster**.
- Loading timestamps by Vectorized reader ~**x4.6 faster**.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
- Added the test `validate rebase records in JSON files` to `RebaseDateTimeSuite`. The test validates 2 json files from the resource folder - `gregorian-julian-rebase-micros.json` and `julian-gregorian-rebase-micros.json`, and it checks per each time zone records that
  - the number of switch points is equal to the number of diffs between calendars. If the numbers are different, this will violate the assumption made in `RebaseDateTime.rebaseMicros`.
  - switch points are ordered from old to recent timestamps. This pre-condition is required for linear search in the `rebaseMicros` function.
- Added the test `optimization of micros rebasing - Gregorian to Julian` to `RebaseDateTimeSuite` which iterates over timestamps from 0001-01-01 to 2100-01-01 with the steps 1 ± 0.5 months, and checks that optimised function `RebaseDateTime`.`rebaseGregorianToJulianMicros()` returns the same result as non-optimised one. The check is performed for the UTC, PST, CET, Africa/Dakar, America/Los_Angeles, Antarctica/Vostok, Asia/Hong_Kong, Europe/Amsterdam time zones.
- Added the test `optimization of micros rebasing - Julian to Gregorian` to `RebaseDateTimeSuite` which does similar checks as the test above but for rebasing from the hybrid calendar (Julian + Gregorian) to Proleptic Gregorian calendar.
- The tests for days rebasing are moved from `DateTimeUtilsSuite` to `RebaseDateTimeSuite` because the rebasing related code is moved from `DateTimeUtils` to the separate object `RebaseDateTime`.
- Re-run `DateTimeRebaseBenchmark` at the America/Los_Angeles time zone (it is set explicitly in the PR #28127):

| Item | Description |
| ---- | ----|
| Region | us-west-2 (Oregon) |
| Instance | r3.xlarge |
| AMI | ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1 (ami-06f2f779464715dc5) |
| Java | OpenJDK 64-Bit Server VM 1.8.0_242 and OpenJDK 64-Bit Server VM 11.0.6+10 |

Closes #28119 from MaxGekk/optimize-rebase-micros.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-09 05:23:52 +00:00
iRakson b56242332d [SPARK-31009][SQL] Support json_object_keys function
### What changes were proposed in this pull request?
A new function `json_object_keys` is proposed in this PR. This function returns all the keys of the outermost JSON object. It takes a JSON object as an argument.

- If an invalid JSON expression is given, `NULL` will be returned.
- If an empty string or a JSON array is given, `NULL` will be returned.
- If a valid JSON object is given, all the keys of the outermost object will be returned as an array.
- For an empty JSON object, an empty array is returned.

We can also get JSON object keys using `map_keys+from_json`.  But `json_object_keys` is more efficient.
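An illustrative usage (the exact output formatting is assumed):
```scala
spark.sql("""SELECT json_object_keys('{"a": 1, "b": [2, 3]}')""").show(false)
// expected keys of the outermost object: [a, b]
```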
```
Performance result for json_object = {"a":[1,2,3,4,5], "b":[2,4,5,12333321]}

Intel(R) Core(TM) i7-9750H CPU  2.60GHz
JSON functions:                           Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
json_object_keys                                  11666          12361         673          0.9        1166.6       1.0X
from_json+map_keys                                15309          15973         701          0.7        1530.9       0.8X

```

### Why are the changes needed?
This function will help users directly extract the keys from a JSON string, and it's fairly intuitive as well. It also extends the functionality of Spark SQL for JSON strings.

Some of the most popular DBMSs support this function.
- PostgreSQL
- MySQL
- MariaDB

### Does this PR introduce any user-facing change?
Yes. Now users can extract the keys of JSON objects using this function.

### How was this patch tested?
UTs added.

Closes #27836 from iRakson/jsonKeys.

Authored-by: iRakson <raksonrakesh@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-04-08 13:04:59 -07:00
iRakson 71022d7130 [SPARK-31008][SQL] Support json_array_length function
### What changes were proposed in this pull request?
At the moment we do not have any function to compute the length of a JSON array directly.
I propose a `json_array_length` function which returns the length of the outermost JSON array.

- This function returns the length of the outermost JSON array, if the JSON array is valid.
```

scala> spark.sql("select json_array_length('[1,2,3,[33,44],{\"key\":[2,3,4]}]')").show
+--------------------------------------------------+
|json_array_length([1,2,3,[33,44],{"key":[2,3,4]}])|
+--------------------------------------------------+
|                                                 5|
+--------------------------------------------------+

scala> spark.sql("select json_array_length('[[1],[2,3]]')").show
+------------------------------+
|json_array_length([[1],[2,3]])|
+------------------------------+
|                             2|
+------------------------------+

```
- In case of any other valid JSON string, an invalid JSON string, a null array, or `NULL` input, `NULL` will be returned.
```
scala> spark.sql("select json_array_length('')").show
+-------------------+
|json_array_length()|
+-------------------+
|               null|
+-------------------+
```

### Why are the changes needed?

- As mentioned in the JIRA, this function is supported by Presto, PostgreSQL, Redshift, SQLite, MySQL, MariaDB, and IBM DB2.

- For better user experience and ease of use.

```
Performance Result for Json array - [1, 2, 3, 4]

Intel(R) Core(TM) i7-9750H CPU  2.60GHz
JSON functions:                           Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
json_array_length                                  7728           7762          53          1.3         772.8       1.0X
size+from_json                                    12739          12895         199          0.8        1273.9       0.6X

```

### Does this PR introduce any user-facing change?
Yes, now users can get the length of a JSON array by using `json_array_length`.

### How was this patch tested?
Added UT.

Closes #27759 from iRakson/jsonArrayLength.

Authored-by: iRakson <raksonrakesh@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-04-07 15:34:33 -07:00
Eric Wu a28ed86a38 [SPARK-31113][SQL] Add SHOW VIEWS command
### What changes were proposed in this pull request?
Previously, users could issue `SHOW TABLES` to get info on both tables and views.
This PR (SPARK-31113) implements the `SHOW VIEWS` SQL command, similar to Hive, to get views only (https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-ShowViews).

**Hive** -- Only show view names
```
hive> SHOW VIEWS;
OK
view_1
view_2
...
```

**Spark(Hive-Compatible)** -- Only show view names, used in tests and `SparkSQLDriver` for CLI applications
```
SHOW VIEWS IN showdb;
view_1
view_2
...
```

**Spark** -- Show more information database/viewName/isTemporary
```
spark-sql> SHOW VIEWS;
userdb	view_1	false
userdb	view_2	false
...
```

### Why are the changes needed?
The `SHOW VIEWS` command provides better granularity by returning information about views only.

### Does this PR introduce any user-facing change?
Yes, it adds the new `SHOW VIEWS` SQL command.

### How was this patch tested?
Add new test `show-views.sql` and pass existing tests

Closes #27897 from Eric5553/ShowViews.

Authored-by: Eric Wu <492960551@qq.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-04-07 09:25:01 -07:00
HyukjinKwon 6cf7336a07 [SPARK-30841][SQL][FOLLOW-UP] Change 'version' of spark.sql.execution.pandas.udf.buffer.size to 3.0.0
### What changes were proposed in this pull request?

This PR fixes the added version of `spark.sql.execution.pandas.udf.buffer.size` to 3.0.0 (see also SPARK-27870)

### Why are the changes needed?

To show the correct version added.

### Does this PR introduce any user-facing change?

Yes but only in the unreleased branches. It will change the version shown in SQL documentation.

### How was this patch tested?

Not tested. Jenkins will test it out.

Closes #28144 from HyukjinKwon/SPARK-30841-followup.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-04-07 23:06:53 +09:00
beliefer 112caea214 [SPARK-31315][SQL][FOLLOWUP][MINOR] Fix some typo and improve comments
### What changes were proposed in this pull request?
This is a minor PR to fix some typos and improve the comments mentioned in https://github.com/apache/spark/pull/28081/files#r402874997

### Why are the changes needed?
Fix some typos and improve comments.

### Does this PR introduce any user-facing change?
'No'.

### How was this patch tested?
Jenkins test.

Closes #28112 from beliefer/fix-typo-in-codegen.

Authored-by: beliefer <beliefer@163.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-04-06 17:43:02 +09:00
Maxim Gekk 09198528b9 [SPARK-31343][SQL][TESTS] Check codegen does not fail on expressions with escape chars in string parameters
### What changes were proposed in this pull request?
In the PR, I propose to add tests to check that code generation doesn't fail if an expression's string argument contains escape chars. The PR adds tests similar to those added by https://github.com/apache/spark/pull/20182 for `from_utc_timestamp` / `to_utc_timestamp`.

### Why are the changes needed?
To prevent regressions in the future.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
By running the affected tests

Closes #28115 from MaxGekk/tests-arg-escape.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-06 05:43:29 +00:00
Liang-Chi Hsieh 1f02871489 [SPARK-30921][PYSPARK] Predicates on python udf should not be pushdown through Aggregate
### What changes were proposed in this pull request?

This patch prevents predicates on PythonUDFs from being pushed down through Aggregate.

### Why are the changes needed?

Predicates on PythonUDFs cannot be pushed down through Aggregate. Pushed-down predicates cannot be evaluated because PythonUDFs cannot be evaluated on Filter, which causes errors like:

```
Caused by: java.lang.UnsupportedOperationException: Cannot generate code for expression: mean(input[1, struct<bar:bigint>, true].bar)
        at org.apache.spark.sql.catalyst.expressions.Unevaluable.doGenCode(Expression.scala:304)
        at org.apache.spark.sql.catalyst.expressions.Unevaluable.doGenCode$(Expression.scala:303)
        at org.apache.spark.sql.catalyst.expressions.PythonUDF.doGenCode(PythonUDF.scala:52)
        at org.apache.spark.sql.catalyst.expressions.Expression.$anonfun$genCode$3(Expression.scala:146)
        at scala.Option.getOrElse(Option.scala:189)
        at org.apache.spark.sql.catalyst.expressions.Expression.genCode(Expression.scala:141)
        at org.apache.spark.sql.catalyst.expressions.CastBase.doGenCode(Cast.scala:821)
        at org.apache.spark.sql.catalyst.expressions.Expression.$anonfun$genCode$3(Expression.scala:146)
        at scala.Option.getOrElse(Option.scala:189)
```

### Does this PR introduce any user-facing change?

Yes. Previously, predicates on PythonUDFs pushed down through Aggregate could cause errors. After this change, such queries work.

### How was this patch tested?

Unit test.

Closes #28089 from viirya/SPARK-30921.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-04-06 09:36:20 +09:00
gatorsmile 39a6f518cb [SPARK-29554][SQL][FOLLOWUP] Update Auto-generated Alias Name for Version
### What changes were proposed in this pull request?
The auto-generated alias name of built-in function `version()` is `sparkversion()`. After this PR, it is updated to `version()`.

### Why are the changes needed?
Based on our auto-generated alias name convention for the built-in functions, the alias names should be consistent with the function names.

This built-in function `version` is added in the upcoming Spark 3.0. Thus, we should fix it before the release.

### Does this PR introduce any user-facing change?
Yes. Update the column name in schema if users do not specify the alias.

### How was this patch tested?
Added a test case.

Closes #28131 from gatorsmile/spark-29554followup.

Authored-by: gatorsmile <gatorsmile@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-04-05 16:37:03 -07:00
Maxim Gekk 820bb9985a [SPARK-31328][SQL] Fix rebasing of overlapped local timestamps during daylight saving time
### What changes were proposed in this pull request?
1. Fix the `rebaseGregorianToJulianMicros()` function in `DateTimeUtils` by passing the daylight saving offset associated with the input `micros` to the constructed instance of `GregorianCalendar`. The problem is in `cal.getTimeInMillis`, which returns the earliest instant in the case of local date-time overlaps, see https://github.com/AdoptOpenJDK/openjdk-jdk8u/blob/master/jdk/src/share/classes/java/util/GregorianCalendar.java#L2783-L2786 . I fixed the issue by keeping the standard zone offset as is and setting the DST offset only. I don't set `ZONE_OFFSET` because time zone resolution works differently in the Java 8 and Java 7 time APIs. So, if I set the standard zone offsets too, this could change the behavior, and rebasing wouldn't give the same result as Spark 2.4.
2. Fix `rebaseJulianToGregorianMicros()` by changing the resulting zoned date-time if `DST_OFFSET` is zero, which means the input date-time has passed an autumn daylight-savings cutover. In that case, I take the latest local timestamp out of the 2 overlapped timestamps. Otherwise, I return the zoned date-time w/o any modification because it is equal to calling the `withEarlierOffsetAtOverlap()` method, so we can optimize that case.

### Why are the changes needed?
This fixes the bug of losing the DST offset info when rebasing timestamps via local date-time. For example, there are 2 different timestamps in the `America/Los_Angeles` time zone: `2019-11-03T01:00:00-07:00` and `2019-11-03T01:00:00-08:00`, though they are mapped to the same local date-time `2019-11-03T01:00`, see
<img width="456" alt="Screen Shot 2020-04-02 at 10 19 24" src="https://user-images.githubusercontent.com/1580697/78245697-95a7da00-74f0-11ea-9eba-c08138851cb3.png">
Currently, the UTC timestamp `2019-11-03T09:00:00Z` is converted to `2019-11-03T01:00:00-08:00`, and then to `2019-11-03T01:00:00` (in the original calendar, for instance Proleptic Gregorian calendar) and back to the UTC timestamp `2019-11-03T08:00:00Z` (in the hybrid calendar - Gregorian for the timestamp). That's wrong because the local timestamp must be converted to the original timestamp `2019-11-03T09:00:00Z`.
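The overlap itself can be reproduced with the plain Java time APIs (illustrative only, not Spark code):
```scala
import java.time.{LocalDateTime, ZoneId, ZonedDateTime}

// At the 2019-11-03 fall-back in America/Los_Angeles the same local time maps
// to two instants, one hour apart.
val zone  = ZoneId.of("America/Los_Angeles")
val local = LocalDateTime.of(2019, 11, 3, 1, 0)
val earlier = ZonedDateTime.of(local, zone).withEarlierOffsetAtOverlap() // 2019-11-03T01:00-07:00
val later   = ZonedDateTime.of(local, zone).withLaterOffsetAtOverlap()   // 2019-11-03T01:00-08:00
// earlier.toInstant == 2019-11-03T08:00:00Z, later.toInstant == 2019-11-03T09:00:00Z
```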

### Does this PR introduce any user-facing change?
Yes

### How was this patch tested?
- Added a test to `DateTimeUtilsSuite` which checks that rebased micros are the same as the input during DST. The result must be the same if Java 8 and 7 time API functions return the same time zone offsets.
- Run the following code to check that there is no difference between rebased and original micros for modern timestamps:
```scala
    test("rebasing differences") {
      withDefaultTimeZone(getZoneId("America/Los_Angeles")) {
        val start = instantToMicros(LocalDateTime.of(1, 1, 1, 0, 0, 0)
          .atZone(getZoneId("America/Los_Angeles"))
          .toInstant)
        val end = instantToMicros(LocalDateTime.of(2030, 1, 1, 0, 0, 0)
          .atZone(getZoneId("America/Los_Angeles"))
          .toInstant)

        var micros = start
        var diff = Long.MaxValue
        var counter = 0
        while (micros < end) {
          val rebased = rebaseGregorianToJulianMicros(micros)
          val curDiff = rebased - micros
          if (curDiff != diff) {
            counter += 1
            diff = curDiff
            val ldt = microsToInstant(micros).atZone(getZoneId("America/Los_Angeles")).toLocalDateTime
            println(s"local date-time = $ldt diff = ${diff / MICROS_PER_MINUTE} minutes")
          }
          micros += 30 * MICROS_PER_MINUTE
        }
        println(s"counter = $counter")
      }
    }
```
```
local date-time = 0001-01-01T00:00 diff = -2872 minutes
local date-time = 0100-03-01T00:00 diff = -1432 minutes
local date-time = 0200-03-01T00:00 diff = 7 minutes
local date-time = 0300-03-01T00:00 diff = 1447 minutes
local date-time = 0500-03-01T00:00 diff = 2887 minutes
local date-time = 0600-03-01T00:00 diff = 4327 minutes
local date-time = 0700-03-01T00:00 diff = 5767 minutes
local date-time = 0900-03-01T00:00 diff = 7207 minutes
local date-time = 1000-03-01T00:00 diff = 8647 minutes
local date-time = 1100-03-01T00:00 diff = 10087 minutes
local date-time = 1300-03-01T00:00 diff = 11527 minutes
local date-time = 1400-03-01T00:00 diff = 12967 minutes
local date-time = 1500-03-01T00:00 diff = 14407 minutes
local date-time = 1582-10-15T00:00 diff = 7 minutes
local date-time = 1883-11-18T12:22:58 diff = 0 minutes
counter = 15
```
The code is not added to `DateTimeUtilsSuite` because it takes > 30 seconds.
- By running the updated benchmark `DateTimeRebaseBenchmark` via the command:
```
SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.execution.benchmark.DateTimeRebaseBenchmark"
```
in the environment:

| Item | Description |
| ---- | ----|
| Region | us-west-2 (Oregon) |
| Instance | r3.xlarge |
| AMI | ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1 (ami-06f2f779464715dc5) |
| Java | OpenJDK 1.8.0_242-8u242/11.0.6+10 |

Closes #28101 from MaxGekk/fix-local-date-overlap.

Lead-authored-by: Maxim Gekk <max.gekk@gmail.com>
Co-authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-03 04:35:31 +00:00
Takeshi Yamamuro d98df7626b [SPARK-31325][SQL][WEB UI] Control a plan explain mode in the events of SQL listeners via SQLConf
### What changes were proposed in this pull request?

This PR intends to add a new SQL config for controlling the plan explain mode in the events of SQL listeners (e.g., `SparkListenerSQLExecutionStart` and `SparkListenerSQLAdaptiveExecutionUpdate`). In the current master, the output of `QueryExecution.toString` (equivalent to the "extended" explain mode) is stored in these events. I think it is useful to control the content via `SQLConf`. For example, the query "Details" content (TPCDS q66 query) of the SQL tab in the Spark web UI will be changed as follows:

Before this PR:
![q66-extended](https://user-images.githubusercontent.com/692303/78211668-950b4580-74e8-11ea-90c6-db52d437534b.png)

After this PR:
![q66-formatted](https://user-images.githubusercontent.com/692303/78211674-9ccaea00-74e8-11ea-9d1d-43c7e2b0f314.png)

### Why are the changes needed?

For better usability.

### Does this PR introduce any user-facing change?

Yes; since Spark 3.1, SQL UI data adopts the `formatted` mode for the query plan explain results. To restore the behavior before Spark 3.0, you can set `spark.sql.ui.explainMode` to `extended`.
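A hedged usage sketch (assuming an active `SparkSession` named `spark`):
```scala
// Restore the pre-3.1 plan text in the SQL UI events.
spark.conf.set("spark.sql.ui.explainMode", "extended")
```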

### How was this patch tested?

Added unit tests.

Closes #28097 from maropu/SPARK-31325.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
2020-04-02 21:09:16 -07:00
beliefer a9260d0349 [SPARK-31315][SQL] SQLQueryTestSuite: Display the total compile time for generated java code
### What changes were proposed in this pull request?
My investigation shows that `SQLQueryTestSuite` spends a lot of time compiling the generated Java code.
Take `group-by.sql` as an example.
At first, I added some debug log into `SQLQueryTestSuite`.
Please reference 92b6af740c/sql/core/src/test/scala/org/apache/spark/sql/SQLQueryTestSuite.scala (L402)
The execution command is as follows:
`build/sbt "~sql/test-only *SQLQueryTestSuite -- -z group-by.sql"`
The output is shown below:
```
00:56:06.192 WARN org.apache.spark.sql.SQLQueryTestSuite: group-by.sql using configs: spark.sql.codegen.wholeStage=true. run time: 20604
00:56:13.719 WARN org.apache.spark.sql.SQLQueryTestSuite: group-by.sql using configs: spark.sql.codegen.wholeStage=false,spark.sql.codegen.factoryMode=CODEGEN_ONLY. run time: 7526
00:56:18.786 WARN org.apache.spark.sql.SQLQueryTestSuite: group-by.sql using configs: spark.sql.codegen.wholeStage=false,spark.sql.codegen.factoryMode=NO_CODEGEN. run time: 5066
```
According to the log, we can see:

Config | Run time(ms)
-- | --
spark.sql.codegen.wholeStage=true | 20604
spark.sql.codegen.wholeStage=false,spark.sql.codegen.factoryMode=CODEGEN_ONLY | 7526
spark.sql.codegen.wholeStage=false,spark.sql.codegen.factoryMode=NO_CODEGEN | 5066

We should display the total compile time for generated java code.

This PR will add the following to `SQLQueryTestSuite`'s output.
```
=== Metrics of Whole Codegen ===
Total compile time: 80.564516529 seconds
```

Note: At first, I wanted to use `CodegenMetrics.METRIC_COMPILATION_TIME` to do this. After many experiments, I found that `CodegenMetrics.METRIC_COMPILATION_TIME` is only effective for a single test case, and cannot play a role in the whole life cycle of `SQLQueryTestSuite`.
I checked that the type of `CodegenMetrics.METRIC_COMPILATION_TIME` is `Histogram`, which preserves 1028 elements. `Histogram` is a metric which calculates the distribution of a value.

### Why are the changes needed?
Display the total compile time for generated java code.

### Does this PR introduce any user-facing change?
'No'.

### How was this patch tested?
Jenkins test.

Closes #28081 from beliefer/output-codegen-compile-time.

Lead-authored-by: beliefer <beliefer@163.com>
Co-authored-by: gengjiaan <gengjiaan@360.cn>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-02 09:13:22 +00:00
Wenchen Fan 09f036a14c [SPARK-31322][SQL] rename QueryPlan.collectInPlanAndSubqueries to collectWithSubqueries
### What changes were proposed in this pull request?

rename `QueryPlan.collectInPlanAndSubqueries` to `collectWithSubqueries`

### Why are the changes needed?

The old name is too verbose. `QueryPlan` is internal, but it's the core of Catalyst, and we'd better make the API name clearer before we release it.

### Does this PR introduce any user-facing change?

no

### How was this patch tested?

N/A

Closes #28092 from cloud-fan/rename.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-04-01 12:04:40 -07:00
Maxim Gekk c5323d2e8d [SPARK-31318][SQL] Split Parquet/Avro configs for rebasing dates/timestamps in read and in write
### What changes were proposed in this pull request?
In the PR, I propose to replace the following SQL configs:
1.  `spark.sql.legacy.parquet.rebaseDateTime.enabled` by
    - `spark.sql.legacy.parquet.rebaseDateTimeInWrite.enabled` (`false` by default). The config enables rebasing dates/timestamps while saving to Parquet files. If it is set to `true`, dates/timestamps are converted to a local date-time in the Proleptic Gregorian calendar, the date-time fields are extracted and used to build a new local date-time in the hybrid calendar (Julian + Gregorian), and the resulting local date-time is converted to days or microseconds since the epoch.
    - `spark.sql.legacy.parquet.rebaseDateTimeInRead.enabled` (`false` by default). The config enables rebasing of dates/timestamps in reading from Parquet files.
2. `spark.sql.legacy.avro.rebaseDateTime.enabled` by
    - `spark.sql.legacy.avro.rebaseDateTimeInWrite.enabled` (`false` by default). It enables dates/timestamps rebasing from Proleptic Gregorian calendar to the hybrid calendar via local date/timestamps.
    - `spark.sql.legacy.avro.rebaseDateTimeInRead.enabled` (`false` by default). It enables rebasing dates/timestamps from the hybrid calendar to the Proleptic Gregorian calendar in read. The rebasing is performed by converting micros/millis/days to a local date/timestamp in the source calendar, interpreting the resulting date/timestamp in the target calendar, and getting the number of micros/millis/days since the epoch 1970-01-01 00:00:00Z.

### Why are the changes needed?
This allows loading dates/timestamps saved by Spark 2.4 and saving to Parquet/Avro files without rebasing, as well as the reverse use case - loading data saved by Spark 3.0 and saving it in a form which is compatible with Spark 2.4.
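A hedged usage sketch of the new configs (the path and the DataFrame `df` are hypothetical):
```scala
// Write timestamps in a Spark 2.4-compatible form, then rebase them back when reading.
spark.conf.set("spark.sql.legacy.parquet.rebaseDateTimeInWrite.enabled", "true")
df.write.parquet("/tmp/parquet-legacy-ts")

spark.conf.set("spark.sql.legacy.parquet.rebaseDateTimeInRead.enabled", "true")
val loaded = spark.read.parquet("/tmp/parquet-legacy-ts")
```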

### Does this PR introduce any user-facing change?
Yes, users have to use new SQL configs. Old SQL configs are removed by the PR.

### How was this patch tested?
By existing test suites `AvroV1Suite`, `AvroV2Suite` and `ParquetIOSuite`.

Closes #28082 from MaxGekk/split-rebase-configs.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-04-01 04:56:05 +00:00
Wenchen Fan 8b01473e8b [SPARK-31230][SQL] Use statement plans in DataFrameWriter(V2)
### What changes were proposed in this pull request?

Create statement plans in `DataFrameWriter(V2)`, like the SQL API.

### Why are the changes needed?

It's better to leave all the resolution work to the analyzer.

### Does this PR introduce any user-facing change?

no

### How was this patch tested?

existing tests

Closes #27992 from cloud-fan/statement.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-31 23:19:46 +08:00
Maxim Gekk bb0b416f0b [SPARK-31297][SQL] Speed up dates rebasing
### What changes were proposed in this pull request?
In the PR, I propose to replace the current implementation of the `rebaseGregorianToJulianDays()` and `rebaseJulianToGregorianDays()` functions in `DateTimeUtils` with a new one, which is based on the fact that the difference between the Proleptic Gregorian and the hybrid (Julian+Gregorian) calendars changed only 14 times over the entire supported range of valid dates `[0001-01-01, 9999-12-31]`:

| date | Proleptic Greg. days | Hybrid (Julian+Greg) days | diff|
| ---- | ----|----|----|
|0001-01-01|-719162|-719164|-2|
|0100-03-01|-682944|-682945|-1|
|0200-03-01|-646420|-646420|0|
|0300-03-01|-609896|-609895|1|
|0500-03-01|-536847|-536845|2|
|0600-03-01|-500323|-500320|3|
|0700-03-01|-463799|-463795|4|
|0900-03-01|-390750|-390745|5|
|1000-03-01|-354226|-354220|6|
|1100-03-01|-317702|-317695|7|
|1300-03-01|-244653|-244645|8|
|1400-03-01|-208129|-208120|9|
|1500-03-01|-171605|-171595|10|
|1582-10-15|-141427|-141427|0|

For given days since the epoch, the proposed implementation finds the range of days to which the input days belong, and adds the diff in days between the calendars to the input. The result is the rebased days since the epoch in the target calendar.

For example, suppose we need to rebase -650000 days from the Proleptic Gregorian calendar to the hybrid calendar. In that case, the input falls into the bucket [-682944, -646420), and the diff associated with that range is -1. To get the rebased days in the Julian calendar, we add -1 to -650000, and the result is -650001.
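A toy re-check of this worked example, using the first three rows of the table above (illustrative only):
```scala
val switchDays = Array(-719162L, -682944L, -646420L)   // Proleptic Gregorian switch days
val diffs      = Array(-2L, -1L, 0L)                   // diffs for the corresponding buckets
val input      = -650000L
val idx        = switchDays.lastIndexWhere(_ <= input) // bucket [-682944, -646420)
val rebased    = input + diffs(idx)                    // -650000 + (-1) = -650001
```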

### Why are the changes needed?
To make dates rebasing faster.

### Does this PR introduce any user-facing change?
No, the results should be the same for valid range of the `DATE` type `[0001-01-01, 9999-12-31]`.

### How was this patch tested?
- Added 2 tests to `DateTimeUtilsSuite` for the `rebaseGregorianToJulianDays()` and `rebaseJulianToGregorianDays()` functions. The tests check that results of old and new implementation (optimized version) are the same for all supported dates.
- Re-run `DateTimeRebaseBenchmark` on:

| Item | Description |
| ---- | ----|
| Region | us-west-2 (Oregon) |
| Instance | r3.xlarge |
| AMI | ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1 (ami-06f2f779464715dc5) |
| Java | OpenJDK8/11 |

Closes #28067 from MaxGekk/optimize-rebasing.

Lead-authored-by: Maxim Gekk <max.gekk@gmail.com>
Co-authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-31 17:38:47 +08:00
beliefer bed21770af [SPARK-31215][SQL][DOC] Add version information to the static configuration of SQL
### What changes were proposed in this pull request?
Add version information to the static configuration of `SQL`.

I sorted out some information show below.

Item name | Since version | JIRA ID | Commit ID | Note
-- | -- | -- | -- | --
spark.sql.warehouse.dir | 2.0.0 | SPARK-14994 | 054f991c4350af1350af7a4109ee77f4a34822f0#diff-32bb9518401c0948c5ea19377b5069ab |  
spark.sql.catalogImplementation | 2.0.0 | SPARK-14720 and SPARK-13643 | 8fc267ab3322e46db81e725a5cb1adb5a71b2b4d#diff-6bdad48cfc34314e89599655442ff210 |  
spark.sql.globalTempDatabase | 2.1.0 | SPARK-17338 | 23ddff4b2b2744c3dc84d928e144c541ad5df376#diff-6bdad48cfc34314e89599655442ff210 |  
spark.sql.sources.schemaStringLengthThreshold | 1.3.1 | SPARK-6024 | 6200f0709c5c8440decae8bf700d7859f32ac9d5#diff-41ef65b9ef5b518f77e2a03559893f4d | 1.3
spark.sql.filesourceTableRelationCacheSize | 2.2.0 | SPARK-19265 | 9d9d67c7957f7cbbdbe889bdbc073568b2bfbb16#diff-32bb9518401c0948c5ea19377b5069ab |
spark.sql.codegen.cache.maxEntries | 2.4.0 | SPARK-24727 | b2deef64f604ddd9502a31105ed47cb63470ec85#diff-5081b9388de3add800b6e4a6ddf55c01 |
spark.sql.codegen.comments | 2.0.0 | SPARK-15680 | f0e8738c1ec0e4c5526aeada6f50cf76428f9afd#diff-8bcc5aea39c73d4bf38aef6f6951d42c |  
spark.sql.debug | 2.1.0 | SPARK-17899 | db8784feaa605adcbd37af4bc8b7146479b631f8#diff-32bb9518401c0948c5ea19377b5069ab |  
spark.sql.hive.thriftServer.singleSession | 1.6.0 | SPARK-11089 | 167ea61a6a604fd9c0b00122a94d1bc4b1de24ff#diff-ff50aea397a607b79df9bec6f2a841db |  
spark.sql.extensions | 2.2.0 | SPARK-18127 | f0de600797ff4883927d0c70732675fd8629e239#diff-5081b9388de3add800b6e4a6ddf55c01 |  
spark.sql.queryExecutionListeners | 2.3.0 | SPARK-19558 | bd4eb9ce57da7bacff69d9ed958c94f349b7e6fb#diff-5081b9388de3add800b6e4a6ddf55c01 |  
spark.sql.streaming.streamingQueryListeners | 2.4.0 | SPARK-24479 | 7703b46d2843db99e28110c4c7ccf60934412504#diff-5081b9388de3add800b6e4a6ddf55c01 |  
spark.sql.ui.retainedExecutions | 1.5.0 | SPARK-8861 and SPARK-8862 | ebc3aad272b91cf58e2e1b4aa92b49b8a947a045#diff-81764e4d52817f83bdd5336ef1226bd9 |  
spark.sql.broadcastExchange.maxThreadThreshold | 3.0.0 | SPARK-26601 | 126310ca68f2f248ea8b312c4637eccaba2fdc2b#diff-5081b9388de3add800b6e4a6ddf55c01 |  
spark.sql.subquery.maxThreadThreshold | 2.4.6 | SPARK-30556 | 2fc562cafd71ec8f438f37a28b65118906ab2ad2#diff-5081b9388de3add800b6e4a6ddf55c01 |  
spark.sql.event.truncate.length | 3.0.0 | SPARK-27045 | e60d8fce0b0cf2a6d766ea2fc5f994546550570a#diff-5081b9388de3add800b6e4a6ddf55c01 |
spark.sql.legacy.sessionInitWithConfigDefaults | 3.0.0 | SPARK-27253 | 83f628b57da39ad9732d1393aebac373634a2eb9#diff-5081b9388de3add800b6e4a6ddf55c01 |
spark.sql.defaultUrlStreamHandlerFactory.enabled | 3.0.0 | SPARK-25694 | 8469614c0513fbed87977d4e741649db3fdd8add#diff-5081b9388de3add800b6e4a6ddf55c01 |
spark.sql.streaming.ui.enabled | 3.0.0 | SPARK-29543 | f9b86370cb04b72a4f00cbd4d60873960aa2792c#diff-5081b9388de3add800b6e4a6ddf55c01 |  
spark.sql.streaming.ui.retainedProgressUpdates | 3.0.0 | SPARK-29543 | f9b86370cb04b72a4f00cbd4d60873960aa2792c#diff-5081b9388de3add800b6e4a6ddf55c01 |  
spark.sql.streaming.ui.retainedQueries | 3.0.0 | SPARK-29543 | f9b86370cb04b72a4f00cbd4d60873960aa2792c#diff-5081b9388de3add800b6e4a6ddf55c01 |  

### Why are the changes needed?
To supplement the configuration version information.

### Does this PR introduce any user-facing change?
'No'.

### How was this patch tested?
Existing UTs

Closes #27981 from beliefer/add-version-to-sql-static-config.

Authored-by: beliefer <beliefer@163.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-03-31 12:31:25 +09:00
Dongjoon Hyun cda2e30e77
Revert "[SPARK-31280][SQL] Perform propagating empty relation after RewritePredicateSubquery"
This reverts commit f376d24ea1.
2020-03-30 19:14:14 -07:00
Kent Yao f376d24ea1
[SPARK-31280][SQL] Perform propagating empty relation after RewritePredicateSubquery
### What changes were proposed in this pull request?
```sql
scala> spark.sql(" select * from values(1), (2) t(key) where key in (select 1 as key where 1=0)").queryExecution
res15: org.apache.spark.sql.execution.QueryExecution =
== Parsed Logical Plan ==
'Project [*]
+- 'Filter 'key IN (list#39 [])
   :  +- Project [1 AS key#38]
   :     +- Filter (1 = 0)
   :        +- OneRowRelation
   +- 'SubqueryAlias t
      +- 'UnresolvedInlineTable [key], [List(1), List(2)]

== Analyzed Logical Plan ==
key: int
Project [key#40]
+- Filter key#40 IN (list#39 [])
   :  +- Project [1 AS key#38]
   :     +- Filter (1 = 0)
   :        +- OneRowRelation
   +- SubqueryAlias t
      +- LocalRelation [key#40]

== Optimized Logical Plan ==
Join LeftSemi, (key#40 = key#38)
:- LocalRelation [key#40]
+- LocalRelation <empty>, [key#38]

== Physical Plan ==
*(1) BroadcastHashJoin [key#40], [key#38], LeftSemi, BuildRight
:- *(1) LocalTableScan [key#40]
+- Br...
```

`LocalRelation <empty>` should be able to propagate after subqueries are lifted up to joins

### Why are the changes needed?

Optimize the query.

### Does this PR introduce any user-facing change?

no
### How was this patch tested?

add new tests

Closes #28043 from yaooqinn/SPARK-31280.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-03-29 11:32:22 -07:00
Zhenhua Wang 791d2ba346 [SPARK-31261][SQL] Avoid npe when reading bad csv input with columnNameCorruptRecord specified
### What changes were proposed in this pull request?

SPARK-25387 avoids npe for bad csv input, but when reading bad csv input with `columnNameCorruptRecord` specified, `getCurrentInput` is called and it still throws npe.

### Why are the changes needed?

Bug fix.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Add a test.

Closes #28029 from wzhfy/corrupt_column_npe.

Authored-by: Zhenhua Wang <wzh_zju@163.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-03-29 13:30:14 +09:00
Liang-Chi Hsieh aa8776bb59
[SPARK-29721][SQL] Prune unnecessary nested fields from Generate without Project
### What changes were proposed in this pull request?

This patch proposes to prune unnecessary nested fields from Generate which has no Project on top of it.

### Why are the changes needed?

In Optimizer, we can prune nested columns from Project(projectList, Generate). However, unnecessary columns could still possibly be read in Generate, if no Project on top of it. We should prune it too.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Unit test.

Closes #27517 from viirya/SPARK-29721-2.

Lead-authored-by: Liang-Chi Hsieh <liangchi@uber.com>
Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-03-27 10:47:21 -07:00
Wenchen Fan 8a5d49610d
[MINOR][DOC] Refine comments of QueryPlan regarding subquery
### What changes were proposed in this pull request?

The query plan of Spark SQL is a mutually recursive structure: QueryPlan -> Expression (PlanExpression) -> QueryPlan, but the transformations do not take this into account.

This PR refines the comments of `QueryPlan` to highlight this fact.

### Why are the changes needed?

better document.

### Does this PR introduce any user-facing change?

no

### How was this patch tested?

N/A

Closes #28050 from cloud-fan/comment.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-03-27 09:35:35 -07:00
Maxim Gekk 9f0c010a5c [SPARK-31277][SQL][TESTS] Migrate DateTimeTestUtils from TimeZone to ZoneId
### What changes were proposed in this pull request?
In the PR, I propose to change types of `DateTimeTestUtils` values and functions by replacing `java.util.TimeZone` to `java.time.ZoneId`. In particular:
1. Type of `ALL_TIMEZONES` is changed to `Seq[ZoneId]`.
2. Remove `val outstandingTimezones: Seq[TimeZone]`.
3. Change the type of the time zone parameter in `withDefaultTimeZone` to `ZoneId` (see the sketch after this list).
4. Modify affected test suites.
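
A hedged sketch of what the changed helper in item 3 might look like; the body here is an assumption for illustration, and the real `DateTimeTestUtils` implementation may differ:

```scala
import java.time.ZoneId
import java.util.TimeZone

// Runs `block` with the JVM default time zone temporarily set from a ZoneId,
// instead of requiring callers to pass a java.util.TimeZone.
def withDefaultTimeZone[T](zoneId: ZoneId)(block: => T): T = {
  val original = TimeZone.getDefault
  try {
    TimeZone.setDefault(TimeZone.getTimeZone(zoneId))
    block
  } finally {
    TimeZone.setDefault(original)
  }
}
```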

### Why are the changes needed?
Currently, Spark SQL's date-time expressions and functions have already been ported to the Java 8 time API, but tests still use the old time APIs. In particular, `DateTimeTestUtils` exposes functions that accept only TimeZone instances. This is inconvenient and CPU consuming, because we need to convert TimeZone instances to ZoneId instances via strings (zone ids).

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
By affected test suites executed by jenkins builds.

Closes #28033 from MaxGekk/with-default-time-zone.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-27 21:14:25 +08:00
Kent Yao 5945d46c11 [SPARK-31225][SQL] Override sql method of OuterReference
### What changes were proposed in this pull request?

OuterReference is a LeafExpression, so its children are Nil, which makes its SQL representation always be `outer()`. This makes our explain command output and error messages unclear when an OuterReference exists.
e.g.

```scala
org.apache.spark.sql.AnalysisException:
Aggregate/Window/Generate expressions are not valid in where clause of the query.
Expression in where clause: [(in.`value` = max(outer()))]
Invalid expressions: [max(outer())];;
```
This PR overrides its `sql` method using its `prettyName` and the `sql` method of its single argument `e`, as sketched below.
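
A self-contained, simplified analogue of the change (the classes `Expr`, `Attr`, and `OuterRef` are invented stand-ins, not the real catalyst classes; they only illustrate the override):

```scala
trait Expr { def prettyName: String; def sql: String }

case class Attr(name: String) extends Expr {
  def prettyName = "attr"
  def sql = s"`$name`"
}

case class OuterRef(e: Expr) extends Expr {
  def prettyName = "outer"
  // Before: a leaf expression with no children rendered as just "outer()".
  // After: include the wrapped expression's SQL, e.g. "outer(`value`)".
  def sql = s"$prettyName(${e.sql})"
}

println(OuterRef(Attr("value")).sql)  // prints: outer(`value`)
```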

### Why are the changes needed?

Improve the error message.

### Does this PR introduce any user-facing change?

Yes, the error message produced for an OuterReference has changed.
### How was this patch tested?

Modified UT results.

Closes #27985 from yaooqinn/SPARK-31225.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-27 15:21:19 +08:00
DB Tsai cb0db21373 [SPARK-25556][SPARK-17636][SPARK-31026][SPARK-31060][SQL][TEST-HIVE1.2] Nested Column Predicate Pushdown for Parquet
### What changes were proposed in this pull request?
1. `DataSourceStrategy.scala` is extended to create `org.apache.spark.sql.sources.Filter` from nested expressions.
2. Translation from nested `org.apache.spark.sql.sources.Filter` to `org.apache.parquet.filter2.predicate.FilterPredicate` is implemented to support nested predicate pushdown for Parquet.

### Why are the changes needed?
Better performance for handling nested predicate pushdown.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
New tests are added.

Closes #27728 from dbtsai/SPARK-17636.

Authored-by: DB Tsai <d_tsai@apple.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-27 14:28:57 +08:00
Wenchen Fan 05498af72e [SPARK-31201][SQL] Add an individual config for skewed partition threshold
### What changes were proposed in this pull request?

Skew join handling comes with an overhead: we need to read some data repeatedly. We should treat a partition as skewed if it's large enough so that it's beneficial to do so.

Currently the size threshold is the advisory partition size, which is 64 MB by default. This is not large enough for the skewed partition size threshold.

This PR adds a new config for the threshold and sets its default value to 256 MB.
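
A possible usage sketch; the config key names below are assumptions for illustration and may not match the final names exactly:

```scala
// Tune adaptive skew-join handling: treat a partition as skewed only if it is
// larger than the dedicated threshold, independently of the advisory size.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.advisoryPartitionSizeInBytes", "64MB")
spark.conf.set("spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes", "256MB")
```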

### Why are the changes needed?

Avoid skew join handling that may introduce a  perf regression.

### Does this PR introduce any user-facing change?

no

### How was this patch tested?

existing tests

Closes #27967 from cloud-fan/aqe.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-03-26 22:57:01 +09:00
HyukjinKwon 3bd10ce007 [SPARK-31227][SQL] Non-nullable null type in complex types should not coerce to nullable type
### What changes were proposed in this pull request?

This PR targets for non-nullable null type not to coerce to nullable type in complex types.

Non-nullable fields in a struct, elements in an array, and entries in a map can mean an empty struct, array, or map. Since they are empty, there is no need to force nullability when we find common types.

This PR also reverts and supersedes d7b97a1d0d

### Why are the changes needed?

To make type coercion coherent and consistent. Currently, we correctly keep the nullability even between non-nullable fields:

```scala
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
spark.range(1).select(array(lit(1)).cast(ArrayType(IntegerType, false))).printSchema()
spark.range(1).select(array(lit(1)).cast(ArrayType(DoubleType, false))).printSchema()
```
```scala
spark.range(1).selectExpr("concat(array(1), array(1)) as arr").printSchema()
```

### Does this PR introduce any user-facing change?

Yes.

```scala
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
spark.range(1).select(array().cast(ArrayType(IntegerType, false))).printSchema()
```
```scala
spark.range(1).selectExpr("concat(array(), array(1)) as arr").printSchema()
```

**Before:**

```
org.apache.spark.sql.AnalysisException: cannot resolve 'array()' due to data type mismatch: cannot cast array<null> to array<int>;;
'Project [cast(array() as array<int>) AS array()#68]
+- Range (0, 1, step=1, splits=Some(12))

  at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$$nestedInanonfun$checkAnalysis$1$2.applyOrElse(CheckAnalysis.scala:149)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$$nestedInanonfun$checkAnalysis$1$2.applyOrElse(CheckAnalysis.scala:140)
  at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUp$2(TreeNode.scala:333)
  at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:333)
  at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformUp$1(TreeNode.scala:330)
  at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:399)
  at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:237)
```

```
root
 |-- arr: array (nullable = false)
 |    |-- element: integer (containsNull = true)
```

**After:**

```
root
 |-- array(): array (nullable = false)
 |    |-- element: integer (containsNull = false)
```

```
root
 |-- arr: array (nullable = false)
 |    |-- element: integer (containsNull = false)
```

### How was this patch tested?

Unittests were added and manually tested.

Closes #27991 from HyukjinKwon/SPARK-31227.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-26 15:42:54 +08:00
Maxim Gekk cec9604eae [SPARK-31237][SQL][TESTS] Replace 3-letter time zones by zone offsets
### What changes were proposed in this pull request?
In the PR, I propose to add a few `ZoneId` constant values to the `DateTimeTestUtils` object, and reuse the constants in tests. The following constants are proposed (see the sketch after this list):
- PST = -08:00
- UTC = +00:00
- CEST = +02:00
- CET = +01:00
- JST = +09:00
- MIT = -09:30
- LA = America/Los_Angeles
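
A short Scala sketch of what these constants could look like; the values are taken from the list above, but the exact declarations in `DateTimeTestUtils` may differ:

```scala
import java.time.ZoneId

val PST: ZoneId = ZoneId.of("-08:00")
val UTC: ZoneId = ZoneId.of("+00:00")
val CEST: ZoneId = ZoneId.of("+02:00")
val CET: ZoneId = ZoneId.of("+01:00")
val JST: ZoneId = ZoneId.of("+09:00")
val MIT: ZoneId = ZoneId.of("-09:30")
val LA: ZoneId = ZoneId.of("America/Los_Angeles")
```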

### Why are the changes needed?
All proposed constant values (except `LA`) are initialized with zone offsets according to their definitions. This will allow us to avoid:
- Using of 3-letter time zones that have been already deprecated in JDK, see _Three-letter time zone IDs_ in https://docs.oracle.com/javase/8/docs/api/java/util/TimeZone.html
- Incorrect mapping of 3-letter time zones to zone offsets, see SPARK-31237. For example, `PST` is mapped to `America/Los_Angeles` instead of the `-08:00` zone offset.

Also this should improve stability and maintainability of test suites.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
By running affected test suites.

Closes #28001 from MaxGekk/replace-pst.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-26 13:36:00 +08:00
Wenchen Fan 4f274a4de9
[SPARK-31147][SQL] Forbid CHAR type in non-Hive-Serde tables
### What changes were proposed in this pull request?

Spark introduced the CHAR type for Hive compatibility, but it only works for Hive tables. The CHAR type is never documented and is treated as the STRING type for non-Hive tables.

However, this leads to confusing behaviors

**Apache Spark 3.0.0-preview2**
```
spark-sql> CREATE TABLE t(a CHAR(3));

spark-sql> INSERT INTO TABLE t SELECT 'a ';

spark-sql> SELECT a, length(a) FROM t;
a 	2
```

**Apache Spark 2.4.5**
```
spark-sql> CREATE TABLE t(a CHAR(3));

spark-sql> INSERT INTO TABLE t SELECT 'a ';

spark-sql> SELECT a, length(a) FROM t;
a  	3
```

According to the SQL standard, `CHAR(3)` should guarantee all the values are of length 3. Since `CHAR(3)` is treated as STRING, Spark doesn't guarantee it.

This PR forbids CHAR type in non-Hive tables as it's not supported correctly.

### Why are the changes needed?

avoid confusing/wrong behavior

### Does this PR introduce any user-facing change?

yes, now users can't create/alter non-Hive tables with CHAR type.

### How was this patch tested?

new tests

Closes #27902 from cloud-fan/char.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-03-25 09:25:55 -07:00
Takeshi Yamamuro da49f50621
[SPARK-25121][SQL][FOLLOWUP] Add more unit tests for multi-part identifiers in join strategy hints
### What changes were proposed in this pull request?

This pr intends to add unit tests for the other join hints (`MERGEJOIN`, `SHUFFLE_HASH`, and `SHUFFLE_REPLICATE_NL`). This is a followup PR of #27935.

### Why are the changes needed?

For better test coverage.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Added unit tests.

Closes #28013 from maropu/SPARK-25121-FOLLOWUP.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-03-25 08:37:28 -07:00
Maxim Gekk 27d53de10f [SPARK-31232][SQL][DOCS] Specify formats of spark.sql.session.timeZone
### What changes were proposed in this pull request?
In the PR, I propose to update the doc for `spark.sql.session.timeZone`, and restrict the format of the config's values to two forms:
1. Geographical regions, such as `America/Los_Angeles`.
2. Fixed offsets - a fully resolved offset from UTC. For example, `-08:00`.

### Why are the changes needed?
Other formats such as three-letter time zone IDs are ambiguous, and depend on the locale. For example, `CST` could be U.S. `Central Standard Time` or `China Standard Time`. Such formats have already been deprecated in the JDK, see [Three-letter time zone IDs](https://docs.oracle.com/javase/8/docs/api/java/util/TimeZone.html).

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
By running `./dev/scalastyle`, and manual testing.

Closes #27999 from MaxGekk/doc-session-time-zone.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-25 16:32:28 +08:00
samsetegne 44431d4b1a [SPARK-30822][SQL] Remove semicolon at the end of a sql query
### What changes were proposed in this pull request?
This change proposes ignoring a terminating semicolon from queries submitted by the user (if included) instead of raising a parse exception.

### Why are the changes needed?
When a user submits a directly executable SQL statement terminated with a semicolon, they receive an `org.apache.spark.sql.catalyst.parser.ParseException` of `extraneous input ';' expecting <EOF>`. SQL-92 describes a direct SQL statement as having the format of `<directly executable statement> <semicolon>` and the majority of SQL implementations either require the semicolon as a statement terminator, or make it optional (meaning not raising an exception when it's included, seemingly in recognition that it's a common behavior).

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Unit test added to `PlanParserSuite`
```
sbt> project catalyst
sbt> testOnly *PlanParserSuite
[info] - case insensitive (565 milliseconds)
[info] - explain (9 milliseconds)
[info] - set operations (41 milliseconds)
[info] - common table expressions (31 milliseconds)
[info] - simple select query (47 milliseconds)
[info] - hive-style single-FROM statement (11 milliseconds)
[info] - multi select query (32 milliseconds)
[info] - query organization (41 milliseconds)
[info] - insert into (12 milliseconds)
[info] - aggregation (24 milliseconds)
[info] - limit (11 milliseconds)
[info] - window spec (11 milliseconds)
[info] - lateral view (17 milliseconds)
[info] - joins (62 milliseconds)
[info] - sampled relations (11 milliseconds)
[info] - sub-query (11 milliseconds)
[info] - scalar sub-query (9 milliseconds)
[info] - table reference (2 milliseconds)
[info] - table valued function (8 milliseconds)
[info] - SPARK-20311 range(N) as alias (2 milliseconds)
[info] - SPARK-20841 Support table column aliases in FROM clause (3 milliseconds)
[info] - SPARK-20962 Support subquery column aliases in FROM clause (4 milliseconds)
[info] - SPARK-20963 Support aliases for join relations in FROM clause (3 milliseconds)
[info] - inline table (23 milliseconds)
[info] - simple select query with !> and !< (5 milliseconds)
[info] - select hint syntax (34 milliseconds)
[info] - SPARK-20854: select hint syntax with expressions (12 milliseconds)
[info] - SPARK-20854: multiple hints (4 milliseconds)
[info] - TRIM function (16 milliseconds)
[info] - OVERLAY function (16 milliseconds)
[info] - precedence of set operations (18 milliseconds)
[info] - create/alter view as insert into table (4 milliseconds)
[info] - Invalid insert constructs in the query (10 milliseconds)
[info] - relation in v2 catalog (3 milliseconds)
[info] - CTE with column alias (2 milliseconds)
[info] - statement containing terminal semicolons (3 milliseconds)
[info] ScalaTest
[info] Run completed in 3 seconds, 129 milliseconds.
[info] Total number of tests run: 36
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 36, failed 0, canceled 0, ignored 0, pending 0
[info] All tests passed.
[info] Passed: Total 36, Failed 0, Errors 0, Passed 36
```

### Current behavior:
#### scala
```scala
scala> val df = sql("select 1")
// df: org.apache.spark.sql.DataFrame = [1: int]
scala> df.show()
// +---+
// |  1|
// +---+
// |  1|
// +---+

scala> val df = sql("select 1;")
// org.apache.spark.sql.catalyst.parser.ParseException:
// extraneous input ';' expecting <EOF>(line 1, pos 8)

// == SQL ==
// select 1;
// --------^^^

//   at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:263)
//   at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:130)
//   at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:52)
//   at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:76)
//   at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:605)
//   at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111)
//   at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:605)
//   ... 47 elided
```
#### pyspark
```python
df = spark.sql('select 1')
df.show()
#+---+
#|  1|
#+---+
#|  1|
#+---+

df = spark.sql('select 1;')
# Traceback (most recent call last):
#   File "<stdin>", line 1, in <module>
#   File "/Users/ssetegne/spark/python/pyspark/sql/session.py", line 646, in sql
#     return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
#   File "/Users/ssetegne/spark/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py", line 1286, in # __call__
#   File "/Users/ssetegne/spark/python/pyspark/sql/utils.py", line 102, in deco
#     raise converted
# pyspark.sql.utils.ParseException:
# extraneous input ';' expecting <EOF>(line 1, pos 8)

# == SQL ==
# select 1;
# --------^^^
```

### Behavior after proposed fix:
#### scala
```scala
scala> val df = sql("select 1")
// df: org.apache.spark.sql.DataFrame = [1: int]
scala> df.show()
// +---+
// |  1|
// +---+
// |  1|
// +---+

scala> val df = sql("select 1;")
// df: org.apache.spark.sql.DataFrame = [1: int]
scala> df.show()
// +---+
// |  1|
// +---+
// |  1|
// +---+
```
#### pyspark
```python
df = spark.sql('select 1')
df.show()
#+---+
#|    1 |
#+---+
#|    1 |
#+---+

df = spark.sql('select 1;')
df.show()
#+---+
#|    1 |
#+---+
#|    1 |
#+---+
```

Closes #27567 from samsetegne/semicolon.

Authored-by: samsetegne <samuelsetegne@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-25 15:00:15 +08:00
yi.wu f6ff7d0cf8 [SPARK-30127][SQL] Support case class parameter for typed Scala UDF
### What changes were proposed in this pull request?

To support case class parameters for typed Scala UDF, e.g.

```
case class TestData(key: Int, value: String)
val f = (d: TestData) => d.key * d.value.toInt
val myUdf = udf(f)
val df = Seq(("data", TestData(50, "2"))).toDF("col1", "col2")
checkAnswer(df.select(myUdf(Column("col2"))), Row(100) :: Nil)
```

### Why are the changes needed?

Currently, a Spark UDF can only work on data types like java.lang.String, o.a.s.sql.Row, Seq[_], etc. This is inconvenient if a user wants to apply an operation on one column and the column is a struct type: you must access data from a Row object instead of a domain object, unlike Dataset operations. It would be great if UDFs could work on types that are supported by Dataset, e.g. case classes.

And here's a benchmark result comparing the case class approach to Row:

```scala

// case class:  58ms 65ms 59ms 64ms 61ms
// row:         59ms 64ms 73ms 84ms 69ms
val f1 = (d: TestData) => s"${d.key}, ${d.value}"
val f2 = (r: Row) => s"${r.getInt(0)}, ${r.getString(1)}"
val udf1 = udf(f1)
// set spark.sql.legacy.allowUntypedScalaUDF=true
val udf2 = udf(f2, StringType)

val df = spark.range(100000).selectExpr("cast (id as int) as id")
    .select(struct('id, lit("str")).as("col"))
df.cache().collect()

// warmup to exclude some extra influence
df.select(udf1('col)).write.mode(SaveMode.Overwrite).format("noop").save()
df.select(udf2('col)).write.mode(SaveMode.Overwrite).format("noop").save()

start = System.currentTimeMillis()
df.select(udf1('col)).write.mode(SaveMode.Overwrite).format("noop").save()
println(System.currentTimeMillis() - start)

start = System.currentTimeMillis()
df.select(udf2('col)).write.mode(SaveMode.Overwrite).format("noop").save()
println(System.currentTimeMillis() - start)

```

### Does this PR introduce any user-facing change?

Yes. Users can now use a typed Scala UDF with a case class as the input parameter.

### How was this patch tested?

Added unit tests.

Closes #27937 from Ngone51/udf_caseclass_support.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-24 23:03:57 +08:00
Maxim Gekk 1fd4607d84 [SPARK-31221][SQL] Rebase any date-times in conversions to/from Java types
### What changes were proposed in this pull request?
In the PR, I propose to apply rebasing for all dates/timestamps in the conversion functions `fromJavaDate()`, `toJavaDate()`, `toJavaTimestamp()` and `fromJavaTimestamp()`. The rebasing is performed by building a local date-time in the original calendar, extracting date-time fields from the result, and creating a new local date-time in the target calendar.
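
A rough Scala sketch of the field-extraction approach described above (simplified: it assumes the JVM default time zone and ignores corner cases such as BC dates; the real `DateTimeUtils` code is more involved):

```scala
import java.time.{LocalDateTime, ZoneId}
import java.util.{Calendar, TimeZone}

def rebaseJulianToGregorianMicros(micros: Long): Long = {
  val zoneId = ZoneId.systemDefault()
  // 1. Interpret the instant in the hybrid (Julian + Gregorian) calendar.
  val cal = Calendar.getInstance(TimeZone.getTimeZone(zoneId))
  cal.setTimeInMillis(Math.floorDiv(micros, 1000L))
  // 2. Extract the wall-clock fields and rebuild the same wall-clock time
  //    as a local date-time in the Proleptic Gregorian calendar.
  val ldt = LocalDateTime.of(
    cal.get(Calendar.YEAR), cal.get(Calendar.MONTH) + 1, cal.get(Calendar.DAY_OF_MONTH),
    cal.get(Calendar.HOUR_OF_DAY), cal.get(Calendar.MINUTE), cal.get(Calendar.SECOND))
  // 3. Convert back to microseconds since the epoch, keeping the sub-second part.
  ldt.atZone(zoneId).toEpochSecond * 1000000L + Math.floorMod(micros, 1000000L)
}
```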

### Why are the changes needed?
The changes are needed for compatibility with previous Spark versions (2.4.5 and earlier), not only for dates before the Gregorian cutover date `1582-10-15` but also for dates after it. For instance, the Gregorian calendar implementation in Java 7, `java.util.GregorianCalendar`, is not as accurate in resolving time zone offsets as the Gregorian calendar introduced in Java 8.

### Does this PR introduce any user-facing change?
Yes, this PR can introduce behavior changes for dates after `1582-10-15`, in particular conversions of zone ids to zone offsets will be much more accurate.

### How was this patch tested?
By existing test suites `DateTimeUtilsSuite`, `DateFunctionsSuite`, `DateExpressionsSuite`, `CollectionExpressionsSuite`, `HiveOrcHadoopFsRelationSuite`, `ParquetIOSuite`.

Closes #27980 from MaxGekk/reuse-rebase-funcs-in-java-funcs.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-24 21:14:25 +08:00
HyukjinKwon bd324007d5 [SPARK-31229][SQL][TESTS] Add unit tests TypeCoercion.findTypeForComplex and Cast.canCast in null <> complex types
### What changes were proposed in this pull request?

This PR (SPARK-31229) is rather a followup of https://github.com/apache/spark/pull/27926 (SPARK-31166). It adds unittests for `TypeCoercion.findTypeForComplex` and `Cast.canCast` about struct, map and array with the respect to null types.

### Why are the changes needed?

To detect which scope was broken in the future easily.

### Does this PR introduce any user-facing change?

No, it's a test-only.

### How was this patch tested?

Unittests were added.

Closes #27990 from HyukjinKwon/SPARK-31166-followup.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-03-24 14:10:59 +09:00
Wenchen Fan 1d0f54951e [SPARK-31205][SQL] support string literal as the second argument of date_add/date_sub functions
### What changes were proposed in this pull request?

https://github.com/apache/spark/pull/26412 introduced a behavior change that `date_add`/`date_sub` functions can't accept string and double values in the second parameter. This is reasonable as it's error-prone to cast string/double to int at runtime.

However, using string literals as function arguments is very common in SQL databases. To avoid breaking valid use cases that the string literal is indeed an integer, this PR proposes to add ansi_cast for string literal in date_add/date_sub functions. If the string value is not a valid integer, we fail at query compiling time because of constant folding.

### Why are the changes needed?

avoid breaking changes

### Does this PR introduce any user-facing change?

Yes, now 3.0 can run `date_add('2011-11-11', '1')` like 2.4

### How was this patch tested?

new tests.

Closes #27965 from cloud-fan/string.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-24 12:07:22 +08:00
yi.wu 5c4d44bb83 [SPARK-31190][SQL] ScalaReflection should not erasure user defined AnyVal type
### What changes were proposed in this pull request?

Improve `ScalaReflection` so that only user-defined `AnyVal` types skip type erasure, while other types, e.g. `Any`, are still erased. This brings two benefits:

1. Gives a better encoding error message for some unsupported types, e.g. `Any`

2. Won't miss the walked path for user-defined `AnyVal` types

### Why are the changes needed?

Firstly, PR #15284 added encoding (serializeFor/deserializeFor) support for value classes, which extend `AnyVal`, by not erasing types. But this also introduced a problem: when a user tries to encode unsupported types, e.g. `Any`, it fails with `java.lang.ClassNotFoundException: scala.Any` because `scala.Any` isn't erased to `java.lang.Object`.

Also, the current `getClassNameFromType()` always erases types, which could miss the walked path for user-defined `AnyVal` types.

### Does this PR introduce any user-facing change?

Yes. For the test below:

```
case class Bar(i: Any)
case class Foo(i: Bar) extends AnyVal

test() {
  implicitly[ExpressionEncoder[Foo]]
}
```

Before:

```
java.lang.ClassNotFoundException: scala.Any
 at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
 at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
 at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:355)
 ...
````

After:
```
java.lang.UnsupportedOperationException: No Encoder found for Any
 - field (class: "java.lang.Object", name: "i")
 - field (class: "org.apache.spark.sql.catalyst.encoders.Bar", name: "i")
 - root class: "org.apache.spark.sql.catalyst.encoders.Foo"
 at org.apache.spark.sql.catalyst.ScalaReflection$.$anonfun$serializerFor$1(ScalaReflection.scala:561)
```

### How was this patch tested?

Added unit test and test manually.

Closes #27959 from Ngone51/impr_anyval.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-23 16:28:34 +08:00
Maxim Gekk db6247faa8 [SPARK-31211][SQL] Fix rebasing of 29 February of Julian leap years
### What changes were proposed in this pull request?
In the PR, I propose to fix the issue of rebasing leap years in the Julian calendar to the Proleptic Gregorian calendar, in which those years are not leap years. In the Julian calendar, every fourth year is a leap year, with a leap day added to the month of February. In the Proleptic Gregorian calendar, every year that is exactly divisible by four is a leap year, except for years that are exactly divisible by 100; these centurial years are leap years only if they are exactly divisible by 400. In this way, the date **1000-02-29** exists in the Julian calendar but not in the Proleptic Gregorian calendar.

I modified the `rebaseJulianToGregorianMicros()` and `rebaseJulianToGregorianDays()` in `DateTimeUtils` by passing 1 as a day number of month while forming `LocalDate` or `LocalDateTime`, and adding the number of days using the `plusDays()` method. For example, **1000-02-29** doesn't exist in Proleptic Gregorian calendar, and `LocalDate.of(1000, 2, 29)` throws an exception. To avoid the issue, I build the `LocalDate.of(1000, 2, 1)` date and add 28 days. The `plusDays(28)` method produces the next valid date after `1000-02-28` which is **1000-03-01**.
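
A tiny Scala illustration of the trick described above (not the actual patch, just the idea):

```scala
import java.time.LocalDate

// LocalDate.of(1000, 2, 29) would throw java.time.DateTimeException, because
// 1000 is not a leap year in the Proleptic Gregorian calendar. Building the
// first day of the month and adding the day offset rolls over to the next
// valid date instead.
val rebased = LocalDate.of(1000, 2, 1).plusDays(28)
println(rebased)  // 1000-03-01
```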

### Why are the changes needed?
Before the changes, the `java.time.DateTimeException` exception is raised while loading the date `1000-02-29` from parquet files saved by Spark 2.4.5:
```scala
scala> spark.conf.set("spark.sql.legacy.parquet.rebaseDateTime.enabled", true)
scala> spark.read.parquet("/Users/maxim/tmp/before_1582/2_4_5_date_leap").show
20/03/21 03:03:59 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 3)
java.time.DateTimeException: Invalid date 'February 29' as '1000' is not a leap year
```
The parquet files were saved via the commands:
```shell
$ export TZ="America/Los_Angeles"
```
```scala
scala> scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
scala> val df = Seq(java.sql.Date.valueOf("1000-02-29")).toDF("dateS").select($"dateS".as("date"))
df: org.apache.spark.sql.DataFrame = [date: date]
scala> df.write.mode("overwrite").parquet("/Users/maxim/tmp/before_1582/2_4_5_date_leap")
scala> spark.read.parquet("/Users/maxim/tmp/before_1582/2_4_5_date_leap").show
+----------+
|      date|
+----------+
|1000-02-29|
+----------+
```

### Does this PR introduce any user-facing change?
Yes, after the fix:
```scala
scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
scala> spark.conf.set("spark.sql.legacy.parquet.rebaseDateTime.enabled", true)
scala> spark.read.parquet("/Users/maxim/tmp/before_1582/2_4_5_date_leap").show
+----------+
|      date|
+----------+
|1000-03-01|
+----------+
```

### How was this patch tested?
Added tests to `DateTimeUtilsSuite`.

Closes #27974 from MaxGekk/julian-date-29-feb.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-23 14:21:24 +08:00
Kent Yao 88ae6c4481 [SPARK-31189][SQL][DOCS] Fix errors and missing parts for datetime pattern document
### What changes were proposed in this pull request?

Fix errors and missing parts for datetime pattern document
1. The pattern we use is similar to DateTimeFormatter and SimpleDateFormat but not identical. So we shouldn't reference either of them in the API docs, but should instead link to our own doc.
2. Some pattern letters are missing
3. Some pattern letters are explicitly banned - Set('A', 'c', 'e', 'n', 'N')
4. The second-fraction pattern has different logic for parsing and formatting

### Why are the changes needed?

fix and improve doc
### Does this PR introduce any user-facing change?

yes, new and updated doc
### How was this patch tested?

pass Jenkins
viewed locally with `jekyll serve`
![image](https://user-images.githubusercontent.com/8326978/77044447-6bd3bb00-69fa-11ea-8d6f-7084166c5dea.png)

Closes #27956 from yaooqinn/SPARK-31189.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-20 21:59:26 +08:00
Maxim Gekk 4766a36647 [SPARK-31183][SQL] Rebase date/timestamp from/to Julian calendar in Avro
### What changes were proposed in this pull request?
The PR addresses the issue of compatibility with Spark 2.4 and earlier version in reading/writing dates and timestamp via **Avro** datasource. Previous releases are based on a hybrid calendar - Julian + Gregorian. Since Spark 3.0, Proleptic Gregorian calendar is used by default, see SPARK-26651. In particular, the issue pops up for dates/timestamps before 1582-10-15 when the hybrid calendar switches from/to Gregorian to/from Julian calendar. The same local date in different calendar is converted to different number of days since the epoch 1970-01-01. For example, the 1001-01-01 date is converted to:
- -719164 in Julian calendar. Spark 2.4 saves the number as a value of DATE type into **Avro** files.
- -719162 in Proleptic Gregorian calendar. Spark 3.0 saves the number as a date value.

The PR proposes rebasing from/to Proleptic Gregorian calendar to the hybrid one under the SQL config:
```
spark.sql.legacy.avro.rebaseDateTime.enabled
```
which is set to `false` by default which means the rebasing is not performed by default.

The details of the implementation:
1. Re-use 2 methods of `DateTimeUtils` added by the PR https://github.com/apache/spark/pull/27915 for rebasing microseconds.
2. Re-use 2 methods of `DateTimeUtils` added by the PR https://github.com/apache/spark/pull/27915 for rebasing days.
3. Use `rebaseGregorianToJulianMicros()` and `rebaseGregorianToJulianDays()` while saving timestamps/dates to **Avro** files if the SQL config is on.
4. Use `rebaseJulianToGregorianMicros()` and `rebaseJulianToGregorianDays()` while loading timestamps/dates from **Avro** files if the SQL config is on.
5. The SQL config `spark.sql.legacy.avro.rebaseDateTime.enabled` controls conversions from/to dates, and timestamps of the `timestamp-millis`, `timestamp-micros` logical types.

### Why are the changes needed?
For the backward compatibility with Spark 2.4 and earlier versions. The changes allow users to read dates/timestamps saved by previous version, and get the same result. Also after the changes, users can enable the rebasing in write, and save dates/timestamps that can be loaded correctly by Spark 2.4 and earlier versions.

### Does this PR introduce any user-facing change?
Yes, the timestamp `1001-01-01 01:02:03.123456` saved by Spark 2.4.5 as `timestamp-micros` is interpreted by Spark 3.0.0-preview2 differently:
```scala
scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
scala> spark.read.format("avro").load("/Users/maxim/tmp/before_1582/2_4_5_date_avro").show(false)
+----------+
|date      |
+----------+
|1001-01-07|
+----------+
```
After the changes:
```scala
scala> spark.conf.set("spark.sql.legacy.avro.rebaseDateTime.enabled", true)
scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")

scala> spark.read.format("avro").load("/Users/maxim/tmp/before_1582/2_4_5_date_avro").show(false)
+----------+
|date      |
+----------+
|1001-01-01|
+----------+
```

### How was this patch tested?
1. Added tests to `AvroLogicalTypeSuite` to check rebasing in read. The test reads back avro files saved by Spark 2.4.5 via:
```shell
$ export TZ="America/Los_Angeles"
```
```scala
scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
scala> val df = Seq("1001-01-01").toDF("dateS").select($"dateS".cast("date").as("date"))
df: org.apache.spark.sql.DataFrame = [date: date]
scala> df.write.format("avro").save("/Users/maxim/tmp/before_1582/2_4_5_date_avro")

scala> val df2 = Seq("1001-01-01 01:02:03.123456").toDF("tsS").select($"tsS".cast("timestamp").as("ts"))
df2: org.apache.spark.sql.DataFrame = [ts: timestamp]
scala> df2.write.format("avro").save("/Users/maxim/tmp/before_1582/2_4_5_ts_avro")

scala> :paste
// Entering paste mode (ctrl-D to finish)

  val timestampSchema = s"""
    |  {
    |    "namespace": "logical",
    |    "type": "record",
    |    "name": "test",
    |    "fields": [
    |      {"name": "ts", "type": ["null", {"type": "long","logicalType": "timestamp-millis"}], "default": null}
    |    ]
    |  }
    |""".stripMargin

// Exiting paste mode, now interpreting.
scala> df3.write.format("avro").option("avroSchema", timestampSchema).save("/Users/maxim/tmp/before_1582/2_4_5_ts_millis_avro")

```

2. Added the following tests to `AvroLogicalTypeSuite` to check rebasing of dates/timestamps (in microsecond and millisecond precision). The tests write rebased a date/timestamps and read them back w/ enabled/disabled rebasing, and compare results. :
  - `rebasing microseconds timestamps in write`
  - `rebasing milliseconds timestamps in write`
  - `rebasing dates in write`

Closes #27953 from MaxGekk/rebase-avro-datetime.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-20 13:57:49 +08:00
Dongjoon Hyun f1cc86792f [SPARK-31181][SQL][TESTS] Remove the default value assumption on CREATE TABLE test cases
### What changes were proposed in this pull request?

A few `CREATE TABLE` test cases have some assumption on the default value of `LEGACY_CREATE_HIVE_TABLE_BY_DEFAULT_ENABLED`. This PR (SPARK-31181) makes the test cases more explicit from test-case side.

The configuration change was tested via https://github.com/apache/spark/pull/27894 during discussing SPARK-31136. This PR has only the test case part from that PR.

### Why are the changes needed?

This makes our test cases more robust in terms of the default value of `LEGACY_CREATE_HIVE_TABLE_BY_DEFAULT_ENABLED`. Even in the case where we switch the conf value, that will be a one-liner with no test case changes.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Pass the Jenkins with the existing tests.

Closes #27946 from dongjoon-hyun/SPARK-EXPLICIT-TEST.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-20 12:28:57 +08:00
Takeshi Yamamuro ca499e9409
[SPARK-25121][SQL] Supports multi-part table names for broadcast hint resolution
### What changes were proposed in this pull request?

This pr fixed code to respect a database name for broadcast table hint resolution.
Currently, spark ignores a database name in multi-part names;
```
scala> sql("CREATE DATABASE testDb")
scala> spark.range(10).write.saveAsTable("testDb.t")

// without this patch
scala> spark.range(10).join(spark.table("testDb.t"), "id").hint("broadcast", "testDb.t").explain
== Physical Plan ==
*(2) Project [id#24L]
+- *(2) BroadcastHashJoin [id#24L], [id#26L], Inner, BuildLeft
   :- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false]))
   :  +- *(1) Range (0, 10, step=1, splits=4)
   +- *(2) Project [id#26L]
      +- *(2) Filter isnotnull(id#26L)
         +- *(2) FileScan parquet testdb.t[id#26L] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/Users/maropu/Repositories/spark/spark-2.3.1-bin-hadoop2.7/spark-warehouse..., PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: struct<id:bigint>

// with this patch
scala> spark.range(10).join(spark.table("testDb.t"), "id").hint("broadcast", "testDb.t").explain
== Physical Plan ==
*(2) Project [id#3L]
+- *(2) BroadcastHashJoin [id#3L], [id#5L], Inner, BuildRight
   :- *(2) Range (0, 10, step=1, splits=4)
   +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, true]))
      +- *(1) Project [id#5L]
         +- *(1) Filter isnotnull(id#5L)
            +- *(1) FileScan parquet testdb.t[id#5L] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/Users/maropu/Repositories/spark/spark-master/spark-warehouse/testdb.db/t], PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: struct<id:bigint>
```

This PR comes from https://github.com/apache/spark/pull/22198

### Why are the changes needed?

For better usability.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Added unit tests.

Closes #27935 from maropu/SPARK-25121-2.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-03-19 20:11:04 -07:00
Wenchen Fan ac262cb272 [SPARK-30292][SQL][FOLLOWUP] ansi cast from strings to integral numbers (byte/short/int/long) should fail with fraction
### What changes were proposed in this pull request?

This is a followup of https://github.com/apache/spark/pull/26933

A fraction string like "1.23" is definitely not a valid integral format, and we should fail the cast under ANSI mode.
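
A hypothetical illustration, assuming ANSI mode is toggled via `spark.sql.ansi.enabled`:

```scala
spark.conf.set("spark.sql.ansi.enabled", "true")
// Under ANSI mode this cast should now fail instead of silently producing 1.
spark.sql("SELECT CAST('1.23' AS int)").show()
```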

### Why are the changes needed?

correct the ANSI cast behavior from string to integral

### Does this PR introduce any user-facing change?

Yes under ANSI mode, but ANSI mode is off by default.

### How was this patch tested?

new test

Closes #27957 from cloud-fan/ansi.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-03-20 00:52:09 +09:00
Maxim Gekk bb295d80e3 [SPARK-31159][SQL] Rebase date/timestamp from/to Julian calendar in parquet
### What changes were proposed in this pull request?
The PR addresses the issue of compatibility with Spark 2.4 and earlier version in reading/writing dates and timestamp via Parquet datasource. Previous releases are based on a hybrid calendar - Julian + Gregorian. Since Spark 3.0, Proleptic Gregorian calendar is used by default, see SPARK-26651. In particular, the issue pops up for dates/timestamps before 1582-10-15 when the hybrid calendar switches from/to Gregorian to/from Julian calendar. The same local date in different calendar is converted to different number of days since the epoch 1970-01-01. For example, the 1001-01-01 date is converted to:
- -719164 in Julian calendar. Spark 2.4 saves the number as a value of DATE type into parquet.
- -719162 in Proleptic Gregorian calendar. Spark 3.0 saves the number as a date value.

According to the parquet spec, parquet timestamps of the `TIMESTAMP_MILLIS`, `TIMESTAMP_MICROS` output type and parquet dates should be based on Proleptic Gregorian calendar but the `INT96` timestamps should be stored as Julian days. Since the version 3.0, Spark conforms the spec but for the backward compatibility with previous version, the PR proposes rebasing from/to Proleptic Gregorian calendar to the hybrid one under the SQL config:
```
spark.sql.legacy.parquet.rebaseDateTime.enabled
```
which is set to `false` by default which means the rebasing is not performed by default.

The details of the implementation:
1. Added 2 methods to `DateTimeUtils` for rebasing microseconds. `rebaseGregorianToJulianMicros()` builds a local timestamp in Proleptic Gregorian calendar, extracts date-time fields `year`, `month`, ..., `second fraction` from the local timestamp and uses them to build another local timestamp based on the hybrid calendar (using `java.util.Calendar` API). After that it calculates the number of microseconds since the epoch using the resulted local timestamp. The function performs the conversion via the system JVM time zone for compatibility with Spark 2.4 and earlier versions. The `rebaseJulianToGregorianMicros()` function does reverse conversion.
2. Added 2 methods to `DateTimeUtils` for rebasing days. `rebaseGregorianToJulianDays()` builds a local date from the passed number of days since the epoch in the Proleptic Gregorian calendar, interprets the resulting date as a local date in the hybrid calendar, and gets the number of days since the epoch from the resulting local date. The conversion is performed via the `UTC` time zone because the conversion is independent of time zones, and `UTC` is selected to avoid rounding issues when casting days to milliseconds and back. The `rebaseJulianToGregorianDays()` function does the reverse conversion.
3. Use `rebaseGregorianToJulianMicros()` and `rebaseGregorianToJulianDays()` while saving timestamps/dates to parquet files if the SQL config is on.
4. Use `rebaseJulianToGregorianMicros()` and `rebaseJulianToGregorianDays()` while loading timestamps/dates from parquet files if the SQL config is on.
5. The SQL config `spark.sql.legacy.parquet.rebaseDateTime.enabled` controls conversions from/to dates, timestamps of `TIMESTAMP_MILLIS`, `TIMESTAMP_MICROS`, see the SQL config `spark.sql.parquet.outputTimestampType`.
6. The rebasing is always performed for `INT96` timestamps, independently from `spark.sql.legacy.parquet.rebaseDateTime.enabled`.
7. Supported the vectorized parquet reader, see the SQL config `spark.sql.parquet.enableVectorizedReader`.

### Why are the changes needed?
- For the backward compatibility with Spark 2.4 and earlier versions. The changes allow users to read dates/timestamps saved by previous version, and get the same result. Also after the changes, users can enable the rebasing in write, and save dates/timestamps that can be loaded correctly by Spark 2.4 and earlier versions.
- It fixes the bug of incorrect saving/loading timestamps of the `INT96` type

### Does this PR introduce any user-facing change?
Yes, the timestamp `1001-01-01 01:02:03.123456` saved by Spark 2.4.5 as `TIMESTAMP_MICROS` is interpreted by Spark 3.0.0-preview2 differently:
```scala
scala> spark.read.parquet("/Users/maxim/tmp/before_1582/2_4_5_ts_micros").show(false)
+--------------------------+
|ts                        |
+--------------------------+
|1001-01-07 11:32:20.123456|
+--------------------------+
```
After the changes:
```scala
scala> spark.conf.set("spark.sql.legacy.parquet.rebaseDateTime.enabled", true)

scala> spark.read.parquet("/Users/maxim/tmp/before_1582/2_4_5_ts_micros").show(false)
+--------------------------+
|ts                        |
+--------------------------+
|1001-01-01 01:02:03.123456|
+--------------------------+
```

### How was this patch tested?
1. Added tests to `ParquetIOSuite` to check rebasing in read for regular reader and vectorized parquet reader. The test reads back parquet files saved by Spark 2.4.5 via:
```shell
$ export TZ="America/Los_Angeles"
```
```scala
scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
scala> val df = Seq("1001-01-01").toDF("dateS").select($"dateS".cast("date").as("date"))
df: org.apache.spark.sql.DataFrame = [date: date]
scala> df.write.parquet("/Users/maxim/tmp/before_1582/2_4_5_date")

scala> val df = Seq("1001-01-01 01:02:03.123456").toDF("tsS").select($"tsS".cast("timestamp").as("ts"))
df: org.apache.spark.sql.DataFrame = [ts: timestamp]

scala> spark.conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS")
scala> df.write.parquet("/Users/maxim/tmp/before_1582/2_4_5_ts_micros")

scala> spark.conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MILLIS")
scala> df.write.parquet("/Users/maxim/tmp/before_1582/2_4_5_ts_millis")

scala> spark.conf.set("spark.sql.parquet.outputTimestampType", "INT96")
scala> df.write.parquet("/Users/maxim/tmp/before_1582/2_4_5_ts_int96")
```
2. Manually check the write code path. Save date/timestamps (TIMESTAMP_MICROS, TIMESTAMP_MILLIS, INT96) by Spark 3.1.0-SNAPSHOT (after the changes):
```bash
$ export TZ="America/Los_Angeles"
```
```scala
scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
scala> spark.conf.set("spark.sql.legacy.parquet.rebaseDateTime.enabled", true)
scala> spark.conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS")
scala> val df = Seq(("1001-01-01", "1001-01-01 01:02:03.123456")).toDF("dateS", "tsS").select($"dateS".cast("date").as("d"), $"tsS".cast("timestamp").as("ts"))
df: org.apache.spark.sql.DataFrame = [d: date, ts: timestamp]
scala> df.write.parquet("/Users/maxim/tmp/before_1582/3_0_0_micros")
scala> spark.read.parquet("/Users/maxim/tmp/before_1582/3_0_0_micros").show(false)
+----------+--------------------------+
|d         |ts                        |
+----------+--------------------------+
|1001-01-01|1001-01-01 01:02:03.123456|
+----------+--------------------------+
```
Read the saved date/timestamp by Spark 2.4.5:
```scala
scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles")
scala> spark.read.parquet("/Users/maxim/tmp/before_1582/3_0_0_micros").show(false)
+----------+--------------------------+
|d         |ts                        |
+----------+--------------------------+
|1001-01-01|1001-01-01 01:02:03.123456|
+----------+--------------------------+
```

Closes #27915 from MaxGekk/rebase-parquet-datetime.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-19 12:49:51 +08:00
Wenchen Fan 8643e5d9c5 [SPARK-31171][SQL][FOLLOWUP] update document
### What changes were proposed in this pull request?

A followup of https://github.com/apache/spark/pull/27936 to update document.

### Why are the changes needed?

correct document

### Does this PR introduce any user-facing change?

no

### How was this patch tested?

N/A

Closes #27950 from cloud-fan/null.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-03-19 07:29:31 +09:00
Kent Yao 3d695954e5 [SPARK-31150][SQL][FOLLOWUP] handle ' as escape for text
### What changes were proposed in this pull request?

pattern `''` means literal `'`

```sql
select date_format(to_timestamp("11111904-01-23 15:02:01", 'y-MM-dd HH:mm:ss'), "y-MM-dd HH:mm:ss''SSSSSSSSS");
5377-02-14 06:27:19'000000519
```
0946a9514f missed this case, and this PR adds it back.

### Why are the changes needed?

bugfix

### Does this PR introduce any user-facing change?

no
### How was this patch tested?

add ut

Closes #27949 from yaooqinn/SPARK-31150-2.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-03-19 07:27:06 +09:00
yi.wu 8bfaa62f2f [SPARK-31175][SQL] Avoid creating reverse comparator for each compare in InterpretedOrdering
### What changes were proposed in this pull request?

Prepend `-` to the comparison result instead of creating a new reverse comparator for each comparison when sorting in DESC order in InterpretedOrdering.
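
A simplified sketch of the idea (the names below are assumptions; the real `InterpretedOrdering` code differs):

```scala
sealed trait SortDirection
case object Ascending extends SortDirection
case object Descending extends SortDirection

def compareWith[T](ordering: Ordering[T], direction: SortDirection, left: T, right: T): Int =
  direction match {
    case Ascending =>
      ordering.compare(left, right)
    case Descending =>
      // Before: ordering.reverse.compare(left, right) allocated a new reverse
      // Ordering object on every call. Negating the ascending result avoids it.
      -ordering.compare(left, right)
  }
```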

### Why are the changes needed?

Currently, we create a new reverse comparator for each comparison in InterpretedOrdering, which can generate lots of small, short-lived objects and hurt the JVM when there is plenty of data.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Pass Jenkins.

Closes #27938 from Ngone51/reverse_comparator.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-18 23:56:48 +08:00
Kent Yao 57fcc49306 [SPARK-31176][SQL] Remove support for 'e'/'c' as datetime pattern charactar
### What changes were proposed in this pull request?

The meaning of 'u' was day number of the week in SimpleDateFormat; it was changed to year in DateTimeFormatter. Now we keep the old meaning of 'u' by substituting 'u' with 'e' internally and using DateTimeFormatter to parse the pattern string. In DateTimeFormatter, 'e' and 'c' also represent day-of-week, e.g.

```sql
select date_format(timestamp '2019-10-06', 'yyyy-MM-dd uuuu');
select date_format(timestamp '2019-10-06', 'yyyy-MM-dd uuee');
select date_format(timestamp '2019-10-06', 'yyyy-MM-dd eeee');
```
Because of the substitution, they all silently become `.... eeee`. Users may be confused about their meanings, so we should mark them as illegal pattern characters to stay the same as before.

This PR moves the method `convertIncompatiblePattern` from `DateTimeUtils` to the `DateTimeFormatterHelper` object, since it is quite specific to the `DateTimeFormatterHelper` class.
It also adds 'e' and 'c' character checking in this method.

Besides, `convertIncompatiblePattern` has a bug that loses the last `'` if the pattern ends with it; this PR fixes that too, e.g.

```sql
spark-sql> select date_format(timestamp "2019-10-06", "yyyy-MM-dd'S'");
20/03/18 11:19:45 ERROR SparkSQLDriver: Failed in [select date_format(timestamp "2019-10-06", "yyyy-MM-dd'S'")]
java.lang.IllegalArgumentException: Pattern ends with an incomplete string literal: uuuu-MM-dd'S

spark-sql> select to_timestamp("2019-10-06S", "yyyy-MM-dd'S'");
NULL
```
### Why are the changes needed?

avoid vagueness
bug fix

### Does this PR introduce any user-facing change?

No, these are not exposed yet.

### How was this patch tested?

add ut

Closes #27939 from yaooqinn/SPARK-31176.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-18 20:19:50 +08:00
Kent Yao f1d27cdd91 [SPARK-31119][SQL] Add interval value support for extract expression as extract source
### What changes were proposed in this pull request?

```
<extract expression> ::= EXTRACT <left paren> <extract field> FROM <extract source> <right paren>

<extract source> ::= <datetime value expression> | <interval value expression>
```
We currently only support datetime values as the extract source for the `extract` expression, but its alternative function `date_part` supports both datetime and interval values.

This PR adds interval value support for the `extract` expression as the extract source, as illustrated below.
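
Hypothetical examples of the new capability (for illustration only; the expected results are assumptions based on the description above):

```scala
// Extract a field directly from an interval value; previously only datetime
// values were accepted as the extract source.
spark.sql("SELECT extract(DAY FROM interval 3 days 2 hours)").show()   // expected: 3
// date_part already accepted intervals; extract now behaves consistently with it.
spark.sql("SELECT date_part('HOUR', interval 3 days 2 hours)").show()  // expected: 2
```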

### Why are the changes needed?

For ANSI compliance and the semantic consistency between extract and `date_part`, we support intervals for extract expressions.

### Does this PR introduce any user-facing change?

Yes, in the `extract(abc from xyz)` expression, `xyz` can be an interval.

### How was this patch tested?

add unit tests

Closes #27876 from yaooqinn/SPARK-31119.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-18 12:29:39 +08:00
Wenchen Fan dc5ebc2d5b
[SPARK-31171][SQL] size(null) should return null under ansi mode
### What changes were proposed in this pull request?

Make `size(null)` return null under ANSI mode, regardless of the `spark.sql.legacy.sizeOfNull` config.

### Why are the changes needed?

In https://github.com/apache/spark/pull/27834, we change the result of `size(null)` to be -1 to match the 2.4 behavior and avoid breaking changes.

However, it's true that the "return -1" behavior is error-prone when used with aggregate functions. The current ANSI mode controls a bunch of "better behaviors" like failing on overflow. We don't enable these "better behaviors" by default because they are too breaking. The "return null" behavior of `size(null)` is a good fit for ANSI mode.
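A sketch of the intended behavior (illustrative; output comments describe the intent):

```scala
// With ANSI mode on, size(null) returns NULL regardless of spark.sql.legacy.sizeOfNull.
spark.conf.set("spark.sql.ansi.enabled", "true")
spark.sql("SELECT size(cast(NULL AS array<int>))").show()   // NULL under ANSI mode, -1 otherwise
```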

### Does this PR introduce any user-facing change?

No as ANSI mode is off by default.

### How was this patch tested?

new tests

Closes #27936 from cloud-fan/null.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-03-17 11:48:54 -07:00
Kent Yao 0946a9514f [SPARK-31150][SQL] Parsing seconds fraction with variable length for timestamp
### What changes were proposed in this pull request?
This PR is to support parsing timestamp values with variable length second fraction parts.

e.g. 'yyyy-MM-dd HH:mm:ss.SSSSSS[zzz]' can parse timestamps whose second fraction has 0 to 6 digits, but fails for 7 or more.
```sql
select to_timestamp(v, 'yyyy-MM-dd HH:mm:ss.SSSSSS[zzz]') from values
 ('2019-10-06 10:11:12.'),
 ('2019-10-06 10:11:12.0'),
 ('2019-10-06 10:11:12.1'),
 ('2019-10-06 10:11:12.12'),
 ('2019-10-06 10:11:12.123UTC'),
 ('2019-10-06 10:11:12.1234'),
 ('2019-10-06 10:11:12.12345CST'),
 ('2019-10-06 10:11:12.123456PST') t(v)
2019-10-06 03:11:12.123
2019-10-06 08:11:12.12345
2019-10-06 10:11:12
2019-10-06 10:11:12
2019-10-06 10:11:12.1
2019-10-06 10:11:12.12
2019-10-06 10:11:12.1234
2019-10-06 10:11:12.123456

select to_timestamp('2019-10-06 10:11:12.1234567PST', 'yyyy-MM-dd HH:mm:ss.SSSSSS[zzz]')
NULL
```
Since 3.0, we use the Java 8 time API to parse and format timestamp values. When we create the `DateTimeFormatter`, we use `appendPattern` to build the formatter first, where the 'S..S' part is parsed with a fixed length (= `'S..S'.length`). This fits the formatting side but is too strict for parsing, because the trailing zeros are very likely to be truncated.
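A minimal sketch of how variable-length second fractions can be parsed with plain java.time (illustrative; not necessarily the exact builder Spark constructs internally):

```scala
import java.time.LocalDateTime
import java.time.format.DateTimeFormatterBuilder
import java.time.temporal.ChronoField

val formatter = new DateTimeFormatterBuilder()
  .appendPattern("yyyy-MM-dd HH:mm:ss")
  .appendFraction(ChronoField.NANO_OF_SECOND, 0, 6, true)  // accept 0 to 6 fraction digits
  .toFormatter()

LocalDateTime.parse("2019-10-06 10:11:12.12345", formatter)  // parses even with only 5 fraction digits
```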

### Why are the changes needed?

Improve timestamp parsing and make it more compatible with 2.4.x.

### Does this PR introduce any user-facing change?

no, the related changes are newly added
### How was this patch tested?

add uts

Closes #27906 from yaooqinn/SPARK-31150.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-17 21:53:46 +08:00
Wenchen Fan d7b97a1d0d [SPARK-31166][SQL] UNION map<null, null> and other maps should not fail
### What changes were proposed in this pull request?

After https://github.com/apache/spark/pull/27542, `map()` returns `map<null, null>` instead of `map<string, string>`. However, this breaks queries which union `map()` and other maps.

The reason is that the `TypeCoercion` rules and `Cast` consider it illegal to cast a null-type map key to other types, because it makes the key nullable, but it's actually legal. This PR fixes it.
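An illustrative query of the kind this fix keeps working (assuming a SparkSession named `spark`):

```scala
// map() now infers map<null,null>; its null key/value type should be castable to the other side's types.
spark.sql("SELECT map() AS m UNION ALL SELECT map(1, 'a')").show()
```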

### Why are the changes needed?

To avoid breaking queries.

### Does this PR introduce any user-facing change?

Yes, now some queries that work in 2.x can work in 3.0 as well.

### How was this patch tested?

new test

Closes #27926 from cloud-fan/bug.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-17 12:01:29 +08:00
HyukjinKwon 6704103499
[SPARK-31146][SQL] Leverage the helper method for aliasing in built-in SQL expressions
### What changes were proposed in this pull request?

This PR is kind of a followup of #26808. It leverages the helper method for aliasing in built-in SQL expressions to use the alias as its output column name where it's applicable.

- `Expression`, `UnaryMathExpression` and `BinaryMathExpression` search the alias in the tags by default.
- When the naming is different in its implementation, it has to be overwritten for the expression specifically. E.g., `CallMethodViaReflection`, `Remainder`, `CurrentTimestamp`,
`FormatString` and `XPathDouble`.

This PR fixes the aliases of the functions below:

| class                    | alias            |
|--------------------------|------------------|
|`Rand`                    |`random`          |
|`Ceil`                    |`ceiling`         |
|`Remainder`               |`mod`             |
|`Pow`                     |`pow`             |
|`Signum`                  |`sign`            |
|`Chr`                     |`char`            |
|`Length`                  |`char_length`     |
|`Length`                  |`character_length`|
|`FormatString`            |`printf`          |
|`Substring`               |`substr`          |
|`Upper`                   |`ucase`           |
|`XPathDouble`             |`xpath_number`    |
|`DayOfMonth`              |`day`             |
|`CurrentTimestamp`        |`now`             |
|`Size`                    |`cardinality`     |
|`Sha1`                    |`sha`             |
|`CallMethodViaReflection` |`java_method`     |

Note: the `EqualTo`, `=` and `==` aliases were excluded because this helper method cannot be leveraged for them; the parser would need to be fixed instead.

Note: this PR also excludes some instances, such as `ToDegrees`, `ToRadians`, `UnaryMinus` and `UnaryPositive`, that need an explicit name override, to keep the scope of this PR smaller.

### Why are the changes needed?

To respect expression name.

### Does this PR introduce any user-facing change?

Yes, it will change the output column name.

### How was this patch tested?

Manually tested, and unittests were added.

Closes #27901 from HyukjinKwon/31146.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-03-16 11:22:34 -07:00
Wenchen Fan 50a29672e0 [SPARK-30958][SQL] do not set default era for DateTimeFormatter
### What changes were proposed in this pull request?

It's not needed at all as now we replace "y" with "u" if there is no "G". So the era is either explicitly specified (e.g. "yyyy G") or can be inferred from the year (e.g. "uuuu").
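A small plain-java.time illustration of the point (independent of Spark's internals):

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter

// "u" is the proleptic year, so the sign already implies the era; no default era is needed.
LocalDate.parse("-0044-03-15", DateTimeFormatter.ofPattern("uuuu-MM-dd"))   // proleptic year -44, i.e. 45 BC
// "y" is year-of-era and would need an explicit "G" field (e.g. "yyyy-MM-dd G") to express BC dates.
```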

### Why are the changes needed?

By default we use "uuuu" as the year pattern, which indicates the era already. If we set a default era, it can get conflicted and fail the parsing.

### Does this PR introduce any user-facing change?

Yes, now Spark can parse dates/timestamps with a negative year via the 'yyyy' pattern, which is converted to 'uuuu' under the hood.

### How was this patch tested?

new tests

Closes #27707 from cloud-fan/bug.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-03-16 16:48:31 +09:00
Wenchen Fan b27b3c91f1 [SPARK-31090][SPARK-25457] Revert "IntegralDivide returns data type of the operands"
### What changes were proposed in this pull request?

This reverts commit 47d6e80a2e.

### Why are the changes needed?

There is no standard requiring that `div` must return the type of the operand, and always returning long type looks fine. This is kind of a cosmetic change and we should avoid it if it breaks existing queries. This is similar to reverting TRIM function parameter order change.

### Does this PR introduce any user-facing change?

Yes, change the behavior of `div` back to be the same as 2.4.
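Illustrative examples of the restored behavior (result comments describe the intent):

```scala
spark.sql("SELECT 5 div 2").show()                   // 2, returned as bigint
spark.sql("SELECT CAST(5 AS TINYINT) div 2").show()  // still bigint, not the operand type
```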

### How was this patch tested?

N/A

Closes #27835 from cloud-fan/revert2.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-03-13 10:47:36 +09:00
Kent Yao 7b4b29e8d9
[SPARK-31131][SQL] Remove the unnecessary config spark.sql.legacy.timeParser.enabled
### What changes were proposed in this pull request?

`spark.sql.legacy.timeParser.enabled` should be removed from SQLConf and the migration guide; `spark.sql.legacy.timeParserPolicy` is the right one.

### Why are the changes needed?

fix doc

### Does this PR introduce any user-facing change?

no
### How was this patch tested?

Pass the jenkins

Closes #27889 from yaooqinn/SPARK-31131.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-03-12 09:24:49 -07:00
Wenchen Fan 77c49cb702 [SPARK-31124][SQL] change the default value of minPartitionNum in AQE
### What changes were proposed in this pull request?

AQE has a perf regression when using the default settings: if we coalesce the shuffle partitions into one or few partitions, we may leave many CPU cores idle and the perf is worse than with AQE off (which leverages all CPU cores).

Technically, this is not a bad thing. If there are many queries running at the same time, it's better to coalesce shuffle partitions into fewer partitions. However, the default settings of AQE should try to avoid any perf regression as much as possible.

This PR changes the default value of minPartitionNum when coalescing shuffle partitions, to be `SparkContext.defaultParallelism`, so that AQE can leverage all the CPU cores.

### Why are the changes needed?

avoid AQE perf regression

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

existing tests

Closes #27879 from cloud-fan/aqe.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-12 21:28:24 +08:00
yi.wu feb9b9e771 [SPARK-31010][SQL][FOLLOW-UP] Give an example for typed Scala UDF in error message
### What changes were proposed in this pull request?

Add an example of a typed Scala UDF to the error message.

### Why are the changes needed?

Help users know how to migrate to typed Scala UDFs.
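A minimal example of the typed Scala UDF style the message points to (the usage here is illustrative, not the PR's exact wording):

```scala
import org.apache.spark.sql.functions.{col, udf}

// Typed Scala UDF: the input/output types are known to Spark.
val plusOne = udf((x: Int) => x + 1)
spark.range(3).select(plusOne(col("id").cast("int"))).show()
```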

### Does this PR introduce any user-facing change?

No, it's a new error message in Spark 3.0.

### How was this patch tested?

Pass Jenkins.

Closes #27884 from Ngone51/spark_31010_followup.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-03-12 21:16:02 +09:00
Wenchen Fan 8efb71013d
[SPARK-31091] Revert SPARK-24640 Return NULL from size(NULL) by default
### What changes were proposed in this pull request?

This PR reverts https://github.com/apache/spark/pull/26051 and https://github.com/apache/spark/pull/26066

### Why are the changes needed?

There is no standard requiring that `size(null)` must return null, and returning -1 looks reasonable as well. This is kind of a cosmetic change and we should avoid it if it breaks existing queries. This is similar to reverting TRIM function parameter order change.

### Does this PR introduce any user-facing change?

Yes, change the behavior of `size(null)` back to be the same as 2.4.

### How was this patch tested?

N/A

Closes #27834 from cloud-fan/revert.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-03-11 09:55:24 -07:00
Wenchen Fan 5be0d04f16 [SPARK-31117][SQL][TEST] reduce the test time of DateTimeUtilsSuite
### What changes were proposed in this pull request?

`DateTimeUtilsSuite.daysToMicros and microsToDays` takes 30 seconds, which is too long for a UT.

This PR changes the test to check random data, to reduce testing time. Now this test takes 1 second.

### Why are the changes needed?

make test faster

### Does this PR introduce any user-facing change?

no

### How was this patch tested?

N/A

Closes #27873 from cloud-fan/test.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-11 23:47:13 +08:00
Maxim Gekk 3d3e366aa8 [SPARK-31076][SQL] Convert Catalyst's DATE/TIMESTAMP to Java Date/Timestamp via local date-time
### What changes were proposed in this pull request?
In the PR, I propose to change conversion of java.sql.Timestamp/Date values to/from internal values of Catalyst's TimestampType/DateType before cutover day `1582-10-15` of Gregorian calendar. I propose to construct local date-time from microseconds/days since the epoch. Take each date-time component `year`, `month`, `day`, `hour`, `minute`, `second` and `second fraction`, and construct java.sql.Timestamp/Date using the extracted components.
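A simplified sketch of the proposed conversion path (the helper name is an assumption; this is not the exact Spark implementation):

```scala
import java.sql.Timestamp
import java.time.{Instant, LocalDateTime, ZoneId}

// Extract the local date-time fields from Catalyst microseconds, then rebuild java.sql.Timestamp
// from those fields so the wall-clock components survive the calendar switch.
def microsToLegacyTimestamp(micros: Long, zone: ZoneId): Timestamp = {
  val instant = Instant.ofEpochSecond(Math.floorDiv(micros, 1000000L),
    Math.floorMod(micros, 1000000L) * 1000L)
  Timestamp.valueOf(LocalDateTime.ofInstant(instant, zone))  // Timestamp.valueOf uses the hybrid calendar
}
```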

### Why are the changes needed?
This will rebase the underlying time/date offset so that collected java.sql.Timestamp/Date values have the same local date-time components as the original values in the Gregorian calendar.

Here is the example which demonstrates the issue:
```scala
scala> sql("select date '1100-10-10'").collect()
res1: Array[org.apache.spark.sql.Row] = Array([1100-10-03])
```

### Does this PR introduce any user-facing change?
Yes, after the changes:
```scala
scala> sql("select date '1100-10-10'").collect()
res0: Array[org.apache.spark.sql.Row] = Array([1100-10-10])
```

### How was this patch tested?
By running `DateTimeUtilsSuite`, `DateFunctionsSuite` and `DateExpressionsSuite`.

Closes #27807 from MaxGekk/rebase-timestamp-before-1582.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-11 20:53:56 +08:00
Liang-Chi Hsieh 15557a7d05 [SPARK-31071][SQL] Allow annotating non-null fields when encoding Java Beans
### What changes were proposed in this pull request?

When encoding Java Beans to Spark DataFrame, respecting `javax.annotation.Nonnull` and producing non-null fields.

### Why are the changes needed?

When encoding Java Beans to a Spark DataFrame, non-primitive types are encoded as nullable fields. Although it works for most cases, it can be an issue in a few situations, e.g. the one described in the JIRA ticket, when saving a DataFrame to Avro format with a non-null field.

We should allow Spark users more flexibility when creating Spark DataFrame from Java Beans. Currently, Spark users cannot create DataFrame with non-nullable fields in the schema from beans with non-nullable properties.

Although it is possible to project top-level columns with SQL expressions like `AssertNotNull` to make it non-null, for nested fields it is more tricky to do it similarly.

### Does this PR introduce any user-facing change?

Yes. After this change, Spark users can use `javax.annotation.Nonnull` to annotate non-null fields in Java Beans when encoding beans to Spark DataFrame.

### How was this patch tested?

Added unit test.

Closes #27851 from viirya/SPARK-31071.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-11 18:27:48 +08:00
Yuanjian Li 3493162c78 [SPARK-31030][SQL] Backward Compatibility for Parsing and formatting Datetime
### What changes were proposed in this pull request?
In Spark version 2.4 and earlier, datetime parsing, formatting and conversion are performed by using the hybrid calendar (Julian + Gregorian).
Since the Proleptic Gregorian calendar is the de-facto calendar worldwide, as well as the one chosen in the ANSI SQL standard, Spark 3.0 switches to it by using Java 8 API classes (the java.time packages that are based on ISO chronology). The switching job was completed in SPARK-26651.
But after the switch, some patterns are not compatible between Java 8 and Java 7, so Spark needs its own definition of the patterns rather than depending on the Java API.
In this PR, we achieve this by writing the documentation and shadowing the incompatible letters. See more details in [SPARK-31030](https://issues.apache.org/jira/browse/SPARK-31030)

### Why are the changes needed?
For backward compatibility.

### Does this PR introduce any user-facing change?
No.
After we define our own datetime parsing and formatting patterns, the behavior is the same as in the old Spark versions.

### How was this patch tested?
Existing and new added UT.
Locally document test:
![image](https://user-images.githubusercontent.com/4833765/76064100-f6acc280-5fc3-11ea-9ef7-82e7dc074205.png)

Closes #27830 from xuanyuanking/SPARK-31030.

Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-11 14:11:13 +08:00
Kent Yao 3bd6ebff81 [SPARK-30189][SQL] Interval from year-month/date-time string should handle whitespaces
### What changes were proposed in this pull request?

Currently, we parse intervals from multi-unit strings or from date-time/year-month pattern strings; the former handles all whitespace, while the latter does not, not even spaces.

### Why are the changes needed?

behavior consistency

### Does this PR introduce any user-facing change?
Yes, an interval in the date-time/year-month form like
```
select interval '\n-\t10\t 12:34:46.789\t' day to second
-- !query 126 schema
struct<INTERVAL '-10 days -12 hours -34 minutes -46.789 seconds':interval>
-- !query 126 output
-10 days -12 hours -34 minutes -46.789 seconds
```
is valid now.

### How was this patch tested?

add ut.

Closes #26815 from yaooqinn/SPARK-30189.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-10 22:08:58 +08:00
Eric Wu 15df2a3f40 [SPARK-31079][SQL] Logging QueryExecutionMetering in RuleExecutor logger
### What changes were proposed in this pull request?
RuleExecutor already supports metering for analyzer/optimizer rules. By providing such information in `PlanChangeLogger`, users can get more information when debugging rule changes.

This PR enhances `PlanChangeLogger` to display RuleExecutor metrics. This could easily be done by calling the existing APIs `resetMetrics` and `dumpTimeSpent`, but there might be conflicts if a user is also collecting the total metrics of a SQL job. Thus I introduced `QueryExecutionMetrics`, as a snapshot of `QueryExecutionMetering`, to better support this feature.

Information added to `PlanChangeLogger`
```
=== Metrics of Executed Rules ===
Total number of runs: 554
Total time: 0.107756568 seconds
Total number of effective runs: 11
Total time of effective runs: 0.047615486 seconds
```

### Why are the changes needed?
Provide better plan change debugging user experience

### Does this PR introduce any user-facing change?
Only add more debugging info of `planChangeLog`, default log level is TRACE.

### How was this patch tested?
Update existing tests to verify the new logs

Closes #27846 from Eric5553/ExplainRuleExecMetrics.

Authored-by: Eric Wu <492960551@qq.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-10 19:08:59 +08:00
HyukjinKwon 815c7929c2
[SPARK-31065][SQL] Match schema_of_json to the schema inference of JSON data source
### What changes were proposed in this pull request?

This PR proposes two things:

1. Convert `null` to `string` type during schema inference of `schema_of_json` as the JSON datasource does. This is a bug fix as well, because the `null` string is not a proper DDL-formatted string and the SQL parser is unable to recognise it as a type string. We should match the JSON datasource and return a string type so that `schema_of_json` returns a proper DDL-formatted string.

2. Let `schema_of_json` respect `dropFieldIfAllNull` option during schema inference.

### Why are the changes needed?

To let `schema_of_json` return a proper DDL formatted string, and respect `dropFieldIfAllNull` option.

### Does this PR introduce any user-facing change?
Yes, it does.

```scala
import collection.JavaConverters._
import org.apache.spark.sql.functions._

spark.range(1).select(schema_of_json(lit("""{"id": ""}"""))).show()
spark.range(1).select(schema_of_json(lit("""{"id": "a", "drop": {"drop": null}}"""), Map("dropFieldIfAllNull" -> "true").asJava)).show(false)
```

**Before:**

```
struct<id:null>
struct<drop:struct<drop:null>,id:string>
```

**After:**

```
struct<id:string>
struct<id:string>
```

### How was this patch tested?

Manually tested, and unittests were added.

Closes #27854 from HyukjinKwon/SPARK-31065.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-03-10 00:33:32 -07:00
Yuchen Huo a22994333a [SPARK-30902][SQL][FOLLOW-UP] Allow ReplaceTableAsStatement to have none provider
### What changes were proposed in this pull request?

This is a follow-up to https://github.com/apache/spark/pull/27650, which allows a None provider for create table. Here we are doing the same thing for ReplaceTable.

### Why are the changes needed?

Although the ASTBuilder currently doesn't seem to allow `replace` without a `USING` clause, this would allow `DataFrameWriterV2` to use the statements instead of commands directly.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Existing tests

Closes #27838 from yuchenhuo/SPARK-30902.

Authored-by: Yuchen Huo <yuchen.huo@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-10 11:37:31 +08:00
Wenchen Fan 1aa184763a
[SPARK-31053][SQL] mark connector APIs as Evolving
### What changes were proposed in this pull request?

The newly added catalog APIs are marked as Experimental but other DS v2 APIs are marked as Evolving.

This PR makes it consistent and mark all Connector APIs as Evolving.

### Why are the changes needed?

For consistency.

### Does this PR introduce any user-facing change?

no

### How was this patch tested?

N/A

Closes #27811 from cloud-fan/tag.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-03-08 11:41:09 -07:00
beliefer f8a3730fd7 [SPARK-30841][SQL][DOC][FOLLOW-UP] Add version information to the configuration of SQL
### What changes were proposed in this pull request?
This PR follows https://github.com/apache/spark/pull/27691, https://github.com/apache/spark/pull/27730 and https://github.com/apache/spark/pull/27770
I sorted out some information show below.

Item name | Since version | JIRA ID | Commit ID | Note
-- | -- | -- | -- | --
spark.sql.redaction.options.regex | 2.2.2 | SPARK-23850 | 6a55d8b03053e616dcacb79cd2c29a06d219dc32#diff-9a6b543db706f1a90f790783d6930a13 |
spark.sql.redaction.string.regex | 2.3.0 | SPARK-22791 | 28315714ddef3ddcc192375e98dd5207cf4ecc98#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.function.concatBinaryAsString | 2.3.0 | SPARK-22771 | f2b3525c17d660cf6f082bbafea8632615b4f58e#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.function.eltOutputAsString | 2.3.0 | SPARK-22937 | bf853018cabcd3b3abf84bfe534d2981020b4a71#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.sources.validatePartitionColumns | 3.0.0 | SPARK-26263 | 5a140b7844936cf2b65f08853b8cfd8c499d4f13#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.streaming.continuous.epochBacklogQueueSize | 3.0.0 | SPARK-24063 | c4bbfd177b4e7cb46f47b39df9fd71d2d9a12c6d#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.streaming.continuous.executorQueueSize | 2.3.0 | SPARK-22789 | 8941a4abcada873c26af924e129173dc33d66d71#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.streaming.continuous.executorPollIntervalMs | 2.3.0 | SPARK-22789 | 8941a4abcada873c26af924e129173dc33d66d71#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.sources.useV1SourceList | 3.0.0 | SPARK-28747 | cb06209fc908bac6ce6a8f20653865489773cbc3#diff-9a6b543db706f1a90f790783d6930a13 |
spark.sql.streaming.disabledV2Writers | 2.3.1 | SPARK-23196 | 588b9694c1967ff45774431441e84081ee6eb515#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.streaming.disabledV2MicroBatchReaders | 2.4.0 | SPARK-23362 | 0a73aa31f41c83503d5d99eff3c9d7b406014ab3#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.sources.partitionOverwriteMode | 2.3.0 | SPARK-20236 | b96248862589bae1ddcdb14ce4c802789a001306#diff-9a6b543db706f1a90f790783d6930a13 |
spark.sql.storeAssignmentPolicy | 3.0.0 | SPARK-28730 | 895c90b582cc2b2667241f66d5b733852aeef9eb#diff-9a6b543db706f1a90f790783d6930a13 |
spark.sql.ansi.enabled | 3.0.0 | SPARK-30125 | d9b30694122f8716d3acb448638ef1e2b96ebc7a#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.execution.sortBeforeRepartition | 2.1.4 | SPARK-23207 and SPARK-22905 and SPARK-24564 and SPARK-25114 | 4d2d3d47e00e78893b1ecd5a9a9070adc5243ac9#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.optimizer.nestedSchemaPruning.enabled | 2.4.1 | SPARK-4502 | dfcff38394929970fee454c69864d0e10d59f8d4#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.optimizer.serializer.nestedSchemaPruning.enabled | 3.0.0 | SPARK-26837 | 0f2c0b53e8fb18c86c67b5dd679c006db93f94a5#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.optimizer.expression.nestedPruning.enabled | 3.0.0 | SPARK-27707 | 127bc899ae78d73332a87f0972b5db3c9936c1f1#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.execution.topKSortFallbackThreshold | 2.4.0 | SPARK-24193 | 8a837bf4f3f2758f7825d2362cf9de209026651a#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.csv.parser.columnPruning.enabled | 2.4.0 | SPARK-24244 and SPARK-24368 | 64fad0b519cf35b8c0a0dec18dd3df9488a5ed25#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.repl.eagerEval.enabled | 2.4.0 | SPARK-24215 | 6a0b77a55d53e74ac0a0892556c3a7a933474948#diff-9a6b543db706f1a90f790783d6930a13 |
spark.sql.repl.eagerEval.maxNumRows | 2.4.0 | SPARK-24215 | 6a0b77a55d53e74ac0a0892556c3a7a933474948#diff-9a6b543db706f1a90f790783d6930a13 |
spark.sql.repl.eagerEval.truncate | 2.4.0 | SPARK-24215 | 6a0b77a55d53e74ac0a0892556c3a7a933474948#diff-9a6b543db706f1a90f790783d6930a13 |
spark.sql.codegen.aggregate.fastHashMap.capacityBit | 2.4.0 | SPARK-24978 | 6193a202aab0271b4532ee4b740318290f2c44a1#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.avro.compression.codec | 2.4.0 | SPARK-24881 | 0a0f68bae6c0a1bf30184b1e9ac6bf3805bd7511#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.avro.deflate.level | 2.4.0 | SPARK-24881 | 0a0f68bae6c0a1bf30184b1e9ac6bf3805bd7511#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.legacy.sizeOfNull | 2.4.0 | SPARK-24605 | d08f53dc61f662f5291f71bcbe1a7b9f531a34d2#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.legacy.replaceDatabricksSparkAvro.enabled | 2.4.0 | SPARK-25129 | ac0174e55af2e935d41545721e9f430c942b3a0c#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.legacy.setopsPrecedence.enabled | 2.4.0 | SPARK-24966 | 73dd6cf9b558f9d752e1f3c13584344257ad7863#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.legacy.exponentLiteralAsDecimal.enabled | 3.0.0 | SPARK-29956 | 87ebfaf003fcd05a7f6d23b3ecd4661409ce5f2f#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.legacy.allowNegativeScaleOfDecimal | 3.0.0 | SPARK-30812 | b76bc0b1b8b2abd00a84f805af90ca4c5925faaa#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.legacy.createHiveTableByDefault.enabled | 3.0.0 | SPARK-30098 | 58be82ad4b98fc17e821e916e69e77a6aa36209d#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.legacy.integralDivide.returnBigint | 3.0.0 | SPARK-25457 | 47d6e80a2e64823fabb596503fb6a6cc6f51f713#diff-9a6b543db706f1a90f790783d6930a13 | Exists in branch-3.0 branch, but the pom.xml file corresponding to the commit log is 2.5.0-SNAPSHOT
spark.sql.legacy.bucketedTableScan.outputOrdering | 3.0.0 | SPARK-28595 | 469423f33887a966aaa33eb75f5e7974a0a97beb#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.legacy.parser.havingWithoutGroupByAsWhere | 2.4.1 | SPARK-25708 | 3dba5d41f1a66ae5eb08404d103284110c45a351#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.legacy.json.allowEmptyString.enabled | 3.0.0 | SPARK-25040 | d3de7568f32e298442f07b0a28b2c906de72c797#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.legacy.createEmptyCollectionUsingStringType | 3.0.0 | SPARK-30790 | 8ab6ae3ede96adb093347470a5cbbf17fe8c04e9#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.legacy.allowUntypedScalaUDF | 3.0.0 | SPARK-26580 | bc30a07ce262840c99a752db4fbd3a423f652017#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.truncateTable.ignorePermissionAcl.enabled | 2.4.6 | SPARK-30312 | 830a4ec59b86253f18eb7dfd6ed0bbe0d7920e5b#diff-9a6b543db706f1a90f790783d6930a13 |
spark.sql.legacy.dataset.nameNonStructGroupingKeyAsValue | 3.0.0 | SPARK-26085 | ab2eafb3cdc7631452650c6cac03a92629255347#diff-9a6b543db706f1a90f790783d6930a13 |
spark.sql.debug.maxToStringFields | 3.0.0 | SPARK-26066 | 81550b38e43fb20f89f529d2127575c71a54a538#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.maxPlanStringLength | 3.0.0 | SPARK-26103 | 812ad5546148d2194ab0e4230ee85b8f6a5be2fb#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.legacy.setCommandRejectsSparkCoreConfs | 3.0.0 | SPARK-26060 | 1ab3d3e474ce2e36d58aea8ad09fb61f0c73e5c5#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.datetime.java8API.enabled | 3.0.0 | SPARK-27008 | 52671d631d2a64ed1cfa0c6e01168908faf92df8#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.sources.binaryFile.maxLength | 3.0.0 | SPARK-27588 | 618d6bff71073c8c93501ab7392c3cc579730f0b#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.legacy.typeCoercion.datetimeToString.enabled | 3.0.0 | SPARK-27638 | 83d289eef492de8c7f3e5145f9bd75431608b500#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.defaultCatalog | 3.0.0 | SPARK-29753 | 942753a44beeae5f0142ceefa307e90cbc1234c5#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.catalog.$SESSION_CATALOG_NAME | 3.0.0 | SPARK-29412 | 9407fba0375675d6ee6461253f3b8230e8d67509#diff-9a6b543db706f1a90f790783d6930a13 |
spark.sql.legacy.doLooseUpcast | 3.0.0 | SPARK-30812 | b76bc0b1b8b2abd00a84f805af90ca4c5925faaa#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.legacy.ctePrecedencePolicy | 3.0.0 | SPARK-30829 | 00943be81afbca6be13e1e72b24536cd98a788d6#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.legacy.timeParserPolicy | 3.1.0 | SPARK-30668 | 7db0af578585ecaeee9fd23f8189292289b52a97#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.legacy.followThreeValuedLogicInArrayExists | 3.0.0 | SPARK-30812 | b76bc0b1b8b2abd00a84f805af90ca4c5925faaa#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.maven.additionalRemoteRepositories | 3.0.0 | SPARK-29175 | 3d7359ad4202067b26a199657b6a3e1f38be0e4d#diff-9a6b543db706f1a90f790783d6930a13 |
spark.sql.legacy.fromDayTimeString.enabled | 3.0.0 | SPARK-29864 and SPARK-29920 | e933539cdd557297daf97ff5e532a3f098896979#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.legacy.notReserveProperties | 3.0.0 | SPARK-30812 | b76bc0b1b8b2abd00a84f805af90ca4c5925faaa#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.legacy.addSingleFileInAddFile | 3.0.0 | SPARK-30234 | 8a8d1fbb10af6da481f26831cd519ef46ccbce6c#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.legacy.mssqlserver.numericMapping.enabled | 2.4.5 | SPARK-28152 | 69de7f31c37a7e0298e66cc814afc1b0aa948bbb#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.csv.filterPushdown.enabled | 3.0.0 | SPARK-30323 | 4e50f0291f032b4a5c0b46ed01fdef14e4cbb050#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.addPartitionInBatch.size | 3.0.0 | SPARK-29938 | 5ccbb38a71890b114c707279e7395d1f6284ebfd#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.legacy.timeParser.enabled | 3.0.0 | SPARK-30668 | 92f57237871400ab9d499e1174af22a867c01988#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.legacy.allowDuplicatedMapKeys | 3.0.0 | SPARK-25829 | 33329caa81827a245b84158b13234b88a4746e56#diff-9a6b543db706f1a90f790783d6930a13 |  

### Why are the changes needed?
Supplemental configuration version information.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Existing UTs

Closes #27829 from beliefer/add-version-to-sql-config-part-four.

Authored-by: beliefer <beliefer@163.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-03-08 12:33:02 +09:00
iRakson cba17e07e9 [SPARK-30899][SQL] CreateArray/CreateMap's data type should not depend on SQLConf.get
### What changes were proposed in this pull request?
Introduced a new parameter `emptyCollection` for the `CreateMap` and `CreateArray` functions to remove the dependency on SQLConf.get.

### Why are the changes needed?
This avoids the issue where the configuration changes between different phases of planning, which can silently break a query plan and lead to crashes or data corruption.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Existing UTs.

Closes #27657 from iRakson/SPARK-30899.

Authored-by: iRakson <raksonrakesh@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-06 16:45:06 +08:00
Takeshi Yamamuro 71c73d58f6 [SPARK-30279][SQL] Support 32 or more grouping attributes for GROUPING_ID
### What changes were proposed in this pull request?

This PR intends to support 32 or more grouping attributes for GROUPING_ID. In the current master, an integer overflow can occur when computing grouping IDs;
e75d9afb2f/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala (L613)

For example, the query below generates wrong grouping IDs in the master;
```

scala> val numCols = 32 // or, 31
scala> val cols = (0 until numCols).map { i => s"c$i" }
scala> sql(s"create table test_$numCols (${cols.map(c => s"$c int").mkString(",")}, v int) using parquet")
scala> val insertVals = (0 until numCols).map { _ => 1 }.mkString(",")
scala> sql(s"insert into test_$numCols values ($insertVals,3)")
scala> sql(s"select grouping_id(), sum(v) from test_$numCols group by grouping sets ((${cols.mkString(",")}), (${cols.init.mkString(",")}))").show(10, false)
scala> sql(s"drop table test_$numCols")

// numCols = 32
+-------------+------+
|grouping_id()|sum(v)|
+-------------+------+
|0            |3     |
|0            |3     | // Wrong Grouping ID
+-------------+------+

// numCols = 31
+-------------+------+
|grouping_id()|sum(v)|
+-------------+------+
|0            |3     |
|1            |3     |
+-------------+------+
```
To fix this issue, this PR changes the code to use long values for `GROUPING_ID` instead of int values.

### Why are the changes needed?

To support more cases in `GROUPING_ID`.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Added unit tests.

Closes #26918 from maropu/FixGroupingIdIssue.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-03-06 16:57:03 +09:00
Maxim Gekk 59f1e76b82 [SPARK-31020][SPARK-31023][SPARK-31025][SPARK-31044][SQL] Support foldable args by from_csv/json and schema_of_csv/json
### What changes were proposed in this pull request?
In the PR, I propose:

1. To replace matching by `Literal` in `ExprUtils.evalSchemaExpr()` to checking foldable property of the `schema` expression.
2. To replace matching by `Literal` in `ExprUtils.evalTypeExpr()` to checking foldable property of the `schema` expression.
3. To change checking of the input parameter in the `SchemaOfCsv` expression, and allow foldable `child` expression.
4. To change checking of the input parameter in the `SchemaOfJson` expression, and allow foldable `child` expression.

### Why are the changes needed?
This should improve Spark SQL UX for `from_csv`/`from_json`. Currently, Spark expects only literals:
```sql
spark-sql> select from_csv('1,Moscow', replace('dpt_org_id INT, dpt_org_city STRING', 'dpt_org_', ''));
Error in query: Schema should be specified in DDL format as a string literal or output of the schema_of_csv function instead of replace('dpt_org_id INT, dpt_org_city STRING', 'dpt_org_', '');; line 1 pos 7
spark-sql> select from_json('{"id":1, "city":"Moscow"}', replace('dpt_org_id INT, dpt_org_city STRING', 'dpt_org_', ''));
Error in query: Schema should be specified in DDL format as a string literal or output of the schema_of_json function instead of replace('dpt_org_id INT, dpt_org_city STRING', 'dpt_org_', '');; line 1 pos 7
```
and only string literals are acceptable as CSV examples by `schema_of_csv`/`schema_of_json`:
```sql
spark-sql> select schema_of_csv(concat_ws(',', 0.1, 1));
Error in query: cannot resolve 'schema_of_csv(concat_ws(',', CAST(0.1BD AS STRING), CAST(1 AS STRING)))' due to data type mismatch: The input csv should be a string literal and not null; however, got concat_ws(',', CAST(0.1BD AS STRING), CAST(1 AS STRING)).; line 1 pos 7;
'Project [unresolvedalias(schema_of_csv(concat_ws(,, cast(0.1 as string), cast(1 as string))), None)]
+- OneRowRelation
spark-sql> select schema_of_json(regexp_replace('{"item_id": 1, "item_price": 0.1}', 'item_', ''));
Error in query: cannot resolve 'schema_of_json(regexp_replace('{"item_id": 1, "item_price": 0.1}', 'item_', ''))' due to data type mismatch: The input json should be a string literal and not null; however, got regexp_replace('{"item_id": 1, "item_price": 0.1}', 'item_', '').; line 1 pos 7;
'Project [unresolvedalias(schema_of_json(regexp_replace({"item_id": 1, "item_price": 0.1}, item_, )), None)]
+- OneRowRelation
```

### Does this PR introduce any user-facing change?
Yes, after the changes users can pass any foldable string expression as the `schema` parameter to `from_csv()/from_json()`. For the example above:
```sql
spark-sql> select from_csv('1,Moscow', replace('dpt_org_id INT, dpt_org_city STRING', 'dpt_org_', ''));
{"id":1,"city":"Moscow"}
spark-sql> select from_json('{"id":1, "city":"Moscow"}', replace('dpt_org_id INT, dpt_org_city STRING', 'dpt_org_', ''));
{"id":1,"city":"Moscow"}
```
After change the `schema_of_csv`/`schema_of_json` functions accept foldable expressions, for example:
```sql
spark-sql> select schema_of_csv(concat_ws(',', 0.1, 1));
struct<_c0:double,_c1:int>
spark-sql> select schema_of_json(regexp_replace('{"item_id": 1, "item_price": 0.1}', 'item_', ''));
struct<id:bigint,price:double>
```

### How was this patch tested?
Added new test to `CsvFunctionsSuite` and to `JsonFunctionsSuite`.

Closes #27804 from MaxGekk/foldable-arg-csv-json-func.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-06 12:29:35 +08:00
Dongjoon Hyun afb84e9d37 [SPARK-30886][SQL] Deprecate two-parameter TRIM/LTRIM/RTRIM functions
### What changes were proposed in this pull request?

This PR aims to show a deprecation warning on two-parameter TRIM/LTRIM/RTRIM function usages based on the community decision.
- https://lists.apache.org/thread.html/r48b6c2596ab06206b7b7fd4bbafd4099dccd4e2cf9801aaa9034c418%40%3Cdev.spark.apache.org%3E

### Why are the changes needed?

For backward compatibility, SPARK-28093 is reverted. However, from Apache Spark 3.0.0, we should give a safe guideline to use SQL syntax instead of the esoteric function signatures.
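For illustration, the SQL syntax the guideline points to (the two-parameter function form is the one that now triggers the deprecation warning):

```scala
// Recommended ANSI TRIM syntax going forward (illustrative; results in comments describe the intent).
spark.sql("SELECT trim(BOTH 'x' FROM 'xxSparkxx')").show()     // Spark
spark.sql("SELECT trim(LEADING 'x' FROM 'xxSparkxx')").show()  // Sparkxx
```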

### Does this PR introduce any user-facing change?

Yes. This shows a directional warning.

### How was this patch tested?

Pass the Jenkins with a newly added test case.

Closes #27643 from dongjoon-hyun/SPARK-30886.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-03-05 20:09:39 -08:00
maryannxue d705d36c0c [SPARK-31045][SQL] Add config for AQE logging level
### What changes were proposed in this pull request?
This PR adds an internal config for changing the logging level of adaptive execution query plan evolution.

### Why are the changes needed?
To make AQE debugging easier.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Added UT.

Closes #27798 from maryannxue/aqe-log-level.

Authored-by: maryannxue <maryannxue@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-06 11:41:45 +08:00
beliefer d9254b26f1 [SPARK-30841][SQL][DOC][FOLLOW-UP] Add version information to the configuration of SQL
### What changes were proposed in this pull request?
This PR follows https://github.com/apache/spark/pull/27691 and https://github.com/apache/spark/pull/27730
I sorted out some information show below.

Item name | Since version | JIRA ID | Commit ID | Note
-- | -- | -- | -- | --
spark.sql.execution.useObjectHashAggregateExec | 2.2.0 | SPARK-19944 | 0ee38a39e43dd7ad9d50457e446ae36f64621a1b#diff-9a6b543db706f1a90f790783d6930a13 |
spark.sql.jsonGenerator.ignoreNullFields | 3.0.0 | SPARK-29444 | 78b0cbe265c4e8cc3d4d8bf5d734f2998c04d376#diff-9a6b543db706f1a90f790783d6930a13 |
spark.sql.streaming.fileSink.log.deletion | 2.0.0 | SPARK-14678 | 7bc948557bb6169cbeec335f8400af09375a62d3#diff-32bb9518401c0948c5ea19377b5069ab |
spark.sql.streaming.fileSink.log.compactInterval | 2.0.0 | SPARK-14678 | 7bc948557bb6169cbeec335f8400af09375a62d3#diff-32bb9518401c0948c5ea19377b5069ab |
spark.sql.streaming.fileSink.log.cleanupDelay | 2.0.0 | SPARK-14678 | 7bc948557bb6169cbeec335f8400af09375a62d3#diff-32bb9518401c0948c5ea19377b5069ab |
spark.sql.streaming.fileSource.log.deletion | 2.0.1 | SPARK-15698 | 8d8e2332ca12067817de45a8d3812928150975d0#diff-32bb9518401c0948c5ea19377b5069ab |
spark.sql.streaming.fileSource.log.compactInterval | 2.0.1 | SPARK-15698 | 8d8e2332ca12067817de45a8d3812928150975d0#diff-32bb9518401c0948c5ea19377b5069ab |
spark.sql.streaming.fileSource.log.cleanupDelay | 2.0.1 | SPARK-15698 | 8d8e2332ca12067817de45a8d3812928150975d0#diff-32bb9518401c0948c5ea19377b5069ab |
spark.sql.streaming.fileSource.schema.forceNullable | 3.0.0 | SPARK-28651 | 5bb69945e4aaf519cd10a5c5083332f618039af0#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.streaming.fileSource.cleaner.numThreads | 3.0.0 | SPARK-29876 | abf759a91e01497586b8bb6b7a314dd28fd6cff1#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.streaming.schemaInference | 2.0.0 | SPARK-15458 | 1fb7b3a0a2e3a5c5f784aab662df93fcc1449c36#diff-32bb9518401c0948c5ea19377b5069ab |  
spark.sql.streaming.pollingDelay | 2.0.0 | SPARK-16002 | afa14b71b28d788c53816bd2616ccff0c3967f40#diff-32bb9518401c0948c5ea19377b5069ab |  
spark.sql.streaming.stopTimeout | 3.0.0 | SPARK-30143 | 4c37a8a3f4a489b52f1919d2db84f6e32c6a05cd#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.streaming.noDataProgressEventInterval | 2.1.1 | SPARK-19944 | 80ebca62cbdb7d5c8606e95a944164ab1a943694#diff-9a6b543db706f1a90f790783d6930a13 |
spark.sql.streaming.noDataMicroBatches.enabled | 2.4.1 | SPARK-24157 | 535bf1cc9e6b54df7059ac3109b8cba30057d040#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.streaming.metricsEnabled | 2.0.2 | SPARK-17731 | 881e0eb05782ea74cf92a62954466b14ea9e05b6#diff-32bb9518401c0948c5ea19377b5069ab |
spark.sql.streaming.numRecentProgressUpdates | 2.1.1 | SPARK-19944 | 80ebca62cbdb7d5c8606e95a944164ab1a943694#diff-9a6b543db706f1a90f790783d6930a13 |
spark.sql.streaming.checkpointFileManagerClass | 2.4.0 | SPARK-23966 | cbb41a0c5b01579c85f06ef42cc0585fbef216c5#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.streaming.checkpoint.escapedPathCheck.enabled | 3.0.0 | SPARK-26824 | 77b99af57330cf2e5016a6acc69642d54041b041#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.statistics.parallelFileListingInStatsComputation.enabled | 2.4.1 | SPARK-24626 | f11f44548903bbab7ab764574d6bed326cf4cd8d#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.defaultSizeInBytes | 1.1.0 | SPARK-2393 | c7db274be79f448fda566208946cb50958ea9b1a#diff-41ef65b9ef5b518f77e2a03559893f4d |  
spark.sql.statistics.fallBackToHdfs | 2.0.0 | SPARK-15960 | 5c53442cc098dd618ba1430962727c74b2de2e68#diff-32bb9518401c0948c5ea19377b5069ab |  
spark.sql.statistics.ndv.maxError | 2.1.1 | SPARK-19944 | 80ebca62cbdb7d5c8606e95a944164ab1a943694#diff-9a6b543db706f1a90f790783d6930a13 |
spark.sql.statistics.histogram.enabled | 2.3.0 | SPARK-17074 | 11b60af737a04d931356aa74ebf3c6cf4a6b08d6#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.statistics.histogram.numBins | 2.3.0 | SPARK-17074 | 11b60af737a04d931356aa74ebf3c6cf4a6b08d6#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.statistics.percentile.accuracy | 2.3.0 | SPARK-17074 | 11b60af737a04d931356aa74ebf3c6cf4a6b08d6#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.statistics.size.autoUpdate.enabled | 2.3.0 | SPARK-21127 | d5202259d9aa9ad95d572af253bf4a722b7b437a#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.cbo.enabled | 2.2.0 | SPARK-19944 | 0ee38a39e43dd7ad9d50457e446ae36f64621a1b#diff-9a6b543db706f1a90f790783d6930a13 |
spark.sql.cbo.planStats.enabled | 3.0.0 | SPARK-24690 | 3f3a18fff116a02ff7996d45a1061f48a2de3102#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.cbo.joinReorder.enabled | 2.2.0 | SPARK-19944 | 0ee38a39e43dd7ad9d50457e446ae36f64621a1b#diff-9a6b543db706f1a90f790783d6930a13 |
spark.sql.cbo.joinReorder.dp.threshold | 2.2.0 | SPARK-19944 | 0ee38a39e43dd7ad9d50457e446ae36f64621a1b#diff-9a6b543db706f1a90f790783d6930a13 |
spark.sql.cbo.joinReorder.card.weight | 2.2.0 | SPARK-19915 | c083b6b7dec337d680b54dabeaa40e7a0f69ae69#diff-9a6b543db706f1a90f790783d6930a13 |
spark.sql.cbo.joinReorder.dp.star.filter | 2.2.0 | SPARK-20233 | fbe4216e1e83d243a7f0521b76bfb20c25278281#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.cbo.starSchemaDetection | 2.2.0 | SPARK-17791 | 81639115947a13017d1637549a8f66ba599b27b8#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.cbo.starJoinFTRatio | 2.2.0 | SPARK-17791 | 81639115947a13017d1637549a8f66ba599b27b8#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.session.timeZone | 2.2.0 | SPARK-19944 | 0ee38a39e43dd7ad9d50457e446ae36f64621a1b#diff-9a6b543db706f1a90f790783d6930a13 |
spark.sql.windowExec.buffer.in.memory.threshold | 2.2.1 | SPARK-21595 | 406eb1c2ee670c2f14f2737c32c9aa0b8d35bf7c#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.windowExec.buffer.spill.threshold | 2.2.0 | SPARK-13450 | 02c274eaba0a8e7611226e0d4e93d3c36253f4ce#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.sortMergeJoinExec.buffer.in.memory.threshold | 2.2.1 | SPARK-21595 | 406eb1c2ee670c2f14f2737c32c9aa0b8d35bf7c#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.sortMergeJoinExec.buffer.spill.threshold | 2.2.0 | SPARK-13450 | 02c274eaba0a8e7611226e0d4e93d3c36253f4ce#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.cartesianProductExec.buffer.in.memory.threshold | 2.2.1 | SPARK-21595 | 406eb1c2ee670c2f14f2737c32c9aa0b8d35bf7c#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.cartesianProductExec.buffer.spill.threshold | 2.2.0 | SPARK-13450 | 02c274eaba0a8e7611226e0d4e93d3c36253f4ce#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.parser.quotedRegexColumnNames | 2.3.0 | SPARK-12139 | 2cbfc975ba937a4eb761de7a6473b7747941f386#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.execution.rangeExchange.sampleSizePerPartition | 2.3.0 | SPARK-22160 | 323806e68f91f3c7521327186a37ddd1436267d0#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.execution.arrow.enabled | 2.3.0 | SPARK-22159 | d29d1e87995e02cb57ba3026c945c3cd66bb06e2#diff-9a6b543db706f1a90f790783d6930a13 |
spark.sql.execution.arrow.pyspark.enabled | 3.0.0 | SPARK-27834 | db48da87f02e2e89710ba65fab8b07e9c85b9e74#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.execution.arrow.sparkr.enabled | 3.0.0 | SPARK-27834 | db48da87f02e2e89710ba65fab8b07e9c85b9e74#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.execution.arrow.fallback.enabled | 2.4.0 | SPARK-23380 | d6632d185e147fcbe6724545488ad80dce20277e#diff-9a6b543db706f1a90f790783d6930a13 |
spark.sql.execution.arrow.pyspark.fallback.enabled | 3.0.0 | SPARK-27834 | db48da87f02e2e89710ba65fab8b07e9c85b9e74#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.execution.arrow.maxRecordsPerBatch | 2.3.0 | SPARK-13534 | d03aebbe6508ba441dc87f9546f27aeb27553d77#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.execution.pandas.udf.buffer.size | 3.1.0 | SPARK-27870 | 692e3ddb4e517638156f7427ade8b62fb37634a7#diff-9a6b543db706f1a90f790783d6930a13 | Exists in master, not branch-3.0
spark.sql.legacy.execution.pandas.groupedMap.assignColumnsByName | 2.4.1 | SPARK-24324 | 3f203050ac764516e68fb43628bba0df5963e44d#diff-9a6b543db706f1a90f790783d6930a13 |
spark.sql.execution.pandas.convertToArrowArraySafely | 3.0.0 | SPARK-30812 | b76bc0b1b8b2abd00a84f805af90ca4c5925faaa#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.optimizer.replaceExceptWithFilter | 2.3.0 | SPARK-22181 | 01f6ba0e7a12ef818d56e7d5b1bd889b79f2b57c#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.decimalOperations.allowPrecisionLoss | 2.3.1 | SPARK-22036 | 8a98274823a4671cee85081dd19f40146e736325#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.legacy.literal.pickMinimumPrecision | 2.3.3 | SPARK-25454 | 26d893a4f64de18222942568f7735114447a6ab7#diff-9a6b543db706f1a90f790783d6930a13 |  

### Why are the changes needed?
Supplemental configuration version information.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Existing UTs

Closes #27770 from beliefer/add-version-to-sql-config-part-three.

Authored-by: beliefer <beliefer@163.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-03-06 11:19:08 +09:00
HyukjinKwon fc12165f48 [SPARK-31036][SQL] Use stringArgs in Expression.toString to respect hidden parameters
### What changes were proposed in this pull request?

This PR proposes to respect hidden parameters by using `stringArgs` in `Expression.toString`. This way, we can show the strings properly in some cases such as `NonSQLExpression`.

### Why are the changes needed?

To respect "hidden" arguments in the string representation.

### Does this PR introduce any user-facing change?

Yes, for example, on the top of https://github.com/apache/spark/pull/27657,

```scala
val identify = udf((input: Seq[Int]) => input)
spark.range(10).select(identify(array("id"))).show()
```

shows hidden parameter `useStringTypeWhenEmpty`.

```
+---------------------+
|UDF(array(id, false))|
+---------------------+
|                  [0]|
|                  [1]|
...
```

whereas:

```scala
spark.range(10).select(array("id")).show()
```

```
+---------+
|array(id)|
+---------+
|      [0]|
|      [1]|
...
```

### How was this patch tested?

Manually tested as below:

```scala
val identify = udf((input: Boolean) => input)
spark.range(10).select(identify(exists(array(col("id")), _ % 2 === 0))).show()
```

Before:

```
+-------------------------------------------------------------------------------------+
|UDF(exists(array(id), lambdafunction(((lambda 'x % 2) = 0), lambda 'x, false), true))|
+-------------------------------------------------------------------------------------+
|                                                                                 true|
|                                                                                false|
|                                                                                 true|
...
```

After:

```
+-------------------------------------------------------------------------------+
|UDF(exists(array(id), lambdafunction(((lambda 'x % 2) = 0), lambda 'x, false)))|
+-------------------------------------------------------------------------------+
|                                                                           true|
|                                                                          false|
|                                                                           true|
...
```

Closes #27788 from HyukjinKwon/arguments-str-repr.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-03-06 10:33:20 +09:00
DB Tsai fe126a6a05 [SPARK-31058][SQL][TEST-HIVE1.2] Consolidate the implementation of quoteIfNeeded
### What changes were proposed in this pull request?
There are two implementations of quoteIfNeeded: one in `org.apache.spark.sql.connector.catalog.CatalogV2Implicits.quote` and the other in `OrcFiltersBase.quoteAttributeNameIfNeeded`. This PR consolidates them into one.
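A minimal sketch of what such a helper typically does (illustrative; not necessarily the exact predicate the consolidated method uses):

```scala
// Quote an identifier part with backticks only when it contains characters that need quoting.
def quoteIfNeeded(part: String): String = {
  if (part.matches("[a-zA-Z0-9_]+")) part
  else "`" + part.replace("`", "``") + "`"
}

quoteIfNeeded("col_1")   // col_1
quoteIfNeeded("a.b c")   // `a.b c`
```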

### Why are the changes needed?
Simplify the codebase.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Existing UTs.

Closes #27814 from dbtsai/SPARK-31058.

Authored-by: DB Tsai <d_tsai@apple.com>
Signed-off-by: DB Tsai <d_tsai@apple.com>
2020-03-06 00:13:57 +00:00
Wenchen Fan ba86524b25 [SPARK-31037][SQL] refine AQE config names
### What changes were proposed in this pull request?

When introducing AQE to others, I feel the config names are a bit incoherent and hard to use.
This PR refines the config names:
1. remove the "shuffle" prefix. AQE is all about shuffle and we don't need to add the "shuffle" prefix everywhere.
2. `targetPostShuffleInputSize` is obscure, rename to `advisoryShufflePartitionSizeInBytes`.
3. `reducePostShufflePartitions` doesn't match the actual optimization, rename to `coalesceShufflePartitions`
4. `minNumPostShufflePartitions` is obscure, rename it `minPartitionNum` under the `coalesceShufflePartitions` namespace
5. `maxNumPostShufflePartitions` is confusing with the word "max", rename it `initialPartitionNum`
6. `skewedJoinOptimization` is too verbose. skew join is a well-known terminology in database area, we can just say `skewJoin`

### Why are the changes needed?

Make the config names easy to understand.

### Does this PR introduce any user-facing change?

deprecate the config `spark.sql.adaptive.shuffle.targetPostShuffleInputSize`

### How was this patch tested?

N/A

Closes #27793 from cloud-fan/aqe.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-06 00:46:34 +08:00
Maxim Gekk 1fd9a91c66 [SPARK-31005][SQL] Support time zone ids in casting strings to timestamps
### What changes were proposed in this pull request?
In the PR, I propose to change `DateTimeUtils.stringToTimestamp` to support any valid time zone id at the end of input string. After the changes, the function accepts zone ids in the formats:
- no zone id. In that case, the function uses the local session time zone from the SQL config `spark.sql.session.timeZone`
- -[h]h:[m]m
- +[h]h:[m]m
- Z
- Short zone id, see https://docs.oracle.com/javase/8/docs/api/java/time/ZoneId.html#SHORT_IDS
- Zone ID starts with 'UTC+', 'UTC-', 'GMT+', 'GMT-', 'UT+' or 'UT-'. The ID is split in two, with a two or three letter prefix and a suffix starting with the sign. The suffix must be in the formats:
  - +|-h[h]
  - +|-hh[:]mm
  - +|-hh:mm:ss
  - +|-hhmmss
- Region-based zone IDs in the form `{area}/{city}`, such as `Europe/Paris` or `America/New_York`. The default set of region ids is supplied by the IANA Time Zone Database (TZDB).

### Why are the changes needed?
- To use `stringToTimestamp` as a substitution of removed `stringToTime`, see https://github.com/apache/spark/pull/27710#discussion_r385020173
- Improve UX of Spark SQL by allowing flexible formats of zone ids. Currently, Spark accepts only `Z` and zone offsets, which can be inconvenient when a time zone offset is shifted due to daylight saving rules. For instance:
```sql
spark-sql> select cast('2015-03-18T12:03:17.123456 Europe/Moscow' as timestamp);
NULL
```

### Does this PR introduce any user-facing change?
Yes. After the changes, casting strings to timestamps allows time zone id at the end of the strings:
```sql
spark-sql> select cast('2015-03-18T12:03:17.123456 Europe/Moscow' as timestamp);
2015-03-18 12:03:17.123456
```

### How was this patch tested?
- Added new test cases to the `string to timestamp` test in `DateTimeUtilsSuite`.
- Run `CastSuite` and `AnsiCastSuite`.

Closes #27753 from MaxGekk/stringToTimestamp-uni-zoneId.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-05 20:49:43 +08:00
Wenchen Fan 807ea413b4 [SPARK-31019][SQL] make it clear that people can deduplicate map keys
### What changes were proposed in this pull request?

rename the config and make it non-internal.

### Why are the changes needed?

Now we fail the query if duplicated map keys are detected, and provide a legacy config to deduplicate them. However, we must provide a way to get users out of this situation instead of just refusing to run the query. This exit strategy should always be there, while a legacy config indicates that it may be removed someday.
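Illustrative usage of the documented exit strategy, assuming the renamed (non-legacy) config is `spark.sql.mapKeyDedupPolicy` with a `LAST_WIN` option:

```scala
// With the default policy, duplicated keys fail the query; LAST_WIN deduplicates instead.
spark.conf.set("spark.sql.mapKeyDedupPolicy", "LAST_WIN")
spark.sql("SELECT map(1, 'a', 1, 'b')").show()   // expected: {1 -> b}
```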

### Does this PR introduce any user-facing change?

no, just rename a config which was added in 3.0

### How was this patch tested?

add more tests for the fail behavior.

Closes #27772 from cloud-fan/map.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-03-05 20:43:52 +09:00
Kent Yao f45ae7f2c5 [SPARK-31038][SQL] Add checkValue for spark.sql.session.timeZone
### What changes were proposed in this pull request?

The `spark.sql.session.timeZone` config currently accepts any string value, including invalid time zone ids, which then fails other queries that rely on the time zone. We should check the value in the set phase and fail fast if the zone value is invalid.
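A self-contained sketch of the fail-fast idea using plain java.time (the function name is an assumption, not Spark's ConfigBuilder API, and Spark accepts a few more zone formats than `ZoneId.of` alone):

```scala
import java.time.ZoneId
import scala.util.Try

// Validate a session time zone id when it is set, instead of failing later queries.
def validateTimeZone(zone: String): Unit =
  require(Try(ZoneId.of(zone)).isSuccess,
    s"Cannot resolve the SQL session time zone: '$zone' is not a valid time zone id")

validateTimeZone("America/Los_Angeles")   // ok
// validateTimeZone("Mars/Olympus_Mons")  // would fail fast with IllegalArgumentException
```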

### Why are the changes needed?

Improve the configuration's robustness.

### Does this PR introduce any user-facing change?

Yes, it will fail fast if the value is an invalid time zone id.

### How was this patch tested?

add ut

Closes #27792 from yaooqinn/SPARK-31038.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-05 19:38:20 +08:00
Terry Kim 66b4fd040e [SPARK-31024][SQL] Allow specifying session catalog name spark_catalog in qualified column names for v1 tables
### What changes were proposed in this pull request?

Currently, the user cannot specify the session catalog name (`spark_catalog`) in qualified column names for v1 tables:
```
SELECT spark_catalog.default.t.i FROM spark_catalog.default.t
```
fails with `cannot resolve 'spark_catalog.default.t.i`.

This is inconsistent with v2 table behavior where catalog name can be used:
```
SELECT testcat.ns1.tbl.id FROM testcat.ns1.tbl.id
```

This PR proposes to fix the inconsistency and allow the user to specify session catalog name in column names for v1 tables.

### Why are the changes needed?

Fixing an inconsistent behavior.

### Does this PR introduce any user-facing change?

Yes, now the following query works:
```
SELECT spark_catalog.default.t.i FROM spark_catalog.default.t
```

### How was this patch tested?

Added new tests.

Closes #27776 from imback82/spark_catalog_col_name_resolution.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-05 18:33:59 +08:00
Yuanjian Li 7db0af5785 [SPARK-30668][SQL][FOLLOWUP] Raise exception instead of silent change for new DateFormatter
### What changes were proposed in this pull request?
This is follow-up work for #27441. For the cases where the new TimestampFormatter returns null while the legacy formatter can return a value, we need to throw an exception instead of silently changing the result. The legacy config will be referenced in the error message.

### Why are the changes needed?
Avoid silent result change for new behavior in 3.0.

### Does this PR introduce any user-facing change?
Yes, an exception is thrown when we detect that the legacy formatter can parse the string but the new formatter returns null.

### How was this patch tested?
Extend existing UT.

Closes #27537 from xuanyuanking/SPARK-30668-follow.

Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-05 15:29:39 +08:00
Terry Kim b30278107f [SPARK-30885][SQL][FOLLOW-UP] Fix issues where some V1 commands allow tables that are not fully qualified
### What changes were proposed in this pull request?

There are a few V1 commands, such as `REFRESH TABLE`, that still allow `spark_catalog.t` because they run with parsed table names without trying to load them in the catalog. This PR addresses this issue.

The PR also addresses the issue brought up in https://github.com/apache/spark/pull/27642#discussion_r382402104.

### Why are the changes needed?

To fix a bug where for some V1 commands, `spark_catalog.t` is allowed.

### Does this PR introduce any user-facing change?

Yes, a bug is fixed and `REFRESH TABLE spark_catalog.t` is not allowed.

### How was this patch tested?

Added new test.

Closes #27718 from imback82/fix_TempViewOrV1Table.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-04 18:09:48 +08:00
Wenchen Fan e4c61e35da [SPARK-30960][SQL] add back the legacy date/timestamp format support in CSV/JSON parser
### What changes were proposed in this pull request?

Before Spark 3.0, the JSON/CSV parser has a special behavior that, when the parser fails to parse a timestamp/date, fallback to another way to parse it, to support some legacy format. The fallback was removed by https://issues.apache.org/jira/browse/SPARK-26178 and https://issues.apache.org/jira/browse/SPARK-26243.

This PR adds back this legacy fallback. Since we switch the API to do datetime operations, we can't be exactly the same as before. Here we add back the support of the legacy formats that are common (examples of Spark 2.4):
1. the fields can have one or two digits
```
scala> sql("""select from_json('{"time":"1123-2-22 2:22:22"}', 'time Timestamp')""").show(false)
+-------------------------------------------+
|jsontostructs({"time":"1123-2-22 2:22:22"})|
+-------------------------------------------+
|[1123-02-22 02:22:22]                      |
+-------------------------------------------+
```
2. the separator between data and time can be "T" as well
```
scala> sql("""select from_json('{"time":"2000-12-12T12:12:12"}', 'time Timestamp')""").show(false)
+---------------------------------------------+
|jsontostructs({"time":"2000-12-12T12:12:12"})|
+---------------------------------------------+
|[2000-12-12 12:12:12]                        |
+---------------------------------------------+
```
3. the fraction of seconds can be of arbitrary length
```
scala> sql("""select from_json('{"time":"1123-02-22T02:22:22.123456789123"}', 'time Timestamp')""").show(false)
+----------------------------------------------------------+
|jsontostructs({"time":"1123-02-22T02:22:22.123456789123"})|
+----------------------------------------------------------+
|[1123-02-15 02:22:22.123]                                 |
+----------------------------------------------------------+
```
4. the date string can end with arbitrary characters after "T" or a space
```
scala> sql("""select from_json('{"time":"1123-02-22Tabc"}', 'time date')""").show(false)
+----------------------------------------+
|jsontostructs({"time":"1123-02-22Tabc"})|
+----------------------------------------+
|[1123-02-22]                            |
+----------------------------------------+
```
5. remove "GMT" from the string before parsing
```
scala> sql("""select from_json('{"time":"GMT1123-2-22 2:22:22.123"}', 'time Timestamp')""").show(false)
+--------------------------------------------------+
|jsontostructs({"time":"GMT1123-2-22 2:22:22.123"})|
+--------------------------------------------------+
|[1123-02-22 02:22:22.123]                         |
+--------------------------------------------------+
```
### Why are the changes needed?

It doesn't hurt to keep this legacy support. It just makes the parsing more relaxed.

### Does this PR introduce any user-facing change?

Yes, it makes 3.0 support parsing most of the CSV/JSON values that were supported before.

### How was this patch tested?

new tests

Closes #27710 from cloud-fan/bug2.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-03-04 18:27:44 +09:00
Takeshi Yamamuro 4a1d273a4a [SPARK-30997][SQL] Fix an analysis failure in generators with aggregate functions
### What changes were proposed in this pull request?

We have supported generators in SQL aggregate expressions by SPARK-28782.
But, the generator(explode) query with aggregate functions in DataFrame failed as follows;

```
// SPARK-28782: Generator support in aggregate expressions
scala> spark.range(3).toDF("id").createOrReplaceTempView("t")
scala> sql("select explode(array(min(id), max(id))) from t").show()
+---+
|col|
+---+
|  0|
|  2|
+---+

// A failure case handled in this pr
scala> spark.range(3).select(explode(array(min($"id"), max($"id")))).show()
org.apache.spark.sql.AnalysisException:
The query operator `Generate` contains one or more unsupported
expression types Aggregate, Window or Generate.
Invalid expressions: [min(`id`), max(`id`)];;
Project [col#46L]
+- Generate explode(array(min(id#42L), max(id#42L))), false, [col#46L]
   +- Range (0, 3, step=1, splits=Some(4))

  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis(CheckAnalysis.scala:49)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis$(CheckAnalysis.scala:48)
  at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:129)
```

The root cause is that `ExtractGenerator` wrongly replaces a project w/ aggregate functions
before `GlobalAggregates` replaces it with an aggregate as follows;

```
scala> sql("SET spark.sql.optimizer.planChangeLog.level=warn")
scala> spark.range(3).select(explode(array(min($"id"), max($"id")))).show()

20/03/01 12:51:58 WARN HiveSessionStateBuilder$$anon$1:
=== Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences ===
!'Project [explode(array(min('id), max('id))) AS List()]   'Project [explode(array(min(id#72L), max(id#72L))) AS List()]
 +- Range (0, 3, step=1, splits=Some(4))                   +- Range (0, 3, step=1, splits=Some(4))

20/03/01 12:51:58 WARN HiveSessionStateBuilder$$anon$1:
=== Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractGenerator ===
!'Project [explode(array(min(id#72L), max(id#72L))) AS List()]   Project [col#76L]
!+- Range (0, 3, step=1, splits=Some(4))                         +- Generate explode(array(min(id#72L), max(id#72L))), false, [col#76L]
!                                                                   +- Range (0, 3, step=1, splits=Some(4))

20/03/01 12:51:58 WARN HiveSessionStateBuilder$$anon$1:
=== Result of Batch Resolution ===
!'Project [explode(array(min('id), max('id))) AS List()]   Project [col#76L]
!+- Range (0, 3, step=1, splits=Some(4))                   +- Generate explode(array(min(id#72L), max(id#72L))), false, [col#76L]
!                                                             +- Range (0, 3, step=1, splits=Some(4))

// the analysis failed here...
```

To avoid this case in `ExtractGenerator`, this PR adds a condition to ignore generators that contain aggregate functions.
A correct sequence of rules is as follows;

```
20/03/01 13:19:06 WARN HiveSessionStateBuilder$$anon$1:
=== Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences ===
!'Project [explode(array(min('id), max('id))) AS List()]   'Project [explode(array(min(id#27L), max(id#27L))) AS List()]
 +- Range (0, 3, step=1, splits=Some(4))                   +- Range (0, 3, step=1, splits=Some(4))

20/03/01 13:19:06 WARN HiveSessionStateBuilder$$anon$1:
=== Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$GlobalAggregates ===
!'Project [explode(array(min(id#27L), max(id#27L))) AS List()]   'Aggregate [explode(array(min(id#27L), max(id#27L))) AS List()]
 +- Range (0, 3, step=1, splits=Some(4))                         +- Range (0, 3, step=1, splits=Some(4))

20/03/01 13:19:06 WARN HiveSessionStateBuilder$$anon$1:
=== Applying Rule org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractGenerator ===
!'Aggregate [explode(array(min(id#27L), max(id#27L))) AS List()]   'Project [explode(_gen_input_0#31) AS List()]
!+- Range (0, 3, step=1, splits=Some(4))                           +- Aggregate [array(min(id#27L), max(id#27L)) AS _gen_input_0#31]
!                                                                     +- Range (0, 3, step=1, splits=Some(4))

```

### Why are the changes needed?

A bug fix.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Added tests.

Closes #27749 from maropu/ExplodeInAggregate.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-03-03 12:25:12 -08:00
Terry Kim c263c15408 [SPARK-31015][SQL] Star(*) expression fails when used with qualified column names for v2 tables
### What changes were proposed in this pull request?

For a v2 table created with `CREATE TABLE testcat.ns1.ns2.tbl (id bigint, name string) USING foo`, the following works as expected
```
SELECT testcat.ns1.ns2.tbl.id FROM testcat.ns1.ns2.tbl
```
, but a query with qualified column name with star(*)
```
SELECT testcat.ns1.ns2.tbl.* FROM testcat.ns1.ns2.tbl
[info]   org.apache.spark.sql.AnalysisException: cannot resolve 'testcat.ns1.ns2.tbl.*' given input columns 'id, name';
```
fails to resolve. And this PR proposes to fix this issue.

### Why are the changes needed?

To fix the bug described above.

### Does this PR introduce any user-facing change?

Yes, now `SELECT testcat.ns1.ns2.tbl.* FROM testcat.ns1.ns2.tbl` works as expected.

### How was this patch tested?

Added new test.

Closes #27766 from imback82/fix_star_expression.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-04 00:55:26 +08:00
Takeshi Yamamuro 313e62c376 [SPARK-30998][SQL] ClassCastException when a generator having nested inner generators
### What changes were proposed in this pull request?

A query below failed in the master;

```
scala> sql("select array(array(1, 2), array(3)) ar").select(explode(explode($"ar"))).show()
20/03/01 13:51:56 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)/ 1]
java.lang.ClassCastException: scala.collection.mutable.ArrayOps$ofRef cannot be cast to org.apache.spark.sql.catalyst.util.ArrayData
	at org.apache.spark.sql.catalyst.expressions.ExplodeBase.eval(generators.scala:313)
	at org.apache.spark.sql.execution.GenerateExec.$anonfun$doExecute$8(GenerateExec.scala:108)
	at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490)
	at scala.collection.Iterator$ConcatIterator.hasNext(Iterator.scala:222)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
    ...
```

This PR modifies the `hasNestedGenerator` check in `ExtractGenerator` to correctly catch nested inner generators.

### Why are the changes needed?

A bug fix.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Added tests.

Closes #27750 from maropu/HandleNestedGenerators.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-03-03 19:00:33 +09:00
Josh Rosen f0010c81e2 [SPARK-31003][TESTS] Fix incorrect uses of assume() in tests
### What changes were proposed in this pull request?

This patch fixes several incorrect uses of `assume()` in our tests.

If a call to `assume(condition)` fails then it will cause the test to be marked as skipped instead of failed: this feature allows test cases to be skipped if certain prerequisites are missing. For example, we use this to skip certain tests when running on Windows (or when Python dependencies are unavailable).

In contrast, `assert(condition)` will fail the test if the condition doesn't hold.

If `assume()` is accidentally substituted for `assert()`, then the resulting test will be marked as skipped in cases where it should have failed, undermining the purpose of the test.

This patch fixes several such cases, replacing certain `assume()` calls with `assert()`.
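
A hypothetical, self-contained sketch of the difference (assuming ScalaTest 3.1+; Spark's own suites extend `SparkFunSuite`, which exposes the same assertions):

```
import org.scalatest.funsuite.AnyFunSuite

class AssumeVsAssertExample extends AnyFunSuite {
  test("illustrates assume vs assert") {
    // assume: a false precondition marks the test as CANCELED/SKIPPED
    assume(sys.env.contains("OPTIONAL_TOOL_HOME"), "optional tool not installed, skipping")

    // assert: a false condition marks the test as FAILED
    assert(Seq(1, 2, 3).sum == 6)
  }
}
```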

Credit to ahirreddy for spotting this problem.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #27754 from JoshRosen/fix-assume-vs-assert.

Lead-authored-by: Josh Rosen <rosenville@gmail.com>
Co-authored-by: Josh Rosen <joshrosen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-03-02 15:20:45 -08:00
Jungtaek Lim (HeartSaVioR) f24a46011c [SPARK-30993][SQL] Use its sql type for UDT when checking the type of length (fixed/var) or mutable
### What changes were proposed in this pull request?

This patch fixes a bug in UnsafeRow, which fails to handle UDTs specifically in `isFixedLength` and `isMutable`. These methods don't check the UDT's underlying SQL type, always treating UDTs as variable-length and non-mutable.

It doesn't cause any issue when a UDT represents a complicated type, but when a UDT is backed by a fixed-length SQL type, it opens the door to correctness issues, as this information sometimes decides how the value should be handled.

We got a report from the user mailing list suspecting that mapGroupsWithState handles UDTs incorrectly, but after some investigation the issue turned out to come from GenerateUnsafeRowJoiner in the shuffle phase.

0e2ca11d80/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeRowJoiner.scala (L32-L43)

Here, the offset update should not happen on a fixed-length column, but due to this bug the value of a UDT backed by a fixed-length SQL type would be modified, which corrupts the value.
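
A minimal sketch of the fix's idea (simplified helper, not the actual `UnsafeRow` code; the real logic also special-cases `DecimalType` by precision):

```
import org.apache.spark.sql.types._

// Resolve a UDT to its underlying SQL type before classifying length/mutability.
def storageType(dt: DataType): DataType = dt match {
  case udt: UserDefinedType[_] => storageType(udt.sqlType)
  case other => other
}

def isFixedLengthSketch(dt: DataType): Boolean = storageType(dt) match {
  case BooleanType | ByteType | ShortType | IntegerType | LongType |
       FloatType | DoubleType | DateType | TimestampType => true
  case _ => false // strings, binaries, arrays, maps, structs remain variable-length
}
```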

### Why are the changes needed?

Misclassifying the length type of a UDT can corrupt the value when the row is fed into GenerateUnsafeRowJoiner, which is a correctness issue.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

New UT added.

Closes #27747 from HeartSaVioR/SPARK-30993.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-03-02 22:33:11 +08:00
Jiaan Geng a429ac83e4 [SPARK-30841][SQL][DOC][FOLLOW-UP] Add version information to the configuration of SQL
### What changes were proposed in this pull request?
This PR follows https://github.com/apache/spark/pull/27691
I sorted out some information, shown below.

Item name | Since version | JIRA ID | Commit ID | Note
-- | -- | -- | -- | --
spark.sql.orc.compression.codec | 2.3.0 | SPARK-21839 | d8f45408635d4fccac557cb1e877dfe9267fb326#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.orc.impl | 2.3.0 | SPARK-20728 | 326f1d6728a7734c228d8bfaa69442a1c7b92e9b#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.orc.enableVectorizedReader | 2.3.0 | SPARK-16060 | 60f6b994505e3f82091a04eed2dc0a9e8bd523ce#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.orc.columnarReaderBatchSize | 2.4.0 | SPARK-23188 | cc41245fa3f954f961541bf4b4275c28473042b8#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.orc.filterPushdown | 1.4.0 | SPARK-2883 | 65d71bd9fbfe6fe1b741c80fed72d6ae3d22b028#diff-41ef65b9ef5b518f77e2a03559893f4d
spark.sql.orc.mergeSchema | 3.0.0 | SPARK-11412 | 73183b3c8c2022846587f08e8dea5c387ed3b8d5#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.hive.verifyPartitionPath | 1.4.0 | Spark-5068 | 1f39a61118184e136f38381a9f3ba0b2d5d589d9#diff-41ef65b9ef5b518f77e2a03559893f4d
spark.sql.hive.metastorePartitionPruning | 1.5.0 | SPARK-9386 | ce89ff477aea6def68265ed218f6105680755c9a#diff-41ef65b9ef5b518f77e2a03559893f4d
spark.sql.hive.manageFilesourcePartitions | 2.1.1 | SPARK-19944 | 80ebca62cbdb7d5c8606e95a944164ab1a943694#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.hive.filesourcePartitionFileCacheSize | 2.1.1 | SPARK-19944 | 80ebca62cbdb7d5c8606e95a944164ab1a943694#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.hive.caseSensitiveInferenceMode | 2.1.1 | SPARK-19944 | 80ebca62cbdb7d5c8606e95a944164ab1a943694#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.optimizer.metadataOnly | 2.1.1 | SPARK-19944 | 80ebca62cbdb7d5c8606e95a944164ab1a943694#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.columnNameOfCorruptRecord | 1.2.0 | SPARK-3339 | 1c7f0ab302de9f82b1bd6da852d133823bc67c66#diff-41ef65b9ef5b518f77e2a03559893f4d
spark.sql.broadcastTimeout | 1.3.0 | SPARK-4269 | fa66ef6c97e87c9255b67b03836a4ba50598ebae#diff-41ef65b9ef5b518f77e2a03559893f4d
spark.sql.thriftserver.scheduler.pool | 1.1.1 | SPARK-3025 | 496f62d9a98067256d8a51fd1e7a485ff6492fa8#diff-41ef65b9ef5b518f77e2a03559893f4d
spark.sql.thriftServer.incrementalCollect | 2.0.3 | SPARK-18857 | c94288b57b5ce2232e58e35cada558d8d5b8ec6e#diff-32bb9518401c0948c5ea19377b5069ab
spark.sql.thriftserver.ui.retainedStatements | 1.4.0 | SPARK-5100 | 343d3bfafd449a0371feb6a88f78e07302fa7143#diff-41ef65b9ef5b518f77e2a03559893f4d
spark.sql.thriftserver.ui.retainedSessions | 1.4.0 | SPARK-5100 | 343d3bfafd449a0371feb6a88f78e07302fa7143#diff-41ef65b9ef5b518f77e2a03559893f4d
spark.sql.sources.default | 1.3.0 | SPARK-5658 | a21090ebe1ef7a709709300712de7d928a923244#diff-41ef65b9ef5b518f77e2a03559893f4d
spark.sql.hive.convertCTAS | 2.0.0 | SPARK-15646 | 5a835b99f9852b0c2a35f9c75a51d493474994ea#diff-32bb9518401c0948c5ea19377b5069ab
spark.sql.hive.gatherFastStats | 2.0.1 | SPARK-17063 | 3d283f6c9d9daef53fa4e90b0ead2a94710a37a7#diff-32bb9518401c0948c5ea19377b5069ab
spark.sql.sources.partitionColumnTypeInference.enabled | 1.5.0 | SPARK-7939 | 03ef6be9ce61a13dcd9d8c71298fb4be39119411#diff-41ef65b9ef5b518f77e2a03559893f4d
spark.sql.sources.bucketing.enabled | 2.0.0 | SPARK-13486 | 2b2c8c33236677c916541f956f7b94bba014a9ce#diff-32bb9518401c0948c5ea19377b5069ab
spark.sql.sources.bucketing.maxBuckets | 2.4.0 | SPARK-23997 | de46df549acee7fda56bb0871f444d2f3b49e582#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.crossJoin.enabled | 2.0.0 | SPARK-15425 | 4462da7071462084c5b55cc414c7faa0e1396a18#diff-32bb9518401c0948c5ea19377b5069ab
spark.sql.orderByOrdinal | 2.0.0 | SPARK-12789 | 2c5b18fb0fdeabd378dd97e91f72d1eac4e21cc7#diff-32bb9518401c0948c5ea19377b5069ab
spark.sql.groupByOrdinal | 2.0.0 | SPARK-13957 | 05f652d6c2bbd764a1dd5a45301811e14519486f#diff-32bb9518401c0948c5ea19377b5069ab
spark.sql.groupByAliases | 2.2.0 | SPARK-14471 | af3a1411a28796d4d9a100eefb093b1d91532754#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.sources.outputCommitterClass | 1.4.0 | SPARK-7567 | a385f4b8dd22e0e056569cffc4fa63047cb7c8f2#diff-41ef65b9ef5b518f77e2a03559893f4d
spark.sql.sources.commitProtocolClass | 2.1.1 | SPARK-19944 | 80ebca62cbdb7d5c8606e95a944164ab1a943694#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.sources.parallelPartitionDiscovery.threshold | 1.5.0 | SPARK-8125 | a1064df0ee3daf496800be84293345a10e1497d9#diff-41ef65b9ef5b518f77e2a03559893f4d
spark.sql.sources.parallelPartitionDiscovery.parallelism | 2.1.1 | SPARK-19944 | 80ebca62cbdb7d5c8606e95a944164ab1a943694#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.sources.ignoreDataLocality | 3.0.0 | SPARK-30812 | b76bc0b1b8b2abd00a84f805af90ca4c5925faaa#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.selfJoinAutoResolveAmbiguity | 1.4.0 | SPARK-6231 | e61083ccab7764d1929248490a3d2e83987241e0#diff-41ef65b9ef5b518f77e2a03559893f4d
spark.sql.analyzer.failAmbiguousSelfJoin | 3.0.0 | SPARK-30812 | b76bc0b1b8b2abd00a84f805af90ca4c5925faaa#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.retainGroupColumns | 1.4.0 | SPARK-7462 | 9c35f02b35fda80d6558573466735e79b3dd9124#diff-41ef65b9ef5b518f77e2a03559893f4d
spark.sql.pivotMaxValues | 1.6.0 | SPARK-8992 | 5940fc71d2a245cc6e50edb455c3dd3dbb8de43a#diff-41ef65b9ef5b518f77e2a03559893f4d
spark.sql.runSQLOnFiles | 1.6.0 | SPARK-11197 | f8c6bec65784de89b47e96a367d3f9790c1b3115#diff-41ef65b9ef5b518f77e2a03559893f4d
spark.sql.codegen.wholeStage | 2.0.0 | SPARK-13486 | 2b2c8c33236677c916541f956f7b94bba014a9ce#diff-32bb9518401c0948c5ea19377b5069ab
spark.sql.codegen.useIdInClassName | 2.3.1 | SPARK-23032 | 26a8b4e398ee6d1de06a5f3ac1d6d342c9b67d78#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.codegen.maxFields | 2.0.0 | SPARK-14224 and SPARK-14223 and SPARK-14310 | 5a4b11a901703464b9261dea0642d80cf8d4856c#diff-32bb9518401c0948c5ea19377b5069ab
spark.sql.codegen.factoryMode | 2.4.0 | SPARK-23711 | a40ffc656d62372da85e0fa932b67207839e7fde#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.codegen.fallback | 2.0.0 | SPARK-15759 | f0fa0a8946fb4bdf0f4697a8e389f49e98422871#diff-32bb9518401c0948c5ea19377b5069ab
spark.sql.codegen.logging.maxLines | 2.3.0 | SPARK-20871 | 2a53fbfce72b3faef020e39a1e8628d68bc95beb#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.codegen.hugeMethodLimit | 2.3.0 | SPARK-21871 | 4a779bdac3e75c17b7d36c5a009ba6c948fa9fb6#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.codegen.methodSplitThreshold | 3.0.0 | SPARK-25850 | e017cb39642a5039abd8ce8127ad41712901bdbc#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.codegen.splitConsumeFuncByOperator | 2.3.1 | SPARK-21717 | c79e771f8952e6773c3a84cc617145216feddbcf#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.files.maxPartitionBytes | 2.0.0 | SPARK-13664 | 17eec0a71ba8713c559d641e3f43a1be726b037c#diff-32bb9518401c0948c5ea19377b5069ab
spark.sql.files.openCostInBytes | 2.0.0 | SPARK-14259 | 400b2f863ffaa01a34a8dae1541c61526fef908b#diff-32bb9518401c0948c5ea19377b5069ab
spark.sql.files.ignoreCorruptFiles | 2.1.1 | SPARK-19944 | 80ebca62cbdb7d5c8606e95a944164ab1a943694#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.files.ignoreMissingFiles | 2.3.0 | SPARK-22366 | 8e9863531bebbd4d83eafcbc2b359b8bd0ac5734#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.files.maxRecordsPerFile | 2.2.0 | SPARK-19944 | 0ee38a39e43dd7ad9d50457e446ae36f64621a1b#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.exchange.reuse | 2.0.0 | SPARK-13523 | 3dc9ae2e158e5b51df6f799767946fe1d190156b#diff-32bb9518401c0948c5ea19377b5069ab
spark.sql.execution.reuseSubquery | 3.0.0 | SPARK-30812 | b76bc0b1b8b2abd00a84f805af90ca4c5925faaa#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.streaming.stateStore.providerClass | 2.3.0 | SPARK-20883 and SPARK-20376 | fa757ee1d41396ad8734a3f2dd045bb09bc82a2e#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.streaming.stateStore.minDeltasForSnapshot | 2.0.0 | SPARK-13809 | 8c826880f5eaa3221c4e9e7d3fece54e821a0b98#diff-32bb9518401c0948c5ea19377b5069ab
spark.sql.streaming.flatMapGroupsWithState.stateFormatVersion | 2.4.0 | SPARK-22187 | b3d88ac02940eff4c867d3acb79fe5ff9d724e83#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.streaming.checkpointLocation | 2.0.0 | SPARK-13985 | caea15214571d9b12dcf1553e5c1cc8b83a8ba5b#diff-32bb9518401c0948c5ea19377b5069ab
spark.sql.streaming.forceDeleteTempCheckpointLocation | 3.0.0 | SPARK-30812 | b76bc0b1b8b2abd00a84f805af90ca4c5925faaa#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.streaming.minBatchesToRetain | 2.1.1 | SPARK-19944 | 80ebca62cbdb7d5c8606e95a944164ab1a943694#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.streaming.maxBatchesToRetainInMemory | 2.4.0 | SPARK-24717 | 8b7d4f842fdc90b8d1c37080bdd9b5e1d070f5c0#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.streaming.aggregation.stateFormatVersion | 2.4.0 | SPARK-24763 | 6c5cb85856235efd464b109558896f81ae2c4c75#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.streaming.stopActiveRunOnRestart | 3.0.0 | SPARK-29568 | 363af16c72abe19fc5cc5b5bdf9d8dc34975f2ba#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.streaming.join.stateFormatVersion | 3.0.0 | SPARK-26154 | c941362cb94b24bdf48d4928a1a4dff1b13a1484#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.streaming.unsupportedOperationCheck | 2.0.0 | SPARK-14473 | 775cf17eaaae1a38efe47b282b1d6bbdb99bd759#diff-32bb9518401c0948c5ea19377b5069ab
spark.sql.variable.substitute | 2.0.0 | SPARK-14769 | 334c293ec0bcc2195d502c574ca40dbc4769d666#diff-32bb9518401c0948c5ea19377b5069ab
spark.sql.codegen.aggregate.map.twolevel.enabled | 2.3.0 | SPARK-22159 | d29d1e87995e02cb57ba3026c945c3cd66bb06e2#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.codegen.aggregate.map.vectorized.enable | 3.0.0 | SPARK-28257 | 42b80ae128ab1aa8a87c1376fe88e2cde52e6e4f#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.codegen.aggregate.splitAggregateFunc.enabled | 3.0.0 | SPARK-21870 | cb0cddffe9452937033e0e6b1fc0e600d2c787ad#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.view.maxNestedViewDepth | 2.2.0 | SPARK-19877 | ee36bc1c9043ead3c3ba4fba7e68c6c47ad7ae7a#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.streaming.commitProtocolClass | 2.1.0 | SPARK-19944 | 80ebca62cbdb7d5c8606e95a944164ab1a943694#diff-9a6b543db706f1a90f790783d6930a13
spark.sql.streaming.multipleWatermarkPolicy | 2.4.0 | SPARK-24730 | 6078b891da8fe7fc36579699473168ae7443284c#diff-9a6b543db706f1a90f790783d6930a13

### Why are the changes needed?
Supplemental configuration version information.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Existing UTs.

Closes #27730 from beliefer/add-version-to-sql-config-part-two.

Lead-authored-by: Jiaan Geng <beliefer@163.com>
Co-authored-by: beliefer <beliefer@163.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-03-02 15:13:58 +09:00
iRakson 92a5ae2ae4 [SPARK-30234][SQL][FOLLOWUP] Rename spark.sql.legacy.addDirectory.recursive.enabled to spark.sql.legacy.addSingleFileInAddFile
### What changes were proposed in this pull request?
Rename `spark.sql.legacy.addDirectory.recursive.enabled` to `spark.sql.legacy.addSingleFileInAddFile`

### Why are the changes needed?
To follow the naming convention

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Existing UTs.

Closes #27725 from iRakson/SPARK-30234_CONFIG.

Authored-by: iRakson <raksonrakesh@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-03-01 10:55:41 +09:00
iRakson a40a2f8338 [SPARK-27619][SQL][FOLLOWUP] Rename 'spark.sql.legacy.useHashOnMapType' to 'spark.sql.legacy.allowHashOnMapType'
### What changes were proposed in this pull request?
Renamed configuration from `spark.sql.legacy.useHashOnMapType` to `spark.sql.legacy.allowHashOnMapType`.

### Why are the changes needed?
Better readability of configuration.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Existing UTs.

Closes #27719 from iRakson/SPARK-27619_FOLLOWUP.

Authored-by: iRakson <raksonrakesh@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-28 22:57:50 +08:00
Wenchen Fan f21894e5fa [SPARK-30902][SQL] Default table provider should be decided by catalog implementations
### What changes were proposed in this pull request?

When `CREATE TABLE` SQL statement does not specify the provider, leave it to the catalog implementations to decide.

### Why are the changes needed?

It's super weird if we set the default provider to parquet when creating a table in a JDBC catalog.

### Does this PR introduce any user-facing change?

Yes, v2 catalog will not see a "provider" property in table properties if it's not specified in `CREATE TABLE` SQL statement. V2 catalog is new in 3.0.

### How was this patch tested?

new tests

Closes #27650 from cloud-fan/create_table.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-02-28 15:14:23 +09:00
Liang-Chi Hsieh ba032acf95 [SPARK-30955][SQL] Exclude Generate output when aliasing in nested column pruning
### What changes were proposed in this pull request?

When aliasing in nested column pruning in Project on top of Generate, we should exclude Generate outputs.

### Why are the changes needed?

Right now we would prune nested columns in a Project on top of a Generate. It is possible that the referred nested columns come from Generate's outputs, not from its child. To address that case, we should exclude Generate outputs when aliasing in nested column pruning.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Unit test.

Closes #27702 from viirya/fix-nested-pruning.

Lead-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Co-authored-by: Liang-Chi Hsieh <liangchi@uber.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-02-28 12:29:46 +09:00
Kent Yao 2d2706cb86 [SPARK-30956][SQL][TESTS] Use intercept instead of try-catch to assert failures in IntervalUtilsSuite
### What changes were proposed in this pull request?

In this PR, I addressed the comment from https://github.com/apache/spark/pull/27672#discussion_r383719562 to use `intercept` instead of a `try-catch` block to assert failures in `IntervalUtilsSuite`.
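
A hedged, self-contained sketch of the pattern change (hypothetical assertion, not the actual suite code):

```
import org.scalatest.Assertions._

// Before: try-catch, where forgetting fail() lets a non-throwing call pass silently.
try {
  Integer.parseInt("not a number")
  fail("expected NumberFormatException")
} catch {
  case e: NumberFormatException => assert(e.getMessage.contains("not a number"))
}

// After: intercept fails the test if no exception (or the wrong type) is thrown.
val e = intercept[NumberFormatException] {
  Integer.parseInt("not a number")
}
assert(e.getMessage.contains("not a number"))
```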

### Why are the changes needed?

improve tests
### Does this PR introduce any user-facing change?

no

### How was this patch tested?

Nah

Closes #27700 from yaooqinn/intervaltest.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-02-27 23:12:35 +09:00
beliefer 1515d45b8d [SPARK-27924][SQL][FOLLOW-UP] Improve ANSI SQL Boolean-Predicate
### What changes were proposed in this pull request?
This PR follows https://github.com/apache/spark/pull/25074 and improves the implementation.

### Why are the changes needed?
Improve code.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Existing UTs.

Closes #27699 from beliefer/improve-boolean-test.

Authored-by: beliefer <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-27 13:42:02 +08:00
beliefer 825d3dc11b [SPARK-30841][SQL][DOC] Add version information to the configuration of SQL
### What changes were proposed in this pull request?
Add version information to the configuration of Spark SQL.
Note: Because SQLConf has a lot of configuration items, I split them into two PRs. Another PR will follow this one.

I sorted out some information, shown below.

Item name | Since version | JIRA ID | Commit ID | Note
-- | -- | -- | -- | --
spark.sql.analyzer.maxIterations | 3.0.0 | SPARK-30138 | c2f29d5ea58eb4565cc5602937d6d0bb75558513#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.optimizer.excludedRules | 2.4.0 | SPARK-24802 | 434319e73f8cb6e080671bdde42a72228bd814ef#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.optimizer.maxIterations | 2.0.0 | SPARK-14677 | f4be0946af219379fb2476e6f80b2e50463adeb2#diff-32bb9518401c0948c5ea19377b5069ab |  
spark.sql.optimizer.inSetConversionThreshold | 2.0.0 | SPARK-14796 | 3647120a5a879edf3a96a5fd68fb7aa849ad57ef#diff-32bb9518401c0948c5ea19377b5069ab |  
spark.sql.optimizer.inSetSwitchThreshold | 3.0.0 | SPARK-26205 | 0c23a39384b7ae5fb4aeb4f7f6fe72007b84bbd2#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.optimizer.planChangeLog.level | 3.0.0 | SPARK-25415 | 8b702e1e0aba1d3e4b0aa582f20cf99f80a44a09#diff-9a6b543db706f1a90f790783d6930a13 | This configuration does not exist in branch-2.4 branch, but from the branch-3.0 git log, it is found that the version number of the pom.xml file is 2.4.0-SNAPSHOT
spark.sql.optimizer.planChangeLog.rules | 3.0.0 | SPARK-25415 | 8b702e1e0aba1d3e4b0aa582f20cf99f80a44a09#diff-9a6b543db706f1a90f790783d6930a13 | This configuration does not exist in branch-2.4 branch, but from the branch-3.0 git log, it is found that the version number of the pom.xml file is 2.4.0-SNAPSHOT
spark.sql.optimizer.planChangeLog.batches | 3.0.0 | SPARK-27088 | 074533334d01afdd7862a1ac6c5a7a672bcce3f8#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.optimizer.dynamicPartitionPruning.enabled | 3.0.0 | SPARK-11150 | a7a3935c97d1fe6060cae42bbc9229c087b648ab#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.optimizer.dynamicPartitionPruning.useStats | 3.0.0 | SPARK-11150 | a7a3935c97d1fe6060cae42bbc9229c087b648ab#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.optimizer.dynamicPartitionPruning.fallbackFilterRatio | 3.0.0 | SPARK-11150 | a7a3935c97d1fe6060cae42bbc9229c087b648ab#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcastOnly | 3.0.0 | SPARK-30528 | 59a13c9b7bc3b3aa5b5bc30a60344f849c0f8012#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.inMemoryColumnarStorage.compressed | 1.0.1 | SPARK-2631 | 86534d0f5255362618c05a07b0171ec35c915822#diff-41ef65b9ef5b518f77e2a03559893f4d |  
spark.sql.inMemoryColumnarStorage.batchSize | 1.1.1 | SPARK-2650 | 779d1eb26d0f031791e93c908d51a59c3b422a55#diff-41ef65b9ef5b518f77e2a03559893f4d |  
spark.sql.inMemoryColumnarStorage.partitionPruning | 1.2.0 | SPARK-2961 | 248067adbe90f93c7d5e23aa61b3072dfdf48a8a#diff-41ef65b9ef5b518f77e2a03559893f4d |  
spark.sql.inMemoryTableScanStatistics.enable | 3.0.0 | SPARK-28257 | 42b80ae128ab1aa8a87c1376fe88e2cde52e6e4f#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.inMemoryColumnarStorage.enableVectorizedReader | 2.3.1 | SPARK-23312 | e5e9f9a430c827669ecfe9d5c13cc555fc89c980#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.columnVector.offheap.enabled | 2.3.0 | SPARK-20101 | 572af5027e45ca96e0d283a8bf7c84dcf476f9bc#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.join.preferSortMergeJoin | 2.0.0 | SPARK-13977 | 9c23c818ca0175c8f2a4a66eac261ec251d27c97#diff-32bb9518401c0948c5ea19377b5069ab |  
spark.sql.sort.enableRadixSort | 2.0.0 | SPARK-14724 | e2b5647ab92eb478b3f7b36a0ce6faf83e24c0e5#diff-32bb9518401c0948c5ea19377b5069ab |
spark.sql.autoBroadcastJoinThreshold | 1.1.0 | SPARK-2393 | c7db274be79f448fda566208946cb50958ea9b1a#diff-41ef65b9ef5b518f77e2a03559893f4d |  
spark.sql.limit.scaleUpFactor | 2.1.1 | SPARK-19944 | 80ebca62cbdb7d5c8606e95a944164ab1a943694#diff-9a6b543db706f1a90f790783d6930a13 |
spark.sql.hive.advancedPartitionPredicatePushdown.enabled | 2.3.0 | SPARK-20331 | d8cada8d1d3fce979a4bc1f9879593206722a3b9#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.shuffle.partitions | 1.1.0 | SPARK-1508 | 08ed9ad81397b71206c4dc903bfb94b6105691ed#diff-41ef65b9ef5b518f77e2a03559893f4d |  
spark.sql.adaptive.enabled | 1.6.0 | SPARK-9858 and SPARK-9859 and SPARK-9861 | d728d5c98658c44ed2949b55d36edeaa46f8c980#diff-41ef65b9ef5b518f77e2a03559893f4d |
spark.sql.adaptive.forceApply | 3.0.0 | SPARK-30719 | b29cb1a82b1a1facf1dd040025db93d998dad4cd#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.adaptive.shuffle.reducePostShufflePartitions | 3.0.0 | SPARK-30812 | b76bc0b1b8b2abd00a84f805af90ca4c5925faaa#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.adaptive.shuffle.fetchShuffleBlocksInBatch | 3.0.0 | SPARK-30812 | b76bc0b1b8b2abd00a84f805af90ca4c5925faaa#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.adaptive.shuffle.minNumPostShufflePartitions | 3.0.0 | SPARK-9853 | 8616109061efc5b23b24bb9ec4a3c0f2745903c1#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.adaptive.shuffle.targetPostShuffleInputSize | 1.6.0 | SPARK-9858 and SPARK-9859 and SPARK-9861 | d728d5c98658c44ed2949b55d36edeaa46f8c980#diff-41ef65b9ef5b518f77e2a03559893f4d |  
spark.sql.adaptive.shuffle.maxNumPostShufflePartitions | 3.0.0 | SPARK-9853 | 8616109061efc5b23b24bb9ec4a3c0f2745903c1#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.adaptive.shuffle.localShuffleReader.enabled | 3.0.0 | SPARK-29893 | 6e581cf164c3a2930966b270ac1406dc1195c942#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.adaptive.skewedJoinOptimization.enabled | 3.0.0 | SPARK-30812 | b76bc0b1b8b2abd00a84f805af90ca4c5925faaa#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.adaptive.skewedJoinOptimization.skewedPartitionFactor | 3.0.0 | SPARK-30812 | 5b36cdbbfef147e93b35eaa4f8e0bea9690b6d06#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.adaptive.nonEmptyPartitionRatioForBroadcastJoin | 3.0.0 | SPARK-9853 and SPARK-29002 | 8616109061efc5b23b24bb9ec4a3c0f2745903c1#diff-9a6b543db706f1a90f790783d6930a13 and b2f06608b785f577999318c00f2c315f39d90889#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.subexpressionElimination.enabled | 1.6.0 | SPARK-10371 | f38509a763816f43a224653fe65e4645894c9fc4#diff-41ef65b9ef5b518f77e2a03559893f4d |  
spark.sql.caseSensitive | 1.4.0 | SPARK-4699 | 21bd7222e55b9cf684c072141998a0623a69f514#diff-41ef65b9ef5b518f77e2a03559893f4d |  
spark.sql.constraintPropagation.enabled | 2.2.0 | SPARK-19846 | e011004bedca47be998a0c14fe22a6f9bb5090cd#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.parser.escapedStringLiterals | 2.2.1 | SPARK-20399 | 3d1908fd58fd9b1970cbffebdb731bfe4c776ad9#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.sources.fileCompressionFactor | 2.3.1 | SPARK-22790 | 0fc5533e53ad03eb67590ddd231f40c2713150c3#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.parquet.mergeSchema | 1.5.0 | SPARK-8690 | 246265f2bb056d5e9011d3331b809471a24ff8d7#diff-41ef65b9ef5b518f77e2a03559893f4d |  
spark.sql.parquet.respectSummaryFiles | 1.5.0 | SPARK-8838 | 6175d6cfe795fbd88e3ee713fac375038a3993a8#diff-41ef65b9ef5b518f77e2a03559893f4d |  
spark.sql.parquet.binaryAsString | 1.1.1 | SPARK-2927 | de501e169f24e4573747aec85b7651c98633c028#diff-41ef65b9ef5b518f77e2a03559893f4d |  
spark.sql.parquet.int96AsTimestamp | 1.3.0 | SPARK-4987 | 67d52207b5cf2df37ca70daff2a160117510f55e#diff-41ef65b9ef5b518f77e2a03559893f4d |  
spark.sql.parquet.int96TimestampConversion | 2.3.0 | SPARK-12297 | acf7ef3154e094875fa89f30a78ab111b267db91#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.parquet.outputTimestampType | 2.3.0 | SPARK-10365 | 21a7bfd5c324e6c82152229f1394f26afeae771c#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.parquet.compression.codec | 1.1.1 | SPARK-3131 | 3a9d874d7a46ab8b015631d91ba479d9a0ba827f#diff-41ef65b9ef5b518f77e2a03559893f4d |  
spark.sql.parquet.filterPushdown | 1.2.0 | SPARK-4391 | 576688aa2a19bd4ba239a2b93af7947f983e5124#diff-41ef65b9ef5b518f77e2a03559893f4d |  
spark.sql.parquet.filterPushdown.date | 2.4.0 | SPARK-23727 | b02e76cbffe9e589b7a4e60f91250ca12a4420b2#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.parquet.filterPushdown.timestamp | 2.4.0 | SPARK-24718 | 43e4e851b642bbee535d22e1b9e72ec6b99f6ed4#diff-9a6b543db706f1a90f790783d6930a13 |
spark.sql.parquet.filterPushdown.decimal | 2.4.0 | SPARK-24549 | 9549a2814951f9ba969955d78ac4bd2240f85989#diff-9a6b543db706f1a90f790783d6930a13 |
spark.sql.parquet.filterPushdown.string.startsWith | 2.4.0 | SPARK-24638 | 03545ce6de08bd0ad685c5f59b73bc22dfc40887#diff-9a6b543db706f1a90f790783d6930a13 | 
spark.sql.parquet.pushdown.inFilterThreshold | 2.4.0 | SPARK-17091 | e1de34113e057707dfc5ff54a8109b3ec7c16dfb#diff-9a6b543db706f1a90f790783d6930a13 |  
spark.sql.parquet.writeLegacyFormat | 1.6.0 | SPARK-10400 | 01cd688f5245cbb752863100b399b525b31c3510#diff-41ef65b9ef5b518f77e2a03559893f4d |  
spark.sql.parquet.output.committer.class | 1.5.0 | SPARK-8139 | 111d6b9b8a584b962b6ae80c7aa8c45845ce0099#diff-41ef65b9ef5b518f77e2a03559893f4d |  
spark.sql.parquet.enableVectorizedReader | 2.0.0 | SPARK-13486 | 2b2c8c33236677c916541f956f7b94bba014a9ce#diff-32bb9518401c0948c5ea19377b5069ab |
spark.sql.parquet.recordLevelFilter.enabled | 2.3.0 | SPARK-17310 | 673c67046598d33b9ecf864024ca7a937c1998d6#diff-9a6b543db706f1a90f790783d6930a13 |
spark.sql.parquet.columnarReaderBatchSize | 2.4.0 | SPARK-23188 | cc41245fa3f954f961541bf4b4275c28473042b8#diff-9a6b543db706f1a90f790783d6930a13 |  

### Why are the changes needed?
Supplemental configuration version information.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Existing UTs.

Closes #27691 from beliefer/add-version-to-sql-config-part-one.

Authored-by: beliefer <beliefer@163.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-02-27 10:58:44 +09:00
iRakson c913b9d8b5 [SPARK-27619][SQL] MapType should be prohibited in hash expressions
### What changes were proposed in this pull request?
`hash()` and `xxhash64()` cannot be used on elements of `MapType`. A new configuration `spark.sql.legacy.useHashOnMapType` is introduced to allow users to restore the previous behaviour.

When `spark.sql.legacy.useHashOnMapType` is set to false:

```
scala> spark.sql("select hash(map())");
org.apache.spark.sql.AnalysisException: cannot resolve 'hash(map())' due to data type mismatch: input to function hash cannot contain elements of MapType; line 1 pos 7;
'Project [unresolvedalias(hash(map(), 42), None)]
+- OneRowRelation
```

when `spark.sql.legacy.useHashOnMapType` is set to true :

```
scala> spark.sql("set spark.sql.legacy.useHashOnMapType=true");
res3: org.apache.spark.sql.DataFrame = [key: string, value: string]

scala> spark.sql("select hash(map())").first()
res4: org.apache.spark.sql.Row = [42]

```

### Why are the changes needed?

As discussed in the JIRA, Spark SQL's map hash codes depend on the insertion order, which is not consistent with normal Scala behaviour and might confuse users.
Code snippet from JIRA :
```
val a = spark.createDataset(Map(1->1, 2->2) :: Nil)
val b = spark.createDataset(Map(2->2, 1->1) :: Nil)

// Demonstration of how Scala Map equality is unaffected by insertion order:
assert(Map(1->1, 2->2).hashCode() == Map(2->2, 1->1).hashCode())
assert(Map(1->1, 2->2) == Map(2->2, 1->1))
assert(a.first() == b.first())

// In contrast, this will print two different hashcodes:
println(Seq(a, b).map(_.selectExpr("hash(*)").first()))
```

Also `MapType` is prohibited for aggregation / joins / equality comparisons #7819 and set operations #17236.

### Does this PR introduce any user-facing change?
Yes. Now users cannot use hash functions on elements of `mapType`. To restore the previous behaviour set `spark.sql.legacy.useHashOnMapType` to true.

### How was this patch tested?
UT added.

Closes #27580 from iRakson/SPARK-27619.

Authored-by: iRakson <raksonrakesh@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-27 01:48:12 +08:00
Terry Kim 73305475c1 [SPARK-30782][SQL] Column resolution doesn't respect current catalog/namespace for v2 tables
### What changes were proposed in this pull request?

This PR proposes to fix an issue where qualified columns are not matched for v2 tables if current catalog/namespace are used.

For v1 tables, you can currently perform the following:
```SQL
SELECT default.t.id FROM t;
```

For v2 tables, the following fails:
```SQL
USE testcat.ns1.ns2;
SELECT testcat.ns1.ns2.t.id FROM t;

org.apache.spark.sql.AnalysisException: cannot resolve '`testcat.ns1.ns2.t.id`' given input columns: [t.id, t.point]; line 1 pos 7;
```

### Why are the changes needed?

It is a bug since qualified column names cannot match if current catalog/namespace are used.

### Does this PR introduce any user-facing change?

Yes, now the following works:
```SQL
USE testcat.ns1.ns2;
SELECT testcat.ns1.ns2.t.id FROM t;
```

### How was this patch tested?

Added new tests

Closes #27532 from imback82/qualifed_col_respect_current.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-27 00:21:38 +08:00
gatorsmile 28b8713036 [SPARK-30950][BUILD] Setting version to 3.1.0-SNAPSHOT
### What changes were proposed in this pull request?
This patch is to bump the master branch version to 3.1.0-SNAPSHOT.

### Why are the changes needed?
N/A

### Does this PR introduce any user-facing change?
N/A

### How was this patch tested?
N/A

Closes #27698 from gatorsmile/updateVersion.

Authored-by: gatorsmile <gatorsmile@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-02-25 19:44:31 -08:00
Wenchen Fan 8f247e5d36 [SPARK-30918][SQL] improve the splitting of skewed partitions
### What changes were proposed in this pull request?

Use the average size of the non-skewed partitions as the target size when splitting skewed partitions, instead of ADAPTIVE_EXECUTION_SKEWED_PARTITION_SIZE_THRESHOLD

### Why are the changes needed?

The goal of skew join optimization is to make the data distribution more even. So it makes more sense to use the average size of the non-skewed partitions as the target size.
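
A back-of-the-envelope sketch of the new target size (the sizes are made up, and the real rule also considers skew thresholds and the advisory partition size):

```
val partitionSizes = Seq(20L, 25L, 30L, 4000L)    // sizes in MB; one heavily skewed partition
val nonSkewed = partitionSizes.filter(_ < 1000L)  // illustrative skew cut-off
val targetSize = nonSkewed.sum / nonSkewed.size   // 25 MB, used when splitting the 4000 MB partition
```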

### Does this PR introduce any user-facing change?

no

### How was this patch tested?

existing tests

Closes #27669 from cloud-fan/aqe.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
2020-02-25 14:10:29 -08:00
Maxim Gekk ffc0935e64 [SPARK-30869][SQL] Convert dates to/from timestamps in microseconds precision
### What changes were proposed in this pull request?
In the PR, I propose to replace:

1. `millisToDays()` by `microsToDays()`, which accepts microseconds since the epoch and returns days since the epoch in the specified time zone. The latter is the internal representation of Catalyst's DateType.
2. `daysToMillis()` by `daysToMicros()`, which accepts days since the epoch in some time zone and returns the number of microseconds since the epoch. The latter is the internal representation of Catalyst's TimestampType.
3. `fromMillis()` by `millisToMicros()`
4. `toMillis()` by `microsToMillis()`

### Why are the changes needed?
Spark stores timestamps with microsecond precision, so there is no actual need to convert dates to milliseconds and then to microseconds. As examples, look at the DateTimeUtils functions `monthsBetween()` and `truncTimestamp()`.
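
A rough arithmetic sketch of the direct day/microsecond conversion (UTC only, made-up helper names; the real `DateTimeUtils` methods also take the time zone into account):

```
val MICROS_PER_DAY = 24L * 60 * 60 * 1000 * 1000

def daysToMicrosUtc(days: Int): Long = days * MICROS_PER_DAY
def microsToDaysUtc(micros: Long): Int = Math.floorDiv(micros, MICROS_PER_DAY).toInt

// e.g. 2020-01-01 is epoch day 18262, with no intermediate millisecond step:
assert(microsToDaysUtc(daysToMicrosUtc(18262)) == 18262)
```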

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
By existing test suites UnivocityParserSuite, DateExpressionsSuite, ComputeCurrentTimeSuite, DateTimeUtilsSuite, DateFunctionsSuite, JsonSuite, StreamSuite.

Closes #27618 from MaxGekk/replace-millis-by-micros.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-25 23:05:28 +08:00
Kent Yao 761209c1f2 [SPARK-30919][SQL] Make interval multiply and divide's overflow behavior consistent with other operations
### What changes were proposed in this pull request?

The current behavior of interval multiply and divide follows the ANSI SQL standard when overflow occurs; it is compatible with other operations when `spark.sql.ansi.enabled` is true, but not when `spark.sql.ansi.enabled` is false.

When `spark.sql.ansi.enabled` is false, since the factor is a double value, we use Java's rounding/truncation behavior for casting doubles to integrals. When dividing by zero, the result is `null`. We also follow the natural rules for intervals as defined in the Gregorian calendar, so we do not add the month fraction to days, but we do add the day fraction to microseconds.
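
A hedged sketch of the non-ANSI rounding described above (a made-up helper, not the actual `IntervalUtils` code):

```
val MICROS_PER_DAY = 24L * 60 * 60 * 1000 * 1000

def multiplySketch(months: Int, days: Int, micros: Long, factor: Double): (Int, Int, Long) = {
  val totalDays = days * factor
  val wholeDays = totalDays.toInt                                      // Java-style truncation
  val carriedMicros = ((totalDays - wholeDays) * MICROS_PER_DAY).toLong
  ((months * factor).toInt, wholeDays, (micros * factor).toLong + carriedMicros)
}

// 1 month 1 day times 1.5: the month fraction is dropped, while the day
// fraction carries into microseconds (12 hours).
// multiplySketch(1, 1, 0L, 1.5) == (1, 1, 43200000000L)
```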

### Why are the changes needed?

Make interval multiply and divide's overflow behavior consistent with other interval operations

### Does this PR introduce any user-facing change?

no, these are new features in 3.0

### How was this patch tested?

add uts

Closes #27672 from yaooqinn/SPARK-30919.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-25 22:19:24 +08:00
Josh Rosen f152d2a0a8 [SPARK-30944][BUILD] Update URL for Google Cloud Storage mirror of Maven Central
### What changes were proposed in this pull request?

This PR is a followup to #27307: per https://travis-ci.community/t/maven-builds-that-use-the-gcs-maven-central-mirror-should-update-their-paths/5926, the Google Cloud Storage mirror of Maven Central has updated its URLs: the new paths are updated more frequently. The new paths are listed on https://storage-download.googleapis.com/maven-central/index.html

This patch updates our build files to use these new URLs.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Existing build + tests.

Closes #27688 from JoshRosen/update-gcs-mirror-url.

Authored-by: Josh Rosen <joshrosen@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-02-25 17:04:13 +09:00
Peter Toth 1a4e2423b2 [SPARK-30870][SQL] Column pruning shouldn't alias a nested column if it means the whole structure
### What changes were proposed in this pull request?
This PR fixes a bug in nested column aliasing by taking the data type of the referenced nested fields into account when calculating the number of extracted columns. After this PR this query runs without issues:
```
SELECT explodedvalue.*
FROM VALUES array(named_struct('nested', named_struct('a', 1, 'b', 2))) AS (value)
LATERAL VIEW explode(value) AS explodedvalue
```
This is a regression from Spark 2.4.

### Why are the changes needed?
To fix a bug.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Added new UT.

Closes #27675 from peter-toth/SPARK-30870.

Authored-by: Peter Toth <peter.toth@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-02-24 13:46:21 -08:00
beliefer 621e37e2ab [SPARK-28880][SQL] Support ANSI nested bracketed comments
### What changes were proposed in this pull request?
Spark SQL supports single-line comments and bracketed comments now. This PR adds support for nested bracketed comments.

Some mainstream databases support this syntax.
**PostgreSQL:**
https://www.postgresql.org/docs/11/sql-syntax-lexical.html#SQL-SYNTAX-COMMENTS

**Vertica:**
https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/LanguageElements/Expressions/Comments.htm?zoom_highlight=comments

Note: Because Spark SQL has no UTs for single-line comments and bracketed comments, I added some UTs for them.
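
A minimal sketch of the newly supported syntax in spark-shell (the literal query is just an illustration):

```
// A bracketed comment nested inside another bracketed comment:
spark.sql("SELECT /* outer /* nested */ still part of the outer comment */ 1").show()
```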

### Why are the changes needed?
Nested bracketed comments are part of the ANSI SQL standard.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
New UT

Closes #27495 from beliefer/nested-brancket-comments.

Authored-by: beliefer <beliefer@163.com>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
2020-02-24 00:28:46 -08:00
Peter Toth a372f76cbd [SPARK-30898][SQL] The behavior of MakeDecimal should not depend on SQLConf.get
### What changes were proposed in this pull request?
This PR adds a new `nullOnOverflow` parameter to `MakeDecimal` so as to avoid its value depending on `SQLConf.get` and change during planning.
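
A hedged sketch of the pattern (hypothetical names, not the actual Catalyst classes): the flag is captured when the expression is constructed, so a later configuration change cannot alter an already-planned expression.

```
// Hypothetical, simplified stand-in for the expression:
case class MakeDecimalSketch(unscaled: Long, precision: Int, scale: Int, nullOnOverflow: Boolean)

// Planning time is the only place the configuration is consulted (illustrative):
// val expr = MakeDecimalSketch(rawUnscaledValue, p, s, nullOnOverflow = !SQLConf.get.ansiEnabled)
```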

### Why are the changes needed?
This avoids the issue of the configuration changing between different phases of planning, which can silently break a query plan and lead to crashes or data corruption.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Existing UTs.

Closes #27656 from peter-toth/SPARK-30898.

Authored-by: Peter Toth <peter.toth@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-02-24 16:48:48 +09:00
Peter Toth 612f63f39e [SPARK-30897][SQL] The behavior of ArrayExists should not depend on SQLConf.get
### What changes were proposed in this pull request?
This PR adds a new `followThreeValuedLogic` parameter to `ArrayExists` so as to avoid its value depending on `SQLConf.get` and change during planning.

### Why are the changes needed?
This avoids the issue of the configuration changing between different phases of planning, which can silently break a query plan and lead to crashes or data corruption.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Existing UTs.

Closes #27655 from peter-toth/SPARK-30897.

Authored-by: Peter Toth <peter.toth@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-02-24 16:47:08 +09:00
Burak Yavuz 4ff2718d54 [SPARK-30924][SQL][3.0] Add additional checks to Merge Into
### What changes were proposed in this pull request?

Merge Into is currently missing additional validation around:

 1. The lack of any WHEN statements
 2. The first WHEN MATCHED statement needs to have a condition if there are two WHEN MATCHED statements.
 3. Single use of UPDATE/DELETE

This PR introduces these validations.
(1) is required, because otherwise the MERGE statement is useless.
(2) is required, because otherwise the second WHEN MATCHED condition becomes dead code
(3) is up for debate, but the idea there is that a single expression should be sufficient to specify when you would like to update or delete your records. We restrict it for now to reduce surface area and ambiguity.
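
A hedged sketch of a MERGE statement that satisfies all three checks (table and column names are made up, and the target must live in a data source that supports MERGE INTO): at least one WHEN clause, the first of the two WHEN MATCHED clauses carries a condition, and UPDATE and DELETE each appear only once.

```
spark.sql("""
  MERGE INTO target t
  USING source s
  ON t.id = s.id
  WHEN MATCHED AND s.op = 'delete' THEN DELETE
  WHEN MATCHED THEN UPDATE SET t.value = s.value
  WHEN NOT MATCHED THEN INSERT (id, value) VALUES (s.id, s.value)
""")
```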

### Why are the changes needed?

To ease DataSource developers when building implementations for MERGE

### Does this PR introduce any user-facing change?

Adds additional validation checks

### How was this patch tested?

Unit tests

Closes #27677 from brkyvz/mergeChecks.

Authored-by: Burak Yavuz <brkyvz@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-24 15:16:37 +08:00
jiake f4696ba252 [SPARK-30922][SQL] remove the max splits config in skewed join
### What changes were proposed in this pull request?
When the skewed join optimization splits into more skewed readers, the plan may become very large and cannot be shown in the UI quickly. The config `spark.sql.adaptive.skewedJoinOptimization.skewedPartitionMaxSplits` was added to resolve this UI issue. After [PR#27493](https://github.com/apache/spark/pull/27493) combined the skewed readers into one, we no longer need this config.

### Why are the changes needed?
remove the unnecessary config

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
existing test

Closes #27673 from JkSelf/removeMaxSplitNum.

Authored-by: jiake <ke.a.jia@intel.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-24 14:29:25 +08:00
Maxim Gekk c41ef39819 [SPARK-30925][SQL] Prevent overflow/round errors in conversions of milliseconds to/from microseconds
### What changes were proposed in this pull request?
- Use `Math.multiplyExact()` in `DateTimeUtils.fromMillis()` to prevent silent overflow in conversion milliseconds to microseconds.
- Use `DateTimeUtils.fromMillis()` in all places where milliseconds are converted to microseconds
- Use `DateTimeUtils.toMillis()` in all places where microseconds are converted to milliseconds

### Why are the changes needed?

1. To prevent silent arithmetic overflow while multiplying by 1000 in `fromMillis()`. Instead, a `new ArithmeticException("long overflow")` will be thrown and handled accordingly.
2. To correctly round microseconds in the conversion to milliseconds. For example, `1965-01-01 10:11:12.123456` is represented as `-157700927876544` in microsecond precision. In millisecond precision the above needs to be represented as `-157700927877` or `1965-01-01 10:11:12.123` (see the sketch below).
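
A rough sketch of both points, using plain `java.lang.Math` (not the actual `DateTimeUtils` code):

```
val MICROS_PER_MILLIS = 1000L

// 1. Overflow-safe millis -> micros: throws ArithmeticException("long overflow")
//    instead of silently wrapping around.
def millisToMicros(millis: Long): Long = Math.multiplyExact(millis, MICROS_PER_MILLIS)

// 2. Micros -> millis rounds toward negative infinity, so
//    -157700927876544 us becomes -157700927877 ms (1965-01-01 10:11:12.123).
def microsToMillis(micros: Long): Long = Math.floorDiv(micros, MICROS_PER_MILLIS)
```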

### Does this PR introduce any user-facing change?
Yes

### How was this patch tested?
By `TimestampFormatterSuite`, `CastSuite`, `DateExpressionsSuite`, `IntervalExpressionsSuite`, `ExpressionParserSuite`, `ExpressionParserSuite`, `DateTimeUtilsSuite`, `IntervalUtilsSuite`

Closes #27676 from MaxGekk/millis-2-micros-overflow.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-24 14:06:25 +08:00
Maxim Gekk 310c14ac8d [MINOR][SQL] Add a comment for removedSQLConfigs
### What changes were proposed in this pull request?
In the PR, I propose to explain in the description of `removedSQLConfigs` when removed SQL configs should NOT be placed to the map.

### Why are the changes needed?
To make the cases when SQL configs should be added to `removedSQLConfigs` more clear. Recently, `spark.sql.variable.substitute.depth` was removed from the map by #27646 because it contradicts the condition described by the PR.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
By `./dev/scalastyle`

Closes #27653 from MaxGekk/removedSQLConfigs-comment.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-02-22 09:48:10 +09:00
beliefer 59d6d5cbb0 [SPARK-30840][CORE][SQL] Add version property for ConfigEntry and ConfigBuilder
### What changes were proposed in this pull request?
Spark's `ConfigEntry` and `ConfigBuilder` are missing the Spark version in which each configuration was added. This is inconvenient for Spark users when they visit the Spark configuration page.
http://spark.apache.org/docs/latest/configuration.html
The new Spark SQL config docs looks like:
![SQL configuration screenshot](https://user-images.githubusercontent.com/8486025/74604522-cb882f00-50f9-11ea-8683-57a90f9e3347.png)

```
> SET -v
spark.sql.adaptive.enabled      false   When true, enable adaptive query execution.
spark.sql.adaptive.nonEmptyPartitionRatioForBroadcastJoin       0.2     The relation with a non-empty partition ratio lower than this config will not be considered as the build side of a broadcast-hash join in adaptive execution regardless of its size.This configuration only has an effect when 'spark.sql.adaptive.enabled' is enabled.
spark.sql.adaptive.optimizeSkewedJoin.enabled   true    When true and adaptive execution is enabled, a skewed join is automatically handled at runtime.
spark.sql.adaptive.optimizeSkewedJoin.skewedPartitionFactor     10      A partition is considered as a skewed partition if its size is larger than this factor multiple the median partition size and also larger than  spark.sql.adaptive.optimizeSkewedJoin.skewedPartitionSizeThreshold
spark.sql.adaptive.optimizeSkewedJoin.skewedPartitionMaxSplits  5       Configures the maximum number of task to handle a skewed partition in adaptive skewedjoin.
spark.sql.adaptive.optimizeSkewedJoin.skewedPartitionSizeThreshold      64MB    Configures the minimum size in bytes for a partition that is considered as a skewed partition in adaptive skewed join.
spark.sql.adaptive.shuffle.fetchShuffleBlocksInBatch.enabled    true    Whether to fetch the continuous shuffle blocks in batch. Instead of fetching blocks one by one, fetching continuous shuffle blocks for the same map task in batch can reduce IO and improve performance. Note, multiple continuous blocks exist in single fetch request only happen when 'spark.sql.adaptive.enabled' and 'spark.sql.adaptive.shuffle.reducePostShufflePartitions.enabled' is enabled, this feature also depends on a relocatable serializer, the concatenation support codec in use and the new version shuffle fetch protocol.
spark.sql.adaptive.shuffle.localShuffleReader.enabled   true    When true and 'spark.sql.adaptive.enabled' is enabled, this enables the optimization of converting the shuffle reader to local shuffle reader for the shuffle exchange of the broadcast hash join in probe side.
spark.sql.adaptive.shuffle.maxNumPostShufflePartitions  <undefined>     The advisory maximum number of post-shuffle partitions used in adaptive execution. This is used as the initial number of pre-shuffle partitions. By default it equals to spark.sql.shuffle.partitions. This configuration only has an effect when 'spark.sql.adaptive.enabled' and 'spark.sql.adaptive.shuffle.reducePostShufflePartitions.enabled' is enabled.
```

**Note**: Because there are so many exposed configuration items and they require a lot of work to finish, I will add the version numbers of these configuration items in another PR.

### Why are the changes needed?
Supplements the configuration version information.

### Does this PR introduce any user-facing change?
Yes

### How was this patch tested?
Existing UTs

Closes #27592 from beliefer/add-version-to-config.

Authored-by: beliefer <beliefer@163.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-02-22 09:46:42 +09:00
Eric Wu 1f0300fb16 [SPARK-30764][SQL] Improve the readability of EXPLAIN FORMATTED style
### What changes were proposed in this pull request?
The style of `EXPLAIN FORMATTED` output needs to be improved. We’ve already got some observations/ideas in
https://github.com/apache/spark/pull/27368#discussion_r376694496
https://github.com/apache/spark/pull/27368#discussion_r376927143

Observations/Ideas:
1. Using a comma as the separator is not clear, especially since commas are also used inside the expressions.
2. Show the column counts first? For example, `Results [4]: …`
3. Currently the attribute names are automatically generated; this needs to be refined.
4. Add arguments field in common implementations as `EXPLAIN EXTENDED` did by calling `argString` in `TreeNode.simpleString`. This will eliminate most existing minor differences between
`EXPLAIN EXTENDED` and `EXPLAIN FORMATTED`.
5. Another improvement we can make: the generated alias shouldn't include the attribute id. `collect_set(val, 0, 0)#123` looks clearer than `collect_set(val#456, 0, 0)#123`.

This PR is currently addressing comments 2 & 4, and open for more discussions on improving readability.

### Why are the changes needed?
The readability of `EXPLAIN FORMATTED` needs to be improved, which will help users better understand the query plan.

### Does this PR introduce any user-facing change?
Yes, `EXPLAIN FORMATTED` output style changed.

### How was this patch tested?
Updated expected results of test cases in explain.sql.

Closes #27509 from Eric5553/ExplainFormattedRefine.

Authored-by: Eric Wu <492960551@qq.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-21 23:36:14 +08:00
Yuanjian Li a5efbb284e [SPARK-30809][SQL] Review and fix issues in SQL API docs
### What changes were proposed in this pull request?
- Add missing `since` annotation.
- Don't show classes under `org.apache.spark.sql.dynamicpruning` package in API docs.
- Fix the scope of `xxxExactNumeric` to remove it from the API docs.

### Why are the changes needed?
Avoid leaking APIs unintentionally in Spark 3.0.0.

### Does this PR introduce any user-facing change?
No. All these changes are to avoid leaking APIs unintentionally in Spark 3.0.0.

### How was this patch tested?
Manually generated the API docs and verified the above issues have been fixed.

Closes #27560 from xuanyuanking/SPARK-30809.

Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-21 17:03:22 +08:00
Maxim Gekk abe0821ee9 [SPARK-30894][SQL] Make Size's nullable independent from SQL config changes
### What changes were proposed in this pull request?
In the PR, I propose to add the `legacySizeOfNull` parameter to the `Size` expression, and to pass the value of `spark.sql.legacy.sizeOfNull` if `legacySizeOfNull` is not provided on creation of `Size`.
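A simplified, self-contained sketch of the pattern, assuming hypothetical types: a plain case class and a system-property lookup stand in for the real Catalyst `Expression` API and `SQLConf`. The point is that the flag is captured once, at construction time.

```scala
// Sketch only: Seq[Any] and sys.props stand in for the real child Expression and SQLConf.
final case class Size(child: Seq[Any], legacySizeOfNull: Boolean) {
  def eval(): Integer =
    if (child == null) {
      if (legacySizeOfNull) Integer.valueOf(-1) else null  // legacy: -1 for null input; new: null
    } else {
      Integer.valueOf(child.size)
    }
}

object Size {
  // Secondary-constructor analogue: the config is read only when the expression is created,
  // so later config changes during planning cannot alter this expression's behavior.
  def apply(child: Seq[Any]): Size =
    new Size(child, legacySizeOfNull = sys.props.getOrElse("spark.sql.legacy.sizeOfNull", "false").toBoolean)
}
```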

### Why are the changes needed?
This allows us to avoid the issue where the configuration changes between different phases of planning, which can silently break a query plan and lead to crashes or data corruption.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
By `CollectionExpressionsSuite`.

Closes #27658 from MaxGekk/Size-SQLConf-get-deps.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-21 15:32:11 +08:00
yi.wu 82ce4753aa [SPARK-26580][SQL][ML][FOLLOW-UP] Throw exception when use untyped UDF by default
### What changes were proposed in this pull request?

This PR proposes to throw an exception by default when users use the untyped UDF (a.k.a. `org.apache.spark.sql.functions.udf(AnyRef, DataType)`).

Users can still use it by setting `spark.sql.legacy.useUnTypedUdf.enabled` to `true`.

### Why are the changes needed?

According to #23498, since Spark 3.0 the untyped UDF returns the default value of the Java type if the input value is null. For example, for `val f = udf((x: Int) => x, IntegerType)`, `f($"x")` returns 0 in Spark 3.0 but null in Spark 2.4. This behavior change was introduced because Spark 3.0 is built with Scala 2.12 by default.

As a result, this might change data silently and may cause correctness issues if users still expect `null` in some cases. Thus, we'd better encourage users to use typed UDFs to avoid this problem.
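A hedged usage sketch for spark-shell; the config key is taken from this PR's description and may have been renamed before release:

```scala
import org.apache.spark.sql.functions.udf
import org.apache.spark.sql.types.IntegerType

// val f = udf((x: Int) => x, IntegerType)   // now throws by default after this change

// Opt back into the old behavior explicitly:
spark.conf.set("spark.sql.legacy.useUnTypedUdf.enabled", "true")
val f = udf((x: Int) => x, IntegerType)      // returns 0 (not null) for null input, as described above
```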

### Does this PR introduce any user-facing change?

Yes. Users will now hit an exception when using an untyped UDF.

### How was this patch tested?

Added test and updated some tests.

Closes #27488 from Ngone51/spark_26580_followup.

Lead-authored-by: yi.wu <yi.wu@databricks.com>
Co-authored-by: wuyi <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-21 14:46:54 +08:00
yi.wu 4d356554a6 [MINOR][SQL] Fix error position of NOSCAN
### What changes were proposed in this pull request?

Point to the correct position when a miswritten `NOSCAN` is detected.

### Why are the changes needed?

Before:

```
[info]   org.apache.spark.sql.catalyst.parser.ParseException: Expected `NOSCAN` instead of `SCAN`(line 1, pos 0)
[info]
[info] == SQL ==
[info] ANALYZE TABLE analyze_partition_with_null PARTITION (name) COMPUTE STATISTICS SCAN
[info] ^^^
```

After:

```
[info]   org.apache.spark.sql.catalyst.parser.ParseException: Expected `NOSCAN` instead of `SCAN`(line 1, pos 78)
[info]
[info] == SQL ==
[info] ANALYZE TABLE analyze_partition_with_null PARTITION (name) COMPUTE STATISTICS SCAN
[info] ------------------------------------------------------------------------------^^^
```

### Does this PR introduce any user-facing change?

Yes, users will see a better error message.

### How was this patch tested?

Manually tested.

Closes #27662 from Ngone51/fix_noscan_reference.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-02-21 15:21:53 +09:00
wuyi 5eb004f4bb Revert "[SPARK-28093][SQL] Fix TRIM/LTRIM/RTRIM function parameter order issue"
### What changes were proposed in this pull request?

This reverts commit bef5d9d6c3.

### Why are the changes needed?

Revert it according to https://github.com/apache/spark/pull/24902#issuecomment-584511167.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Pass Jenkins.

Closes #27540 from Ngone51/revert_spark_28093.

Lead-authored-by: wuyi <yi.wu@databricks.com>
Co-authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-21 12:55:32 +08:00
Maxim Gekk bb40ab09f4 [SPARK-30892][SQL] Exclude spark.sql.variable.substitute.depth from removedSQLConfigs
### What changes were proposed in this pull request?

Exclude the SQL config `spark.sql.variable.substitute.depth` from `SQLConf.removedSQLConfigs`

### Why are the changes needed?
By #27169, the config was placed in `SQLConf.removedSQLConfigs`. As a consequence, when a user sets it to a non-default value (1, for example), they will get an exception. That is acceptable for SQL configs that could impact behavior, but not for this particular config. Raising such an exception just makes migration to Spark 3.0 more difficult.

### Does this PR introduce any user-facing change?
Yes. Before the changes, users got an exception when they set `spark.sql.variable.substitute.depth` to a value different from `40`.

### How was this patch tested?
Run `spark.conf.set("spark.sql.variable.substitute.depth", 1)` in `spark-shell`.

Closes #27646 from MaxGekk/remove-substitute-depth-conf.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-21 00:44:09 +08:00
Maxim Gekk a551715fd2 [SPARK-29930][SPARK-30416][SQL][FOLLOWUP] Move deprecated/removed config checks from RuntimeConfig to SQLConf
### What changes were proposed in this pull request?
- Output warnings for deprecated SQL configs in `SQLConf.setConfWithCheck()` and in `SQLConf.unsetConf()`
- Throw an exception for removed SQL configs in `SQLConf.setConfWithCheck()` when they are set to non-default values
- Remove checking of deprecated and removed SQL configs from `RuntimeConfig`

### Why are the changes needed?
Currently, warnings/exceptions are printed only when a SQL config is set dynamically, for instance via `spark.conf.set()`. After the changes, removed/deprecated SQL configs will also be checked when they are set statically. For example:
```
$ bin/spark-shell --conf spark.sql.fromJsonForceNullableSchema=false
scala> spark.emptyDataFrame
java.lang.IllegalArgumentException: Error while instantiating 'org.apache.spark.sql.hive.HiveSessionStateBuilder':
...
Caused by: org.apache.spark.sql.AnalysisException: The SQL config 'spark.sql.fromJsonForceNullableSchema' was removed in the version 3.0.0. It was removed to prevent errors like SPARK-23173 for non-default value.
```
```
$ bin/spark-shell --conf spark.sql.hive.verifyPartitionPath=false
scala> spark.emptyDataFrame
20/02/20 02:10:26 WARN SQLConf: The SQL config 'spark.sql.hive.verifyPartitionPath' has been deprecated in Spark v3.0 and may be removed in the future. This config is replaced by 'spark.files.ignoreMissingFiles'.
```
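A minimal sketch of the intended check, with hypothetical helper types standing in for the real `SQLConf` internals (the actual code raises an `AnalysisException`; a plain exception keeps this sketch self-contained):

```scala
final case class RemovedConfig(key: String, version: String, defaultValue: String, comment: String)

// Deprecated configs only log a warning; removed configs fail on non-default values.
def setConfWithCheck(key: String, value: String,
                     removed: Map[String, RemovedConfig],
                     deprecated: Set[String],
                     store: (String, String) => Unit): Unit = {
  removed.get(key).foreach { c =>
    if (value != c.defaultValue) {
      throw new IllegalArgumentException(
        s"The SQL config '$key' was removed in the version ${c.version}. ${c.comment}")
    }
  }
  if (deprecated.contains(key)) {
    Console.err.println(s"WARN: The SQL config '$key' has been deprecated and may be removed in the future.")
  }
  store(key, value)
}
```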

### Does this PR introduce any user-facing change?
Yes

### How was this patch tested?
By `SQLConfSuite`

Closes #27645 from MaxGekk/remove-sql-configs-followup-2.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-21 00:00:48 +08:00
Maxim Gekk 4248b7fbb9 [SPARK-30858][SQL] Make IntegralDivide's dataType independent from SQL config changes
### What changes were proposed in this pull request?
In the PR, I propose to add the `returnLong` parameter to `IntegralDivide`, and to pass the value of `spark.sql.legacy.integralDivide.returnBigint` if `returnLong` is not provided on creation of `IntegralDivide`.
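This follows the same capture-at-construction pattern as the `Size` change above; a simplified sketch with hypothetical types, not the real Catalyst expression:

```scala
// The result type is fixed when the expression is built, so later config changes cannot alter it.
final case class IntegralDivide(left: Long, right: Long, returnLong: Boolean) {
  def eval(): Any = {
    val quotient = left / right
    if (returnLong) quotient else quotient.toInt  // legacy flag: always return bigint; otherwise follow the operand type
  }
}
```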

### Why are the changes needed?
This allows us to avoid the issue where the configuration changes between different phases of planning, which can silently break a query plan and lead to crashes or data corruption.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
By `ArithmeticExpressionSuite`.

Closes #27628 from MaxGekk/integral-divide-conf.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-20 21:26:20 +08:00
Gengliang Wang 92d5d40c8e [SPARK-30881][SQL][DOCS] Revise the doc of spark.sql.sources.parallelPartitionDiscovery.threshold
### What changes were proposed in this pull request?

Revise the doc of SQL configuration `spark.sql.sources.parallelPartitionDiscovery.threshold`.
### Why are the changes needed?

The doc of the configuration "spark.sql.sources.parallelPartitionDiscovery.threshold" is not accurate in the part "This applies to Parquet, ORC, CSV, JSON and LibSVM data sources".

We should revise it to state that the config is effective for all file-based data sources.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

None. It's just doc.

Closes #27639 from gengliangwang/reviseParallelPartitionDiscovery.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
2020-02-20 00:59:22 -08:00
herman c92d437c46 [SPARK-30811][SQL] CTE should not cause stack overflow when it refers to non-existent table with same name
### Why are the changes needed?
This ports the tests introduced in 7285eea683 to master to avoid future regressions.

### Background
A query with Common Table Expressions can cause a stack overflow when it contains a CTE that refers to a non-existing table with the same name. The name of the table needs to have a database qualifier. This is caused by a couple of things:

- CTESubstitution runs analysis on the CTE, but this does not throw an exception because the table has a database qualifier. The reason we don't fail is that we re-attempt to resolve the relation in a later rule;
- The CTESubstitution replacement logic does not check whether the table it is replacing has a database qualifier; it shouldn't replace the relation if it does. As a result, we happily replace `nonexist.t` with `t` (see the sketch below);
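A hypothetical reproduction sketch for spark-shell, based on the description above: a CTE named `t` that refers to the qualified, non-existent table `nonexist.t` could previously recurse until a stack overflow.

```scala
spark.sql("""
  WITH t AS (SELECT * FROM nonexist.t)
  SELECT * FROM t
""").show()
```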

Note that this is not an issue for master or the spark-3.0 branch.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Added regression test to `AnalysisErrorSuite` and `DataFrameSuite`.

Closes #27635 from hvanhovell/SPARK-30811-master.

Authored-by: herman <herman@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-02-19 10:17:46 -08:00
jiake 10a4eafcfe [SPARK-30812][SQL] update the skew join configs by adding the prefix "skewedJoinOptimization"
### What changes were proposed in this pull request?
This is a follow up in [PR#27563](https://github.com/apache/spark/pull/27563).
This PR adds the prefix of "skewedJoinOptimization" in the skew join related configs.

### Why are the changes needed?
Address the remaining review comments.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Only updates configs; no new UT is needed.

Closes #27630 from JkSelf/renameskewjoinconfig.

Authored-by: jiake <ke.a.jia@intel.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-19 15:55:29 +08:00
Wenchen Fan 1b67d546bd revert SPARK-29663 and SPARK-29688
### What changes were proposed in this pull request?

This PR reverts https://github.com/apache/spark/pull/26325 and https://github.com/apache/spark/pull/26347

### Why are the changes needed?

When we do sum/avg, we need a wider type of input to hold the sum value, to reduce the possibility of overflow. For example, we use long to hold the sum of integral inputs, use double to hold the sum of float/double.

However, we don't have a wider type for intervals. Also, the semantics are unclear: what if the days field overflows but the months field doesn't? Currently the avg of `1 month` and `2 month` is `1 month 15 days`, which assumes 1 month has 30 days; we should avoid this assumption.
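An illustration for spark-shell of the behavior described above; before the revert this returned `1 months 15 days`, baking in the 30-days-per-month assumption, and after the revert avg over intervals is no longer supported:

```scala
spark.sql("SELECT avg(i) FROM VALUES (interval 1 month), (interval 2 month) AS t(i)").show()
```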

### Does this PR introduce any user-facing change?

Yes, it removes 2 features added in 3.0.

### How was this patch tested?

N/A

Closes #27619 from cloud-fan/revert.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: herman <herman@databricks.com>
2020-02-18 21:19:57 +01:00
yi.wu 68d7edf949 [SPARK-30812][SQL][CORE] Revise boolean config name to comply with new config naming policy
### What changes were proposed in this pull request?

Revise below config names to comply with [new config naming policy](http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-naming-policy-of-Spark-configs-td28875.html):

SQL:
* spark.sql.execution.subquery.reuse.enabled / [SPARK-27083](https://issues.apache.org/jira/browse/SPARK-27083)
* spark.sql.legacy.allowNegativeScaleOfDecimal.enabled / [SPARK-30252](https://issues.apache.org/jira/browse/SPARK-30252)
* spark.sql.adaptive.optimizeSkewedJoin.enabled / [SPARK-29544](https://issues.apache.org/jira/browse/SPARK-29544)
* spark.sql.legacy.property.nonReserved / [SPARK-30183](https://issues.apache.org/jira/browse/SPARK-30183)
* spark.sql.streaming.forceDeleteTempCheckpointLocation.enabled / [SPARK-26389](https://issues.apache.org/jira/browse/SPARK-26389)
* spark.sql.analyzer.failAmbiguousSelfJoin.enabled / [SPARK-28344](https://issues.apache.org/jira/browse/SPARK-28344)
* spark.sql.adaptive.shuffle.reducePostShufflePartitions.enabled / [SPARK-30074](https://issues.apache.org/jira/browse/SPARK-30074)
* spark.sql.execution.pandas.arrowSafeTypeConversion / [SPARK-25811](https://issues.apache.org/jira/browse/SPARK-25811)
* spark.sql.legacy.looseUpcast / [SPARK-24586](https://issues.apache.org/jira/browse/SPARK-24586)
* spark.sql.legacy.arrayExistsFollowsThreeValuedLogic / [SPARK-28052](https://issues.apache.org/jira/browse/SPARK-28052)
* spark.sql.sources.ignoreDataLocality.enabled / [SPARK-29189](https://issues.apache.org/jira/browse/SPARK-29189)
* spark.sql.adaptive.shuffle.fetchShuffleBlocksInBatch.enabled / [SPARK-9853](https://issues.apache.org/jira/browse/SPARK-9853)

CORE:
* spark.eventLog.erasureCoding.enabled / [SPARK-25855](https://issues.apache.org/jira/browse/SPARK-25855)
* spark.shuffle.readHostLocalDisk.enabled / [SPARK-30235](https://issues.apache.org/jira/browse/SPARK-30235)
* spark.scheduler.listenerbus.logSlowEvent.enabled / [SPARK-29001](https://issues.apache.org/jira/browse/SPARK-29001)
* spark.resources.coordinate.enable / [SPARK-27371](https://issues.apache.org/jira/browse/SPARK-27371)
* spark.eventLog.logStageExecutorMetrics.enabled / [SPARK-23429](https://issues.apache.org/jira/browse/SPARK-23429)

### Why are the changes needed?

To comply with the config naming policy.

### Does this PR introduce any user-facing change?

No. Configurations listed above are all newly added in Spark 3.0.

### How was this patch tested?

Pass Jenkins.

Closes #27563 from Ngone51/revise_boolean_conf_name.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-18 20:39:50 +08:00
yi.wu 643a480b11 [SPARK-30863][SQL] Distinguish Cast and AnsiCast in toString
### What changes were proposed in this pull request?

Prefix with `ansi_` in `toString` if it's an `AnsiCast` or an ANSI-enabled `Cast`.

E.g. run `spark.sql("select cast('51' as int)").queryExecution.analyzed` under ansi mode.

Before this PR:
```
Project [cast(51 as int) AS CAST(51 AS INT)#0]
+- OneRowRelation
```

After this PR:
```
Project [ansi_cast(51 as int) AS CAST(51 AS INT)#0]
+- OneRowRelation
```

### Why are the changes needed?

This is useful while comparing `LogicalPlan`s literally.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Pass Jenkins.

Closes #27608 from Ngone51/ansi_cast_tostring.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-18 16:10:43 +08:00
HyukjinKwon 9618806f44 [SPARK-30847][SQL] Take productPrefix into account in MurmurHash3.productHash
### What changes were proposed in this pull request?

This PR proposes to port Scala's bugfix https://github.com/scala/scala/pull/7693 (Scala 2.13) to address https://github.com/scala/bug/issues/10495 issue.

In short, it is possible for different product instances having the same children to have the same hash. See:

```scala
scala> spark.range(1).selectExpr("id - 1").queryExecution.analyzed.semanticHash()
res0: Int = -565572825

scala> spark.range(1).selectExpr("id + 1").queryExecution.analyzed.semanticHash()
res1: Int = -565572825
```

### Why are the changes needed?

It was found during the review of https://github.com/apache/spark/pull/27565. We should produce different hashes for different objects.

### Does this PR introduce any user-facing change?

No user-facing change identified. Possibly a performance-related issue.

### How was this patch tested?

Manually tested, and a unit test was added.

Closes #27601 from HyukjinKwon/SPARK-30847.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-18 14:28:44 +08:00
Terry Kim 5866bc77d7 [SPARK-30814][SQL] ALTER TABLE ... ADD COLUMN position should be able to reference columns being added
### What changes were proposed in this pull request?

In ALTER TABLE, a column in ADD COLUMNS can depend on the position of a column that is just being added. For example, for a table with the following schema:
```
root:
  - a: string
  - b: long
```
, the following should work:
```
ALTER TABLE t ADD COLUMNS (x int AFTER a, y int AFTER x)
```
Currently, the above statement will throw an exception saying that AFTER x cannot be resolved, because x doesn't exist yet. This PR proposes to fix this issue.

### Why are the changes needed?

To fix a bug described above.

### Does this PR introduce any user-facing change?

Yes, now
```
ALTER TABLE t ADD COLUMNS (x int AFTER a, y int AFTER x)
```
works as expected.

### How was this patch tested?

Added new tests

Closes #27584 from imback82/alter_table_pos_fix.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-18 13:01:45 +08:00
beliefer d8d3ce5c76 [SPARK-30825][SQL][DOC] Update documents information for window function
### What changes were proposed in this pull request?
I checked all the window functions and found that none of them add parameter information or version information to the documentation.
This PR supplements that.

### Why are the changes needed?
Documentation is missing and does not meet new standards.

### Does this PR introduce any user-facing change?
Yes. Users will see the parameter and version information.

### How was this patch tested?
Existing UTs

Closes #27572 from beliefer/add_since_for_window_function.

Authored-by: beliefer <beliefer@163.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-02-18 09:44:34 +09:00
Yuanjian Li e4a541b278 [SPARK-30829][SQL] Define LegacyBehaviorPolicy enumeration as the common value for result change configs
### What changes were proposed in this pull request?
Define a new enumeration `LegacyBehaviorPolicy` in SQLConf, it will be used as the common value for result change configs.
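A sketch of what such a three-value enumeration can look like; the concrete value names used here are an assumption for illustration, not quoted from this PR:

```scala
object LegacyBehaviorPolicy extends Enumeration {
  // EXCEPTION: fail on ambiguous cases; LEGACY: keep the old behavior; CORRECTED: use the new behavior
  val EXCEPTION, LEGACY, CORRECTED = Value
}

// e.g. a result-changing config would accept one of these values:
// spark.conf.set("spark.sql.legacy.ctePrecedencePolicy", LegacyBehaviorPolicy.CORRECTED.toString)
```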

### Why are the changes needed?
During API auditing for the 3.0 release, we found several new approaches that will change the results silently. For these features, we need a common three-value config.

### Does this PR introduce any user-facing change?
Yes, the original config `spark.sql.legacy.ctePrecedence.enabled` is changed to `spark.sql.legacy.ctePrecedencePolicy`.

### How was this patch tested?
Existing UT.

Closes #27579 from xuanyuanking/SPARK-30829.

Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-18 00:52:05 +08:00
Maxim Gekk 06217cfded [SPARK-30793][SQL] Fix truncations of timestamps before the epoch to minutes and seconds
### What changes were proposed in this pull request?
In the PR, I propose to replace `%` by `Math.floorMod` in `DateTimeUtils.truncTimestamp` for the `SECOND` and `MINUTE` levels.
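A standalone illustration (not the Spark code) of why `%` breaks truncation before the epoch: the remainder operator keeps the sign of the dividend, while `Math.floorMod` always returns a non-negative remainder.

```scala
val MICROS_PER_SECOND = 1000000L
val micros = -1500000L  // 1969-12-31 23:59:58.5 UTC, i.e. 1.5 seconds before the epoch

val withRem   = micros - micros % MICROS_PER_SECOND                // -1000000 -> 23:59:59 (one second too late)
val withFloor = micros - Math.floorMod(micros, MICROS_PER_SECOND)  // -2000000 -> 23:59:58 (correct)
```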

### Why are the changes needed?
This fixes the issue of incorrect truncation of timestamps before the epoch `1970-01-01T00:00:00.000000Z` to the `SECOND` and `MINUTE` levels. For example, timestamps after the epoch are truncated by simply cutting off the remaining lower-order part of the timestamp:
```sql
spark-sql> select date_trunc('SECOND', '2020-02-11 00:01:02.123');
2020-02-11 00:01:02
```
but seconds in the truncated timestamp before the epoch are increased by 1:
```sql
spark-sql> select date_trunc('SECOND', '1960-02-11 00:01:02.123');
1960-02-11 00:01:03
```

### Does this PR introduce any user-facing change?
Yes. After the changes, the example above outputs correct result:
```sql
spark-sql> select date_trunc('SECOND', '1960-02-11 00:01:02.123');
1960-02-11 00:01:02
```

### How was this patch tested?
Added new tests to `DateFunctionsSuite`.

Closes #27543 from MaxGekk/fix-second-minute-truc.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-17 22:51:56 +08:00
Yuanjian Li ab186e3659 [SPARK-25829][SQL] Add config spark.sql.legacy.allowDuplicatedMapKeys and change the default behavior
### What changes were proposed in this pull request?
This is a follow-up for #23124; it adds a new config `spark.sql.legacy.allowDuplicatedMapKeys` to control the behavior of removing duplicated map keys in built-in functions. With the default value `false`, Spark throws a RuntimeException when duplicated keys are found (see the sketch below).
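A hedged usage sketch for spark-shell, using the config name given in this PR's description:

```scala
spark.sql("SELECT map(1, 'a', 1, 'b')").show()
// default (false): fails with a RuntimeException because the key 1 is duplicated

spark.conf.set("spark.sql.legacy.allowDuplicatedMapKeys", "true")
spark.sql("SELECT map(1, 'a', 1, 'b')").show()
// legacy (true): duplicated keys are removed instead of failing, per the description above
```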

### Why are the changes needed?
Prevent silent behavior changes.

### Does this PR introduce any user-facing change?
Yes, new config added and the default behavior for duplicated map keys changed to RuntimeException thrown.

### How was this patch tested?
Modify existing UT.

Closes #27478 from xuanyuanking/SPARK-25892-follow.

Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-17 22:06:58 +08:00
Maxim Gekk 9107f77f15 [SPARK-30843][SQL] Fix getting of time components before 1582 year
### What changes were proposed in this pull request?

1. Rewrite the DateTimeUtils methods `getHours()`, `getMinutes()`, `getSeconds()`, `getSecondsWithFraction()`, `getMilliseconds()` and `getMicroseconds()` using the Java 8 time API (a minimal sketch follows this list). This automatically switches the `Hour`, `Minute`, `Second` and `DatePart` expressions to the Proleptic Gregorian calendar.
2. Remove unused methods and constants of DateTimeUtils - `to2001`, `YearZero`, `toYearZero` and `absoluteMicroSecond()`.
3. Remove the unused value `timeZone` from `TimeZoneAwareExpression`, since all expressions have been migrated to the Java 8 time API and a legacy `TimeZone` instance is not needed any more.
4. Change the signatures of the modified DateTimeUtils methods to accept a `ZoneId` instead of a `TimeZone`. This avoids unnecessary `TimeZone` -> `String` -> `ZoneId` conversions.
5. Modify tests in `DateTimeUtilsSuite` and `DateExpressionsSuite` to pass a `ZoneId` instead of a `TimeZone`. Correct the tests to pass the tested zone id instead of None.
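A minimal sketch of extracting a time field on the Proleptic Gregorian calendar via the Java 8 time API; the helper name and signature are illustrative, not the actual `DateTimeUtils` code:

```scala
import java.time.{Instant, ZoneId}

def getHours(micros: Long, zoneId: ZoneId): Int = {
  val secs  = Math.floorDiv(micros, 1000000L)
  val nanos = Math.floorMod(micros, 1000000L) * 1000L
  Instant.ofEpochSecond(secs, nanos).atZone(zoneId).getHour
}

// getHours(0L, ZoneId.of("UTC"))  // 0, i.e. 1970-01-01 00:00:00 UTC
```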

### Why are the changes needed?
The changes fix the issue of wrong results returned by the `hour()`, `minute()`, `second()`, `date_part('millisecond', ...)` and `date_part('microsecond', ....)`, see example in [SPARK-30843](https://issues.apache.org/jira/browse/SPARK-30843).

### Does this PR introduce any user-facing change?
Yes. After the changes, the results of examples from SPARK-30843:
```sql
spark-sql> select hour(timestamp '0010-01-01 00:00:00');
0
spark-sql> select minute(timestamp '0010-01-01 00:00:00');
0
spark-sql> select second(timestamp '0010-01-01 00:00:00');
0
spark-sql> select date_part('milliseconds', timestamp '0010-01-01 00:00:00');
0.000
spark-sql> select date_part('microseconds', timestamp '0010-01-01 00:00:00');
0
```

### How was this patch tested?
- By existing test suites `DateTimeUtilsSuite`, `DateExpressionsSuite` and `DateFunctionsSuite`.
- Add new tests to `DateExpressionsSuite` and `DateTimeUtilsSuite` for year 10, like:
```scala
  input = date(10, 1, 1, 0, 0, 0, 0, zonePST)
  assert(getHours(input, zonePST) === 0)
```
- Re-run `DateTimeBenchmark` using Amazon EC2.

| Item | Description |
| ---- | ----|
| Region | us-west-2 (Oregon) |
| Instance | r3.xlarge |
| AMI | ami-06f2f779464715dc5 (ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1) |
| Java | OpenJDK8/11 |

Closes #27596 from MaxGekk/localtimestamp-greg-cal.

Lead-authored-by: Maxim Gekk <max.gekk@gmail.com>
Co-authored-by: Max Gekk <max.gekk@gmail.com>
Co-authored-by: Ubuntu <ubuntu@ip-172-31-1-30.us-west-2.compute.internal>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-17 13:59:21 +08:00
Wenchen Fan ab07c6300c [SPARK-30799][SQL] "spark_catalog.t" should not be resolved to temp view
### What changes were proposed in this pull request?

No v2 command supports temp views and the `ResolveCatalogs`/`ResolveSessionCatalog` framework is designed with this assumption.

However, `ResolveSessionCatalog` needs to fallback to v1 commands, which do support temp views (e.g. CACHE TABLE). To work around it, we add a hack in `CatalogAndIdentifier`, which does not expand the given identifier with current namespace if the catalog is session catalog.

This works fine in most cases, as temp views should take precedence over tables during lookup. So if `CatalogAndIdentifier` returns a single name "t", the v1 commands can still resolve it to temp views correctly, or resolve it to table "default.t" if temp view doesn't exist.

However, if users write `spark_catalog.t`, it shouldn't be resolved to temp views as temp views don't belong to any catalog. `CatalogAndIdentifier` can't distinguish between `spark_catalog.t` and `t`, so the caller side may mistakenly resolve `spark_catalog.t` to a temp view.

This PR proposes to fix this issue by
1. remove the hack in `CatalogAndIdentifier`, and clearly document that this shouldn't be used to resolve temp views.
2. update `ResolveSessionCatalog` to explicitly look up temp views first, before calling `CatalogAndIdentifier`, for v1 commands that support temp views (see the sketch below).
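A hedged illustration for spark-shell of the intended resolution behavior: temp views do not belong to any catalog, so the `spark_catalog` prefix must not resolve to them.

```scala
spark.range(1).createOrReplaceTempView("t")
spark.sql("SELECT * FROM t").show()        // resolves to the temp view
spark.sql("CACHE TABLE spark_catalog.t")   // after this PR: looks up the table spark_catalog.default.t,
                                           // never the temp view `t`
```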

### Why are the changes needed?

To avoid releasing a behavior that we should not support.

Removing the hack also fixes the problem we hit in https://github.com/apache/spark/pull/27532/files#diff-57b3d87be744b7d79a9beacf8e5e5eb2R937

### Does this PR introduce any user-facing change?

yes, now it's not allowed to refer to a temp view with `spark_catalog` prefix.

### How was this patch tested?

new tests

Closes #27550 from cloud-fan/ns.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-17 12:07:46 +08:00
DB Tsai d0f9614760 [SPARK-30289][SQL] Partitioned by Nested Column for InMemoryTable
### What changes were proposed in this pull request?
1. `InMemoryTable` was flattening the nested columns, and the flattened columns were then used to look up the partition indices, which is not correct.

This PR implements partitioning by nested columns for `InMemoryTable`.

### Why are the changes needed?

This PR implements partitioning by nested columns for `InMemoryTable`, so we can test this feature in DSv2.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Existing unit tests and new tests.

Closes #26929 from dbtsai/addTests.

Authored-by: DB Tsai <d_tsai@apple.com>
Signed-off-by: DB Tsai <d_tsai@apple.com>
2020-02-14 21:46:01 +00:00
Maxim Gekk 7137a6d065 [SPARK-30766][SQL] Fix the timestamp truncation to the HOUR and DAY levels
### What changes were proposed in this pull request?
In the PR, I propose to use Java 8 time API in timestamp truncations to the levels of `HOUR` and `DAY`. The problem is in the usage of `timeZone.getOffset(millis)` in days/hours truncations where the combined calendar (Julian + Gregorian) is used underneath.

### Why are the changes needed?
The change fixes wrong truncations. For example, the following truncation to hours should print `0010-01-01 01:00:00`, but it outputs a wrong timestamp:
```scala
Seq("0010-01-01 01:02:03.123456").toDF()
    .select($"value".cast("timestamp").as("ts"))
    .select(date_trunc("HOUR", $"ts").cast("string"))
    .show(false)
+------------------------------------+
|CAST(date_trunc(HOUR, ts) AS STRING)|
+------------------------------------+
|0010-01-01 01:30:17                 |
+------------------------------------+
```

### Does this PR introduce any user-facing change?
Yes. After the changes, the result of the example above is:
```scala
+------------------------------------+
|CAST(date_trunc(HOUR, ts) AS STRING)|
+------------------------------------+
|0010-01-01 01:00:00                 |
+------------------------------------+
```

### How was this patch tested?
- Added new test to `DateFunctionsSuite`
- By `DateExpressionsSuite` and `DateTimeUtilsSuite`

Closes #27512 from MaxGekk/fix-trunc-old-timestamp.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-14 22:16:57 +08:00
Yuming Wang fb0e07b08c [SPARK-29231][SQL] Constraints should be inferred from cast equality constraint
### What changes were proposed in this pull request?

This PR adds support for inferring constraints from cast equality constraints. For example:
```scala
scala> spark.sql("create table spark_29231_1(c1 bigint, c2 bigint)")
res0: org.apache.spark.sql.DataFrame = []

scala> spark.sql("create table spark_29231_2(c1 int, c2 bigint)")
res1: org.apache.spark.sql.DataFrame = []

scala> spark.sql("select t1.* from spark_29231_1 t1 join spark_29231_2 t2 on (t1.c1 = t2.c1 and t1.c1 = 1)").explain
== Physical Plan ==
*(2) Project [c1#5L, c2#6L]
+- *(2) BroadcastHashJoin [c1#5L], [cast(c1#7 as bigint)], Inner, BuildRight
   :- *(2) Project [c1#5L, c2#6L]
   :  +- *(2) Filter (isnotnull(c1#5L) AND (c1#5L = 1))
   :     +- *(2) ColumnarToRow
   :        +- FileScan parquet default.spark_29231_1[c1#5L,c2#6L] Batched: true, DataFilters: [isnotnull(c1#5L), (c1#5L = 1)], Format: Parquet, Location: InMemoryFileIndex[file:/root/spark-3.0.0-preview2-bin-hadoop2.7/spark-warehouse/spark_29231_1], PartitionFilters: [], PushedFilters: [IsNotNull(c1), EqualTo(c1,1)], ReadSchema: struct<c1:bigint,c2:bigint>
   +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint))), [id=#209]
      +- *(1) Project [c1#7]
         +- *(1) Filter isnotnull(c1#7)
            +- *(1) ColumnarToRow
               +- FileScan parquet default.spark_29231_2[c1#7] Batched: true, DataFilters: [isnotnull(c1#7)], Format: Parquet, Location: InMemoryFileIndex[file:/root/spark-3.0.0-preview2-bin-hadoop2.7/spark-warehouse/spark_29231_2], PartitionFilters: [], PushedFilters: [IsNotNull(c1)], ReadSchema: struct<c1:int>
```

After this PR:
```scala
scala> spark.sql("select t1.* from spark_29231_1 t1 join spark_29231_2 t2 on (t1.c1 = t2.c1 and t1.c1 = 1)").explain
== Physical Plan ==
*(2) Project [c1#0L, c2#1L]
+- *(2) BroadcastHashJoin [c1#0L], [cast(c1#2 as bigint)], Inner, BuildRight
   :- *(2) Project [c1#0L, c2#1L]
   :  +- *(2) Filter (isnotnull(c1#0L) AND (c1#0L = 1))
   :     +- *(2) ColumnarToRow
   :        +- FileScan parquet default.spark_29231_1[c1#0L,c2#1L] Batched: true, DataFilters: [isnotnull(c1#0L), (c1#0L = 1)], Format: Parquet, Location: InMemoryFileIndex[file:/root/opensource/spark/spark-warehouse/spark_29231_1], PartitionFilters: [], PushedFilters: [IsNotNull(c1), EqualTo(c1,1)], ReadSchema: struct<c1:bigint,c2:bigint>
   +- BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint))), [id=#99]
      +- *(1) Project [c1#2]
         +- *(1) Filter ((cast(c1#2 as bigint) = 1) AND isnotnull(c1#2))
            +- *(1) ColumnarToRow
               +- FileScan parquet default.spark_29231_2[c1#2] Batched: true, DataFilters: [(cast(c1#2 as bigint) = 1), isnotnull(c1#2)], Format: Parquet, Location: InMemoryFileIndex[file:/root/opensource/spark/spark-warehouse/spark_29231_2], PartitionFilters: [], PushedFilters: [IsNotNull(c1)], ReadSchema: struct<c1:int>
```

### Why are the changes needed?

Improve query performance.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Unit test.

Closes #27252 from wangyum/SPARK-29231.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-13 22:48:27 +08:00
Terry Kim a6b4b914f2 [SPARK-30613][SQL] Support Hive style REPLACE COLUMNS syntax
### What changes were proposed in this pull request?

This PR proposes to support Hive-style `ALTER TABLE ... REPLACE COLUMNS ...` as described in https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-Add/ReplaceColumns

The user now can do the following:
```SQL
CREATE TABLE t (col1 int, col2 int) USING Foo;
ALTER TABLE t REPLACE COLUMNS (col2 string COMMENT 'comment2', col3 int COMMENT 'comment3');
```
, which drops the existing columns `col1` and `col2`, and add new columns `col2` and `col3`.

### Why are the changes needed?

This is a new DDL statement. Spark currently supports the Hive-style `ALTER TABLE ... CHANGE COLUMN ...`, so this new addition can be useful.

### Does this PR introduce any user-facing change?

Yes, adding a new DDL statement.

### How was this patch tested?

More tests to be added.

Closes #27482 from imback82/replace_cols.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-13 20:13:36 +08:00
maryannxue 453d5261b2 [SPARK-30528][SQL] Turn off DPP subquery duplication by default
### What changes were proposed in this pull request?
This PR adds a config for Dynamic Partition Pruning subquery duplication and turns it off by default due to its potential performance regression.
When planning a DPP filter, it seeks to reuse the broadcast exchange relation if the corresponding join is a BHJ with the filter relation on the build side; otherwise it will either opt out or plan the filter as a non-reusable subquery duplication, based on the cost estimate. However, the cost estimate is not accurate and only takes into account the table scan overhead, so adding a non-reusable subquery-duplication DPP filter can sometimes cause a perf regression.
This PR turns off the subquery duplication DPP filter by:
1. adding a config `spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcastOnly` and setting it `true` by default.
2. removing the existing meaningless config `spark.sql.optimizer.dynamicPartitionPruning.reuseBroadcast` since we always want to reuse broadcast results if possible.

### Why are the changes needed?
This is to fix a potential performance regression caused by DPP.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Updated DynamicPartitionPruningSuite to test the new configuration.

Closes #27551 from maryannxue/spark-30528.

Authored-by: maryannxue <maryannxue@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-13 19:32:38 +08:00
iRakson 926e3a1efe [SPARK-30790] The dataType of map() should be map<null,null>
### What changes were proposed in this pull request?

`spark.sql("select map()")` returns {}.

After these changes it will return map<null,null>

### Why are the changes needed?
After the changes introduced by #27521, it is important to maintain consistency when using map().

### Does this PR introduce any user-facing change?
Yes. Now map() will give map<null,null> instead of {}.

### How was this patch tested?
UT added. Migration guide updated as well

Closes #27542 from iRakson/SPARK-30790.

Authored-by: iRakson <raksonrakesh@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-13 12:23:40 +08:00
Maxim Gekk aa0d13683c [SPARK-30760][SQL] Port millisToDays and daysToMillis on Java 8 time API
### What changes were proposed in this pull request?
In the PR, I propose to rewrite the `millisToDays` and `daysToMillis` of `DateTimeUtils` using Java 8 time API.

I removed `getOffsetFromLocalMillis` from `DateTimeUtils` because it is a private method and is not used anymore in Spark SQL.
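An illustrative sketch (not the actual `DateTimeUtils` code) of the Proleptic Gregorian computation via the Java 8 time API:

```scala
import java.time.{Instant, LocalDate, ZoneId}

def millisToDays(millis: Long, zoneId: ZoneId): Int =
  Instant.ofEpochMilli(millis).atZone(zoneId).toLocalDate.toEpochDay.toInt

def daysToMillis(days: Int, zoneId: ZoneId): Long =
  LocalDate.ofEpochDay(days).atStartOfDay(zoneId).toInstant.toEpochMilli
```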

### Why are the changes needed?
The new implementation is based on the Proleptic Gregorian calendar, which is already used by other date-time functions. This change makes `millisToDays` and `daysToMillis` consistent with the rest of the Spark SQL API related to date & time operations.

### Does this PR introduce any user-facing change?
Yes, this might affect behavior for old dates before the year 1582.

### How was this patch tested?
By existing test suites `DateTimeUtilsSuite`, `DateFunctionsSuite`, `DateExpressionsSuite`, `SQLQuerySuite` and `HiveResultSuite`.

Closes #27494 from MaxGekk/millis-2-days-java8-api.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-13 02:31:48 +08:00
Maxim Gekk 61b1e608f0 [SPARK-30759][SQL][TESTS][FOLLOWUP] Check cache initialization in StringRegexExpression
### What changes were proposed in this pull request?
Added new test to `RegexpExpressionsSuite` which checks that `cache` of compiled pattern is set when the `right` expression (pattern in `LIKE`) is a foldable expression.

### Why are the changes needed?
To be sure that `cache` in `StringRegexExpression` is initialized for foldable patterns.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
By running the added test in `RegexpExpressionsSuite`.

Closes #27547 from MaxGekk/regexp-cache-test.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-12 23:50:34 +08:00
Maxim Gekk c1986204e5 [SPARK-30788][SQL] Support SimpleDateFormat and FastDateFormat as legacy date/timestamp formatters
### What changes were proposed in this pull request?
In the PR, I propose to add legacy date/timestamp formatters based on `SimpleDateFormat` and `FastDateFormat`:
- `LegacyFastTimestampFormatter` - uses `FastDateFormat` and supports parsing/formatting in microsecond precision. The code was borrowed from Spark 2.4, see https://github.com/apache/spark/pull/26507 & https://github.com/apache/spark/pull/26582
- `LegacySimpleTimestampFormatter` uses `SimpleDateFormat` and supports the `lenient` mode. When the `lenient` parameter is set to `false`, the parser becomes much stricter in checking its input.

### Why are the changes needed?
Spark 2.4.x uses the following parsers for parsing/formatting date/timestamp strings:
- `DateTimeFormat` in CSV/JSON datasource
- `SimpleDateFormat` - is used in JDBC datasource, in partitions parsing.
- `SimpleDateFormat` in strong mode (`lenient = false`), see https://github.com/apache/spark/blob/branch-2.4/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/DateTimeUtils.scala#L124. It is used by the `date_format`, `from_unixtime`, `unix_timestamp` and `to_unix_timestamp` functions.

The PR aims to make Spark 3.0 compatible with Spark 2.4.x in all those cases when `spark.sql.legacy.timeParser.enabled` is set to `true`.

### Does this PR introduce any user-facing change?
This shouldn't change behavior with default settings. If `spark.sql.legacy.timeParser.enabled` is set to `true`, users should observe behavior of Spark 2.4.

### How was this patch tested?
- Modified tests in `DateExpressionsSuite` to check the legacy parser - `SimpleDateFormat`.
- Added `CSVLegacyTimeParserSuite` and `JsonLegacyTimeParserSuite` to run `CSVSuite` and `JsonSuite` with the legacy parser - `FastDateFormat`.

Closes #27524 from MaxGekk/timestamp-formatter-legacy-fallback.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-12 20:12:38 +08:00
beliefer f5026b1ba7 [SPARK-30763][SQL] Fix java.lang.IndexOutOfBoundsException No group 1 for regexp_extract
### What changes were proposed in this pull request?
The current implementation of `regexp_extract` throws an unprocessed exception, shown below:

`SELECT regexp_extract('1a 2b 14m', 'd+')`
```
java.lang.IndexOutOfBoundsException: No group 1
[info] at java.util.regex.Matcher.group(Matcher.java:538)
[info] at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
[info] at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
[info] at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:729)
```
I think we should handle this exception properly (one possible shape of the fix is sketched below).
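A sketch of one possible shape of the fix, offered as an assumption since the description above does not spell out the implementation: validate the requested group index against the compiled pattern before extracting.

```scala
import java.util.regex.Pattern

def extract(input: String, regex: String, groupIdx: Int = 1): String = {
  val m = Pattern.compile(regex).matcher(input)
  if (groupIdx > m.groupCount()) {
    throw new IllegalArgumentException(
      s"Regex group count is ${m.groupCount()}, but the specified group index is $groupIdx")
  }
  if (m.find()) m.group(groupIdx) else ""
}
```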

### Why are the changes needed?
Fix the bug `java.lang.IndexOutOfBoundsException: No group 1`.

### Does this PR introduce any user-facing change?
Yes

### How was this patch tested?
New UT

Closes #27508 from beliefer/fix-regexp_extract-bug.

Authored-by: beliefer <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-12 14:49:22 +08:00
Kris Mok b4769998ef [SPARK-30795][SQL] Spark SQL codegen's code() interpolator should treat escapes like Scala's StringContext.s()
### What changes were proposed in this pull request?

This PR proposes to make the `code` string interpolator treat escapes the same way as Scala's builtin `StringContext.s()` string interpolator. This will remove the need for an ugly workaround in `Like` expression's codegen.

### Why are the changes needed?

The `code()` string interpolator in Spark SQL's code generator should treat escapes like Scala's builtin `StringContext.s()` interpolator, i.e. it should treat escapes in the code parts, and should not treat escapes in the input arguments.

For example,
```scala
val arg = "This is an argument."
val str = s"This is string part 1. $arg This is string part 2."
val code = code"This is string part 1. $arg This is string part 2."
assert(code.toString == str)
```
We should expect the `code()` interpolator to produce the same result as the `StringContext.s()` interpolator, where only escapes in the string parts should be treated, while the args should be kept verbatim.

But in the current implementation, due to the eager folding of code parts and literal input args, the escape treatment is incorrectly done on both code parts and literal args.
That causes a problem when an arg contains escape sequences and wants to preserve that in the final produced code string. For example, in `Like` expression's codegen, there's an ugly workaround for this bug:
```scala
      // We need double escape to avoid org.codehaus.commons.compiler.CompileException.
      // '\\' will cause exception 'Single quote must be backslash-escaped in character literal'.
      // '\"' will cause exception 'Line break in literal not allowed'.
      val newEscapeChar = if (escapeChar == '\"' || escapeChar == '\\') {
        s"""\\\\\\$escapeChar"""
      } else {
        escapeChar
      }
```

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Added a new unit test case in `CodeBlockSuite`.

Closes #27544 from rednaxelafx/fix-code-string-interpolator.

Authored-by: Kris Mok <kris.mok@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-02-12 15:19:16 +09:00
Maxim Gekk 45db48e2d2 Revert "[SPARK-30625][SQL] Support escape as third parameter of the like function
### What changes were proposed in this pull request?

In the PR, I propose to revert the commit 8aebc80e0e.

### Why are the changes needed?
See the concerns https://github.com/apache/spark/pull/27355#issuecomment-584344438

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
By existing test suites.

Closes #27531 from MaxGekk/revert-like-3-args.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-02-11 10:15:34 -08:00
HyukjinKwon 99bd59fe29 [SPARK-29462][SQL][DOCS] Add some more context and details in 'spark.sql.defaultUrlStreamHandlerFactory.enabled' documentation
### What changes were proposed in this pull request?

This PR adds some more information and context to `spark.sql.defaultUrlStreamHandlerFactory.enabled`.

### Why are the changes needed?

It is a bit difficult to understand the documentation of `spark.sql.defaultUrlStreamHandlerFactory.enabled`.

### Does this PR introduce any user-facing change?

Nope, internal doc only fix.

### How was this patch tested?

Nope. I only tested linter.

Closes #27541 from HyukjinKwon/SPARK-29462-followup.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-02-11 09:55:02 -08:00
Maxim Gekk dc66d57e98 [SPARK-30754][SQL] Reuse results of floorDiv in calculations of floorMod in DateTimeUtils
### What changes were proposed in this pull request?
In the case of back-to-back calculation of `floorDiv` and `floorMod` with the same arguments, the result of `foorDiv` can be reused in calculation of `floorMod`. The `floorMod` method is defined as the following in Java standard library:
```java
    public static int floorMod(int x, int y) {
        int r = x - floorDiv(x, y) * y;
        return r;
    }
```
If `floorDiv(x, y)` has been already calculated, it can be reused in `x - floorDiv(x, y) * y`.

I propose to modify 2 places in `DateTimeUtils` (a sketch follows the list):
1. `microsToInstant` which is widely used in many date-time functions. `Math.floorMod(us, MICROS_PER_SECOND)` is just replaced by its definition from Java Math library.
2. `truncDate`: `Math.floorMod(oldYear, divider) == 0` is replaced by `Math.floorDiv(oldYear, divider) * divider == oldYear` where `floorDiv(...) * divider` is pre-calculated.
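A sketch of the rewrite with illustrative helper names (not the exact `DateTimeUtils` code): reuse the `floorDiv` result instead of also calling `floorMod`, which would redo the division.

```scala
val MICROS_PER_SECOND = 1000000L

// Before: floorMod internally performs another floorDiv with the same arguments.
def splitMicrosOld(us: Long): (Long, Long) =
  (Math.floorDiv(us, MICROS_PER_SECOND), Math.floorMod(us, MICROS_PER_SECOND))

// After: reuse the quotient, saving one division while producing the same values.
def splitMicrosNew(us: Long): (Long, Long) = {
  val secs = Math.floorDiv(us, MICROS_PER_SECOND)
  (secs, us - secs * MICROS_PER_SECOND)
}
```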

### Why are the changes needed?
This reduces the number of arithmetic operations, and can slightly improve performance of date-time functions.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
By existing test suites `DateTimeUtilsSuite`, `DateFunctionsSuite` and `DateExpressionsSuite`.

Closes #27491 from MaxGekk/opt-microsToInstant.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-02-11 09:07:40 -06:00
HyukjinKwon 0045be766b [SPARK-29462][SQL] The data type of "array()" should be array<null>
### What changes were proposed in this pull request?

This brings https://github.com/apache/spark/pull/26324 back. It was reverted mainly because of Hive compatibility concerns and the lack of investigation into other DBMSes and the ANSI standard.

- PostgreSQL seems to coerce a NULL literal to the TEXT type.
- Presto seems to coerce `array() + array(1)` -> array of int.
- Hive seems to coerce `array() + array(1)` -> array of strings.

Given that, different design choices have been made for various reasons. If we pick one of the two, coercing to an array of int seems to make much more sense.

Another investigation was made offline internally. Seems ANSI SQL 2011, section 6.5 "<contextually typed value specification>" states:

> If ES is specified, then let ET be the element type determined by the context in which ES appears. The declared type DT of ES is Case:
>
> a) If ES simply contains ARRAY, then ET ARRAY[0].
>
> b) If ES simply contains MULTISET, then ET MULTISET.
>
> ES is effectively replaced by CAST ( ES AS DT )

From reading other related context, the element type should be `NullType`. Given the investigation made, choosing `null` seems correct, and we now have Presto as a reference. Therefore, this PR proposes to bring it back.

### Why are the changes needed?
When empty array is created, it should be declared as array<null>.

### Does this PR introduce any user-facing change?
Yes, `array()` creates `array<null>`. Now `array(1) + array()` can correctly create `array(1)` instead of `array("1")`.

### How was this patch tested?
Tested manually

Closes #27521 from HyukjinKwon/SPARK-29462.

Lead-authored-by: HyukjinKwon <gurwls223@apache.org>
Co-authored-by: Aman Omer <amanomer1996@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-02-11 17:22:08 +09:00
Shixiong Zhu e2ebca733c [SPARK-30779][SS] Fix some API issues found when reviewing Structured Streaming API docs
### What changes were proposed in this pull request?

- Fix the scope of `Logging.initializeForcefully` so that it doesn't appear in subclasses' public methods. Right now, `sc.initializeForcefully(false, false)` is allowed to be called.
- Don't show classes under `org.apache.spark.internal` package in API docs.
- Add missing `since` annotation.
- Fix the scope of `ArrowUtils` to remove it from the API docs.

### Why are the changes needed?

Avoid leaking APIs unintentionally in Spark 3.0.0.

### Does this PR introduce any user-facing change?

No. All these changes are to avoid leaking APIs unintentionally in Spark 3.0.0.

### How was this patch tested?

Manually generated the API docs and verified the above issues have been fixed.

Closes #27528 from zsxwing/audit-ss-apis.

Authored-by: Shixiong Zhu <zsxwing@gmail.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
2020-02-10 14:26:14 -08:00
Maxim Gekk 3c1c9b48fc [SPARK-30759][SQL] Initialize cache for foldable patterns in StringRegexExpression
### What changes were proposed in this pull request?
In the PR, I propose to fix `cache` initialization in `StringRegexExpression` by changing `case Literal(value: String, StringType)` to `case p: Expression if p.foldable`

### Why are the changes needed?
Actually, the case doesn't work at all because of:
1. Literal values have the type `UTF8String`
2. It doesn't work for foldable expressions like in the example:
```sql
SELECT '%SystemDrive%\Users\John' _FUNC_ '%SystemDrive%\\Users.*';
```
![Screen Shot 2020-02-08 at 22 45 50](https://user-images.githubusercontent.com/1580697/74091681-0d4a2180-4acb-11ea-8a0d-7e8c65f4214e.png)

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
By the `check outputs of expression examples` test from `SQLQuerySuite`.

Closes #27502 from MaxGekk/str-regexp-foldable-pattern.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-02-10 12:51:37 -08:00
HyukjinKwon 4439b29bd2 Revert "[SPARK-30245][SQL] Add cache for Like and RLike when pattern is not static"
### What changes were proposed in this pull request?

This reverts commit 8ce7962931. There are variable name conflicts with 8aebc80e0e (diff-39298b470865a4cbc67398a4ea11e767).

This can be cleanly ported back to branch-3.0.

### Why are the changes needed?
The performance investigation was not thorough enough, and it's not clear whether the change is really beneficial or not.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Jenkins tests.

Closes #27514 from HyukjinKwon/revert-cache-PR.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
2020-02-10 10:56:43 -08:00
Eric Wu b2011a295b [SPARK-30326][SQL] Raise exception if analyzer exceed max iterations
### What changes were proposed in this pull request?
Enhance the RuleExecutor strategy to take different actions when exceeding the max iterations, and raise an exception if the analyzer exceeds the max iterations.

### Why are the changes needed?
Currently, both the analyzer and the optimizer just log a warning message if rule execution exceeds the max iterations. They should behave differently: the analyzer should raise an exception to indicate that the plan is not fixed after the max iterations, while the optimizer should just log a warning and keep the current plan. This is more feasible after SPARK-30138 was introduced.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Add test in AnalysisSuite

Closes #26977 from Eric5553/EnhanceMaxIterations.

Authored-by: Eric Wu <492960551@qq.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-10 23:41:39 +08:00
Terry Kim 70e545a94d [SPARK-30757][SQL][DOC] Update the doc on TableCatalog.alterTable's behavior
### What changes were proposed in this pull request?

This PR updates the documentation of `TableCatalog.alterTable`'s behavior regarding the order in which the requested changes are applied. It now explicitly states that the changes are applied in the order given.

### Why are the changes needed?

The current documentation on `TableCatalog.alterTable` doesn't mention in which order the requested changes are applied. It is useful to document this behavior explicitly so that users know what to expect. For example, `REPLACE COLUMNS` needs to delete columns before adding new columns, and if the order is guaranteed by `alterTable`, it's much easier to work with the catalog API.

### Does this PR introduce any user-facing change?

Yes, document change.

### How was this patch tested?

Not added (doc changes).

Closes #27496 from imback82/catalog_table_alter_table.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-10 19:04:49 +08:00
Liang-Chi Hsieh 9f8172e96a Revert "[SPARK-29721][SQL] Prune unnecessary nested fields from Generate without Project
This reverts commit a0e63b61e7.

### What changes were proposed in this pull request?

This reverts the patch at #26978 based on gatorsmile's suggestion.

### Why are the changes needed?

The original patch #26978 did not consider a corner case. We may need to spend more time ensuring we cover all cases.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Unit test.

Closes #27504 from viirya/revert-SPARK-29721.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
2020-02-09 19:45:16 -08:00
Nicholas Chammas 339c0f9a62 [SPARK-30510][SQL][DOCS] Publicly document Spark SQL configuration options
### What changes were proposed in this pull request?

This PR adds a doc builder for Spark SQL's configuration options.

Here's what the new Spark SQL config docs look like ([configuration.html.zip](https://github.com/apache/spark/files/4172109/configuration.html.zip)):

![Screen Shot 2020-02-07 at 12 13 23 PM](https://user-images.githubusercontent.com/1039369/74050007-425b5480-49a3-11ea-818c-42700c54d1fb.png)

Compare this to the [current docs](http://spark.apache.org/docs/3.0.0-preview2/configuration.html#spark-sql):

![Screen Shot 2020-02-04 at 4 55 10 PM](https://user-images.githubusercontent.com/1039369/73790828-24a5a980-476f-11ea-998c-12cd613883e8.png)

### Why are the changes needed?

There is no visibility into the various Spark SQL configs on [the config docs page](http://spark.apache.org/docs/3.0.0-preview2/configuration.html#spark-sql).

### Does this PR introduce any user-facing change?

No, apart from new documentation.

### How was this patch tested?

I tested this manually by building the docs and reviewing them in my browser.

Closes #27459 from nchammas/SPARK-30510-spark-sql-options.

Authored-by: Nicholas Chammas <nicholas.chammas@liveramp.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-02-09 19:20:47 +09:00
Yuanjian Li 3db3e39f11 [SPARK-28228][SQL] Change the default behavior for name conflict in nested WITH clause
### What changes were proposed in this pull request?
This is a follow-up for #25029. In this PR we throw an AnalysisException when a name conflict is detected in a nested WITH clause. This way, the config `spark.sql.legacy.ctePrecedence.enabled` must be set explicitly for the expected behavior.
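
For illustration, a query of the shape below (assuming an active `SparkSession` named `spark`) now fails with an AnalysisException unless `spark.sql.legacy.ctePrecedence.enabled` is set explicitly:

```scala
// The inner and outer CTEs share the name `t`; Spark no longer picks a precedence silently.
spark.sql(
  """WITH t AS (SELECT 1 AS c),
    |     t2 AS (WITH t AS (SELECT 2 AS c) SELECT * FROM t)
    |SELECT * FROM t2
    |""".stripMargin).show()
```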

### Why are the changes needed?
The original change might be risky to end-users, as it changes behavior silently.

### Does this PR introduce any user-facing change?
Yes, the config `spark.sql.legacy.ctePrecedence.enabled` is now optional and must be set explicitly for the expected behavior.

### How was this patch tested?
New UT.

Closes #27454 from xuanyuanking/SPARK-28228-follow.

Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-02-08 14:10:28 -08:00
Terry Kim a7451f44d2 [SPARK-30614][SQL] The native ALTER COLUMN syntax should change one property at a time
### What changes were proposed in this pull request?

The current ALTER COLUMN syntax allows changing multiple properties at a time:
```
ALTER TABLE table=multipartIdentifier
  (ALTER | CHANGE) COLUMN? column=multipartIdentifier
  (TYPE dataType)?
  (COMMENT comment=STRING)?
  colPosition?
```
The SQL standard (section 11.12) only allows changing one property at a time. This is also true on other recent SQL systems like [snowflake](https://docs.snowflake.net/manuals/sql-reference/sql/alter-table-column.html) and [redshift](https://docs.aws.amazon.com/redshift/latest/dg/r_ALTER_TABLE.html). (credit to cloud-fan)

This PR proposes to change ALTER COLUMN to follow SQL standard, thus allows altering only one column property at a time.

Note that ALTER COLUMN syntax being changed here is newly added in Spark 3.0, so it doesn't affect Spark 2.4 behavior.

### Why are the changes needed?

To follow SQL standard (and other recent SQL systems) behavior.

### Does this PR introduce any user-facing change?

Yes, now the user can update the column properties only one at a time.

For example,
```
ALTER TABLE table1 ALTER COLUMN a.b.c TYPE bigint COMMENT 'new comment'
```
should be broken into
```
ALTER TABLE table1 ALTER COLUMN a.b.c TYPE bigint
ALTER TABLE table1 ALTER COLUMN a.b.c COMMENT 'new comment'
```

### How was this patch tested?

Updated existing tests.

Closes #27444 from imback82/alter_column_one_at_a_time.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-08 02:47:44 +08:00
Maxim Gekk a3e77773cf [SPARK-30752][SQL] Fix to_utc_timestamp on daylight saving day
### What changes were proposed in this pull request?
- Rewrite the `convertTz` method of `DateTimeUtils` using Java 8 time API
- Change the types of the `convertTz` parameters from `TimeZone` to `ZoneId`. This avoids unnecessary `TimeZone` -> `ZoneId` conversions and the resulting performance regressions.

### Why are the changes needed?
- Fixes incorrect behavior of `to_utc_timestamp` on daylight saving day. For example:
```scala
scala> df.select(to_utc_timestamp(lit("2019-11-03T12:00:00"), "Asia/Hong_Kong").as("local UTC")).show
+-------------------+
|          local UTC|
+-------------------+
|2019-11-03 03:00:00|
+-------------------+
```
but the result must be 2019-11-03 04:00:00:
<img width="1013" alt="Screen Shot 2020-02-06 at 20 09 36" src="https://user-images.githubusercontent.com/1580697/73960846-a129bb00-491c-11ea-92f5-45831cb28a62.png">

- Simplifies the code and makes it more maintainable
- Switches `convertTz` to the Proleptic Gregorian calendar used by Java 8 time classes by default, which makes the function consistent with other date-time functions.

### Does this PR introduce any user-facing change?
Yes, after the changes `to_utc_timestamp` returns the correct result `2019-11-03 04:00:00`.

### How was this patch tested?
- By existing test suite `DateTimeUtilsSuite`, `DateFunctionsSuite` and `DateExpressionsSuite`.
- Added `convert time zones on a daylight saving day` to DateFunctionsSuite

Closes #27474 from MaxGekk/port-convertTz-on-Java8-api.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-08 02:32:07 +08:00
Wenchen Fan 5a4c70b4e2 [SPARK-27986][SQL][FOLLOWUP] window aggregate function with filter predicate is not supported
### What changes were proposed in this pull request?

This is a followup of https://github.com/apache/spark/pull/26656.

We don't support window aggregate functions with a filter predicate yet, so we should fail explicitly.

Observable metrics have the same issue; this PR fixes that as well.
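
For example, a query of this shape (assuming an active `SparkSession` named `spark`) is expected to fail analysis instead of silently dropping the FILTER clause:

```scala
// An aggregate with a FILTER clause used as a window function; previously the filter was
// ignored and the result was wrong, now an AnalysisException is expected.
spark.sql(
  """SELECT a, sum(b) FILTER (WHERE b > 0) OVER (PARTITION BY a)
    |FROM VALUES (1, -1), (1, 2) AS t(a, b)
    |""".stripMargin).show()
```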

### Why are the changes needed?

If we simply ignore the filter predicate when we don't support it, the result is wrong.

### Does this PR introduce any user-facing change?

Yes, it fixes the query result.

### How was this patch tested?

new tests

Closes #27476 from cloud-fan/filter.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-02-06 13:33:39 -08:00
Wenchen Fan 8ce58627eb [SPARK-30719][SQL] do not log warning if AQE is intentionally skipped and add a config to force apply
### What changes were proposed in this pull request?

Update `InsertAdaptiveSparkPlan` to not log warning if AQE is skipped intentionally.

This PR also adds a config to not skip AQE.

### Why are the changes needed?

It's not a warning at all if we intentionally skip AQE.

### Does this PR introduce any user-facing change?

no

### How was this patch tested?

Ran `AdaptiveQueryExecSuite` locally and verified that there are no warning logs.

Closes #27452 from cloud-fan/aqe.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
2020-02-06 09:16:14 -08:00
Terry Kim c27a616450 [SPARK-30612][SQL] Resolve qualified column name with v2 tables
### What changes were proposed in this pull request?

This PR fixes the issue where queries with qualified columns like `SELECT t.a FROM t` would fail to resolve for v2 tables.

This PR would allow qualified column names in query as following:
```SQL
SELECT testcat.ns1.ns2.tbl.foo FROM testcat.ns1.ns2.tbl
SELECT ns1.ns2.tbl.foo FROM testcat.ns1.ns2.tbl
SELECT ns2.tbl.foo FROM testcat.ns1.ns2.tbl
SELECT tbl.foo FROM testcat.ns1.ns2.tbl
```

### Why are the changes needed?

This is a bug because you cannot qualify column names in queries.

### Does this PR introduce any user-facing change?

Yes, now users can qualify column names for v2 tables.

### How was this patch tested?

Added new tests.

Closes #27391 from imback82/qualified_col.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-06 13:54:17 +08:00
Yuanjian Li 4938905a1c [SPARK-29864][SQL][FOLLOWUP] Reference the config for the old behavior in error message
### What changes were proposed in this pull request?
Follow-up work for SPARK-29864: reference the config `spark.sql.legacy.fromDayTimeString.enabled` in the error message.

### Why are the changes needed?
For better usability.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Existing tests.

Closes #27464 from xuanyuanking/SPARK-29864-follow.

Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-02-05 11:19:42 -08:00
turbofei 6d507b4a31 [SPARK-26218][SQL][FOLLOW UP] Fix the corner case when casting float to Integer
### What changes were proposed in this pull request?
When spark.sql.ansi.enabled is true, for the statement:
```
select cast(cast(2147483648 as Float) as Integer) //result is 2147483647
```
Its result is 2147483647 and it does not throw an `ArithmeticException`.

The root cause is that the code below does not work for some corner cases.
94fc0e3235/sql/catalyst/src/main/scala/org/apache/spark/sql/types/numerics.scala (L129-L141)

For example:

![image](https://user-images.githubusercontent.com/6757692/72074911-badfde80-332d-11ea-963e-2db0e43c33e8.png)

In this PR, I fix it by comparing Math.floor(x) with Int.MaxValue directly.
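
A self-contained illustration of the corner case (plain Scala, independent of the Spark code linked above):

```scala
val x: Float = 2147483648L.toFloat            // 2.14748365E9, i.e. exactly 2^31

// Old-style bound check: Int.MaxValue is widened to Float and rounds to the same value as x,
// so the overflow is not detected and the narrowing cast silently clamps to Int.MaxValue.
val oldCheckDetectsOverflow = x > Int.MaxValue             // false
val clamped                 = x.toInt                      // 2147483647

// Fixed check: compare Math.floor(x) (a Double) with Int.MaxValue, which a Double represents exactly.
val newCheckDetectsOverflow = Math.floor(x) > Int.MaxValue // true -> ArithmeticException can be thrown

println((clamped, oldCheckDetectsOverflow, newCheckDetectsOverflow))
```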

### Why are the changes needed?
The result is corrupted.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?

Added Unit test.

Closes #27151 from turboFei/SPARK-26218-follow-up-int-overflow.

Authored-by: turbofei <fwang12@ebay.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-05 21:24:02 +08:00
Maxim Gekk 459e757ed4 [SPARK-30668][SQL] Support SimpleDateFormat patterns in parsing timestamps/dates strings
### What changes were proposed in this pull request?
In the PR, I propose to partially revert the commit 51a6ba0181, and provide a legacy parser based on `FastDateFormat` which is compatible with `SimpleDateFormat`.

To enable the legacy parser, set `spark.sql.legacy.timeParser.enabled` to `true`.

### Why are the changes needed?
To allow users to restore the old behavior of parsing timestamps/dates using `SimpleDateFormat` patterns. The main reason for restoring it is that `DateTimeFormatter`'s patterns are not fully compatible with `SimpleDateFormat` patterns, see https://issues.apache.org/jira/browse/SPARK-30668

### Does this PR introduce any user-facing change?
Yes

### How was this patch tested?
- Added new test to `DateFunctionsSuite`
- Restored additional test cases in `JsonInferSchemaSuite`.

Closes #27441 from MaxGekk/support-simpledateformat.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-05 18:48:45 +08:00
HyukjinKwon 692e3ddb4e [SPARK-27870][PYTHON][FOLLOW-UP] Rename spark.sql.pandas.udf.buffer.size to spark.sql.execution.pandas.udf.buffer.size
### What changes were proposed in this pull request?

This PR renames `spark.sql.pandas.udf.buffer.size` to `spark.sql.execution.pandas.udf.buffer.size` to be more consistent with other pandas configuration prefixes, given:
-  `spark.sql.execution.pandas.arrowSafeTypeConversion`
- `spark.sql.execution.pandas.respectSessionTimeZone`
- `spark.sql.legacy.execution.pandas.groupedMap.assignColumnsByName`
- other configurations like `spark.sql.execution.arrow.*`.

### Why are the changes needed?

To make configuration names consistent.

### Does this PR introduce any user-facing change?

No because this configuration was not released yet.

### How was this patch tested?

Existing tests should cover.

Closes #27450 from HyukjinKwon/SPARK-27870-followup.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-02-05 11:38:33 +09:00
Dongjoon Hyun 898716980d Revert "[SPARK-28310][SQL] Support (FIRST_VALUE|LAST_VALUE)(expr[ (IGNORE|RESPECT) NULLS]?) syntax"
### What changes were proposed in this pull request?

This reverts commit b89c3de1a4.

### Why are the changes needed?

`FIRST_VALUE` is used only for window expressions. Please see the discussion on https://github.com/apache/spark/pull/25082 .

### Does this PR introduce any user-facing change?

Yes.

### How was this patch tested?

Pass the Jenkins.

Closes #27458 from dongjoon-hyun/SPARK-28310.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-02-04 17:26:46 -08:00
Liang-Chi Hsieh 7631275f97 [SPARK-25040][SQL][FOLLOWUP] Add legacy config for allowing empty strings for certain types in json parser
### What changes were proposed in this pull request?

This is a follow-up for #22787. In #22787 we disallowed empty strings in the JSON parser except for string and binary types. This follow-up adds a legacy config for restoring the previous behavior of allowing empty strings.

### Why are the changes needed?

Adding a legacy config to make migration easy for Spark users.

### Does this PR introduce any user-facing change?

Yes. If this legacy config is set to true, users can restore the behavior prior to Spark 3.0.0.

### How was this patch tested?

Unit test.

Closes #27456 from viirya/SPARK-25040-followup.

Lead-authored-by: Liang-Chi Hsieh <liangchi@uber.com>
Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-02-04 17:22:23 -08:00
Maxim Gekk f2dd082544 [SPARK-30725][SQL] Make legacy SQL configs as internal configs
### What changes were proposed in this pull request?
All legacy SQL configs are marked as internal configs. In particular, the following configs are updated to internal:
- spark.sql.legacy.sizeOfNull
- spark.sql.legacy.replaceDatabricksSparkAvro.enabled
- spark.sql.legacy.typeCoercion.datetimeToString.enabled
- spark.sql.legacy.looseUpcast
- spark.sql.legacy.arrayExistsFollowsThreeValuedLogic

### Why are the changes needed?
In the general case, users shouldn't change legacy configs, so they can be marked as internal.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Covered by the Jenkins build and existing tests.

Closes #27448 from MaxGekk/legacy-internal-sql-conf.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-04 21:17:05 +08:00
Yuanjian Li a4912cee61
[SPARK-29543][SS][FOLLOWUP] Move spark.sql.streaming.ui.* configs to StaticSQLConf
### What changes were proposed in this pull request?
Put the configs below needed by Structured Streaming UI into StaticSQLConf:

- spark.sql.streaming.ui.enabled
- spark.sql.streaming.ui.retainedProgressUpdates
- spark.sql.streaming.ui.retainedQueries

### Why are the changes needed?
Make all SS UI configs consistent with other similar configs in usage and naming.

### Does this PR introduce any user-facing change?
Yes, add new static config `spark.sql.streaming.ui.retainedProgressUpdates`.

### How was this patch tested?
Existing UT.

Closes #27425 from xuanyuanking/SPARK-29543-follow.

Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Shixiong Zhu <zsxwing@gmail.com>
2020-02-02 23:37:13 -08:00
Burak Yavuz 2eccfd8a73 [SPARK-30697][SQL] Handle database and namespace exceptions in catalog.isView
### What changes were proposed in this pull request?

Adds handling for NoSuchDatabaseException and NoSuchNamespaceException to the `isView` method of SessionCatalog.

### Why are the changes needed?

This method prevents specialized resolutions from kicking in within Analysis when using V2 Catalogs if the identifier is a specialized identifier.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Added test to DataSourceV2SessionCatalogSuite

Closes #27423 from brkyvz/isViewF.

Authored-by: Burak Yavuz <brkyvz@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-03 14:08:59 +08:00
Liang-Chi Hsieh 8eecc20b11 [SPARK-27946][SQL] Hive DDL to Spark DDL conversion USING "show create table"
## What changes were proposed in this pull request?

This patch adds a DDL command `SHOW CREATE TABLE AS SERDE`. It is used to generate Hive DDL for a Hive table.

The original `SHOW CREATE TABLE` now always shows Spark DDL. If given a Hive table, it tries to generate Spark DDL.

For the Hive serde to data source conversion, this uses the existing mapping inside `HiveSerDe`. If it can't find a mapping there, it throws an analysis exception for the unsupported serde configuration.

Arguably, some Hive file format + row serde combinations might be mapped to a Spark data source, e.g., CSV. That is not included in this PR; to be conservative, it may not be supported.

For Hive serde properties, for now this doesn't save them to the Spark DDL, because it may not be useful to keep Hive serde properties in a Spark table.
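
Example usage (assuming an active `SparkSession` with Hive support and an existing Hive table; the table name `hive_tbl` is illustrative):

```scala
// Spark DDL: the plain command now always tries to produce Spark DDL, even for a Hive table.
spark.sql("SHOW CREATE TABLE hive_tbl").show(truncate = false)

// Hive DDL: produced by the new command added in this patch.
spark.sql("SHOW CREATE TABLE hive_tbl AS SERDE").show(truncate = false)
```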

## How was this patch tested?

Added test.

Closes #24938 from viirya/SPARK-27946.

Lead-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Co-authored-by: Liang-Chi Hsieh <liangchi@uber.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
2020-01-31 19:55:25 -08:00
yi.wu 82b4f753a0 [SPARK-30508][SQL] Add SparkSession.executeCommand API for external datasource
### What changes were proposed in this pull request?

This PR adds `SparkSession.executeCommand` API for external datasource to execute a random command like

```
val df = spark.executeCommand("xxxCommand", "xxxSource", "xxxOptions")
```
Note that the command doesn't execute in Spark, but inside an external execution engine, depending on the data source. It is eagerly executed once `executeCommand` is called, and the returned `DataFrame` will contain the output of the command (if any).

### Why are the changes needed?

This can be useful when a user wants to execute commands outside of Spark, for example, executing a custom DDL/DML command for JDBC, creating an index for Elasticsearch, creating cores for Solr, and so on (as HyukjinKwon suggested).

Previously, a user needed to use an option to achieve this, e.g. `spark.read.format("xxxSource").option("command", "xxxCommand").load()`, which is kind of cumbersome. With this change, it is more convenient for users to achieve the same goal.

### Does this PR introduce any user-facing change?

Yes, new API from `SparkSession` and a new interface `ExternalCommandRunnableProvider`.

### How was this patch tested?

Added a new test suite.

Closes #27199 from Ngone51/dev-executeCommand.

Lead-authored-by: yi.wu <yi.wu@databricks.com>
Co-authored-by: Xiao Li <gatorsmile@gmail.com>
Co-authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
2020-01-31 15:05:26 -08:00
Maxim Gekk 2d4b5eaee4 [SPARK-30676][CORE][TESTS] Eliminate warnings from deprecated constructors of java.lang.Integer and java.lang.Double
### What changes were proposed in this pull request?
- Replace `new Integer(0)` by a serializable instance in RDD.scala
- Use `.valueOf()` instead of the constructors of `java.lang.Integer` and `java.lang.Double`, because the constructors have been deprecated, see https://docs.oracle.com/javase/9/docs/api/java/lang/Integer.html
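
A small sketch of the replacement pattern (plain Scala calling the Java boxed types; not the exact Spark code):

```scala
// Deprecated since Java 9: allocates a new wrapper object on every call.
val boxedOld: java.lang.Integer = new java.lang.Integer(0)

// Preferred: valueOf may reuse cached instances and carries no deprecation warning.
val boxedInt: java.lang.Integer = java.lang.Integer.valueOf(0)
val boxedDbl: java.lang.Double  = java.lang.Double.valueOf(0.0)

println((boxedOld, boxedInt, boxedDbl))
```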

### Why are the changes needed?
This fixes the following warnings:
1. RDD.scala:240: constructor Integer in class Integer is deprecated: see corresponding Javadoc for more information.
2. MutableProjectionSuite.scala:63: constructor Integer in class Integer is deprecated: see corresponding Javadoc for more information.
3. UDFSuite.scala:446: constructor Integer in class Integer is deprecated: see corresponding Javadoc for more information.
4. UDFSuite.scala:451: constructor Double in class Double is deprecated: see corresponding Javadoc for more information.
5. HiveUserDefinedTypeSuite.scala:71: constructor Double in class Double is deprecated: see corresponding Javadoc for more information.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
- By RDDSuite, MutableProjectionSuite, UDFSuite and HiveUserDefinedTypeSuite

Closes #27399 from MaxGekk/eliminate-warning-part4.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-01-31 15:03:16 -06:00
yi.wu 5ccbb38a71 [SPARK-29938][SQL][FOLLOW-UP] Improve AlterTableAddPartitionCommand
All credit to Ngone51, Closes #27293.
### What changes were proposed in this pull request?
This PR improves `AlterTableAddPartitionCommand` by:
1. adds an internal config for the partition batch size to avoid hard-coding it
2. reuses `InMemoryFileIndex.bulkListLeafFiles` to perform parallel file listing and improve code reuse

### Why are the changes needed?
Improve code quality.

### Does this PR introduce any user-facing change?
Yes. We renamed `spark.sql.statistics.parallelFileListingInStatsComputation.enabled` to `spark.sql.parallelFileListingInCommands.enabled` as a side effect of this change.

### How was this patch tested?
Pass Jenkins.

Closes #27413 from xuanyuanking/SPARK-29938.

Lead-authored-by: yi.wu <yi.wu@databricks.com>
Co-authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-02-01 01:03:00 +08:00
Burak Yavuz 290a528bff [SPARK-30615][SQL] Introduce Analyzer rule for V2 AlterTable column change resolution
### What changes were proposed in this pull request?

Adds an Analyzer rule to normalize the column names used in V2 AlterTable table changes. We need to handle all ColumnChange operations. We add an extra match statement to future-proof new changes that may be added. This saves downstream consumers (e.g. catalogs) from having to deal with case sensitivity or check that columns exist, etc.

We also fix the behavior of ALTER TABLE CHANGE COLUMN (Hive-style syntax) for adding comments to complex data types. Currently, the data type needs to be provided as part of the Hive-style syntax. This assumes that the data type has changed, when it may not have and the user only wants to add a comment, which then fails in CheckAnalysis.

### Why are the changes needed?

Currently we do not handle case sensitivity correctly for ALTER TABLE ALTER COLUMN operations.

### Does this PR introduce any user-facing change?

No, fixes a bug.

### How was this patch tested?

Introduced v2CommandsCaseSensitivitySuite and added a test around HiveStyle Change columns to PlanResolutionSuite

Closes #27350 from brkyvz/normalizeAlter.

Authored-by: Burak Yavuz <brkyvz@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-01-31 16:41:10 +08:00
Dongjoon Hyun 05be81d69e [SPARK-30192][SQL][FOLLOWUP] Rename SINGLETON to INSTANCE
### What changes were proposed in this pull request?

This PR renames a variable `SINGLETON` to `INSTANCE`.

### Why are the changes needed?

This is a minor change for consistency with the other parts.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Pass the existing tests.

Closes #27409 from dongjoon-hyun/SPARK-30192.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-01-30 22:51:51 -08:00
Burak Yavuz 1cd19ad92d [SPARK-30669][SS] Introduce AdmissionControl APIs for StructuredStreaming
### What changes were proposed in this pull request?

We propose to add new interfaces `SupportsAdmissionControl` and `ReadLimit`. A ReadLimit defines how much data should be read in the next micro-batch. `SupportsAdmissionControl` specifies that a source can rate limit its ingest into the system. The source can tell the system what the user specified as a read limit, and the system can enforce this limit within each micro-batch or impose its own limit if the Trigger is Trigger.Once(), for example.

We then use this interface in FileStreamSource, KafkaSource, and KafkaMicroBatchStream.

### Why are the changes needed?

Sources currently have no information around execution semantics such as whether the stream is being executed in Trigger.Once() mode. This interface will pass this information into the sources as part of planning. With a trigger like Trigger.Once(), the semantics are to process all the data available to the datasource in a single micro-batch. However, this semantic can be broken when data source options such as `maxOffsetsPerTrigger` (in the Kafka source) rate limit the amount of data read for that micro-batch without this interface.

### Does this PR introduce any user-facing change?

DataSource developers can extend this interface for their streaming sources to add admission control into their system and correctly support Trigger.Once().

### How was this patch tested?

Existing tests, as this API is mostly internal

Closes #27380 from brkyvz/rateLimit.

Lead-authored-by: Burak Yavuz <brkyvz@gmail.com>
Co-authored-by: Burak Yavuz <burak@databricks.com>
Signed-off-by: Burak Yavuz <brkyvz@gmail.com>
2020-01-30 22:02:48 -08:00
Wenchen Fan 9f42be25eb [SPARK-29665][SQL] refine the TableProvider interface
### What changes were proposed in this pull request?

Instead of having several overloads of the `getTable` method in `TableProvider`, it's better to have two explicit methods, `inferSchema` and `inferPartitioning`, along with a single `getTable` method that takes everything: schema, partitioning and properties.

This PR also adds a `supportsExternalMetadata` method in `TableProvider`, to indicate whether the source supports external table metadata. If this flag is false:
1. spark.read.schema... is disallowed and fails
2. when we support creating v2 tables in the session catalog, Spark only keeps table properties in the catalog.

### Why are the changes needed?

API improvement.

### Does this PR introduce any user-facing change?

no

### How was this patch tested?

existing tests

Closes #26868 from cloud-fan/provider2.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-01-31 13:37:43 +08:00
Wenchen Fan e5f572af06 [SPARK-30680][SQL] ResolvedNamespace does not require a namespace catalog
### What changes were proposed in this pull request?

Update `ResolvedNamespace` to accept catalog as `CatalogPlugin` not `SupportsNamespaces`.

This is extracted from https://github.com/apache/spark/pull/27345

### Why are the changes needed?

Not all commands that need to resolve namespaces require a namespace catalog. For example, `SHOW TABLES` is implemented by `TableCatalog.listTables` and has nothing to do with `SupportsNamespaces`.

### Does this PR introduce any user-facing change?

no

### How was this patch tested?

existing tests

Closes #27403 from cloud-fan/ns.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-01-30 10:34:59 -08:00
Wenchen Fan 7503e76af0 [SPARK-30622][SQL] commands should return dummy statistics
### What changes were proposed in this pull request?

Override `Command.stats` to return dummy statistics (`Long.MaxValue` as the size).
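
A self-contained sketch of the idea (illustrative class names, not the real Catalyst types):

```scala
// Commands are eagerly executed, so their statistics never influence planning; returning a
// constant "huge" size avoids walking a command's children to compute real statistics.
case class Statistics(sizeInBytes: BigInt)

trait LogicalPlan { def stats: Statistics }

trait Command extends LogicalPlan {
  override def stats: Statistics = Statistics(sizeInBytes = BigInt(Long.MaxValue))
}
```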

### Why are the changes needed?

Commands are eagerly executed. They will be converted to LocalRelation after the DataFrame is created. That said, the statistics of a command are useless. We should avoid unnecessary statistics calculation over a command's children.

### Does this PR introduce any user-facing change?

no

### How was this patch tested?

new test

Closes #27344 from cloud-fan/command.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-01-30 10:27:35 -08:00
Kazuaki Ishizaki b0db6231fd [SPARK-29020][FOLLOWUP][SQL] Update description of array_sort function
### What changes were proposed in this pull request?

This PR is a follow-up of #25728. #25728 introduces additional arguments to determine sort order. Thus, this function does not sort only in ascending order. However, the description was not updated.
This PR updates the description to follow the latest feature.

### Why are the changes needed?

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Existing tests since this PR just updates description text.

Closes #27404 from kiszk/SPARK-29020-followup.

Authored-by: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-01-30 09:41:32 -08:00
uncleGen 7173786153
[SPARK-29543][SS][UI] Structured Streaming Web UI
### What changes were proposed in this pull request?

This PR adds two pages to Web UI for Structured Streaming:
   - "/streamingquery": Streaming Query Page, providing some aggregate information for running/completed streaming queries.
  - "/streamingquery/statistics": Streaming Query Statistics Page, providing detailed information for streaming query, including `Input Rate`, `Process Rate`, `Input Rows`, `Batch Duration` and `Operation Duration`

![Screen Shot 2020-01-29 at 1 38 00 PM](https://user-images.githubusercontent.com/1000778/73399837-cd01cc80-429c-11ea-9d4b-1d200a41b8d5.png)
![Screen Shot 2020-01-29 at 1 39 16 PM](https://user-images.githubusercontent.com/1000778/73399838-cd01cc80-429c-11ea-8185-4e56db6866bd.png)

### Why are the changes needed?

It helps users to better monitor Structured Streaming query.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

- new added and existing UTs
- manual test

Closes #26201 from uncleGen/SPARK-29543.

Lead-authored-by: uncleGen <hustyugm@gmail.com>
Co-authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Co-authored-by: Genmao Yu <hustyugm@gmail.com>
Signed-off-by: Shixiong Zhu <zsxwing@gmail.com>
2020-01-29 13:43:51 -08:00
Takeshi Yamamuro ec1fb6b4e1 [SPARK-30234][SQL][FOLLOWUP] Add .enabled in the suffix of the ADD FILE legacy option
### What changes were proposed in this pull request?

This PR intends to rename `spark.sql.legacy.addDirectory.recursive` into `spark.sql.legacy.addDirectory.recursive.enabled`.

### Why are the changes needed?

For consistent option names.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

N/A

Closes #27372 from maropu/SPARK-30234-FOLLOWUP.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-01-29 12:23:59 +09:00
Dongjoon Hyun 580c2b7e34 [SPARK-27166][SQL][FOLLOWUP] Refactor to build string once
### What changes were proposed in this pull request?

This is a follow-up for https://github.com/apache/spark/pull/24098 to refactor the code to build the string once, according to the [review comment](https://github.com/apache/spark/pull/24098#discussion_r369845234)

### Why are the changes needed?

Previously, we chose the minimal-change approach.
In this PR, we choose a more robust approach than the previous post-step string processing.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

The test case is extended with more cases.

Closes #27353 from dongjoon-hyun/SPARK-27166-2.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-01-28 12:48:16 -08:00
Maxim Gekk 8aebc80e0e [SPARK-30625][SQL] Support escape as third parameter of the like function
### What changes were proposed in this pull request?
In the PR, I propose to transform the `Like` expression to `TernaryExpression`, and add third parameter `escape`. So, the `like` function will have feature parity with `LIKE ... ESCAPE` syntax supported by 187f3c1773.

### Why are the changes needed?
The `like` functions can be called with 2 or 3 parameters, and functionally equivalent to `LIKE` and `LIKE ... ESCAPE` SQL expressions.

### Does this PR introduce any user-facing change?
Yes, before `like` fails with the exception:
```sql
spark-sql> SELECT like('_Apache Spark_', '__%Spark__', '_');
Error in query: Invalid number of arguments for function like. Expected: 2; Found: 3; line 1 pos 7
```
After:
```sql
spark-sql> SELECT like('_Apache Spark_', '__%Spark__', '_');
true
```

### How was this patch tested?
- Add new example for the `like` function which is checked by `SQLQuerySuite`
- Run `RegexpExpressionsSuite` and `ExpressionParserSuite`.

Closes #27355 from MaxGekk/like-3-args.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-01-27 11:19:32 -08:00
Patrick Cording c5c580ba0d [SPARK-30633][SQL] Append L to seed when type is LongType
### What changes were proposed in this pull request?

Allow using longs as the seed for xxHash.

### Why are the changes needed?

Codegen fails when passing a seed to xxHash that is > 2^31.
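
A minimal sketch of the fix in isolation (the helper name is illustrative):

```scala
// A long seed spliced into generated Java source as a bare integer literal fails to compile
// once it exceeds Int.MaxValue; appending the "L" suffix makes it a valid Java long literal.
def seedLiteral(seed: Long): String = s"${seed}L"

println(seedLiteral(1L << 33))   // "8589934592L" instead of the non-compiling "8589934592"
```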

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Existing tests pass. Should more be added?

Closes #27354 from patrickcording/fix_xxhash_seed_bug.

Authored-by: Patrick Cording <patrick.cording@datarobot.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-01-27 10:32:15 -08:00
Jungtaek Lim (HeartSaVioR) 0436b3d3f8 [SPARK-30653][INFRA][SQL] EOL character enforcement for java/scala/xml/py/R files
### What changes were proposed in this pull request?

This patch converts CR/LF into LF in 3 source files; most other files already use only LF. This patch also adds rules to enforce LF as the EOL for all java, scala, xml, py, and R files.

### Why are the changes needed?

The majority of source code files use LF and only three files use CR/LF. While an IDE would let us not worry about the difference, there is still a chance of an unnecessary diff if a file is modified with an editor which doesn't handle it automatically.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

```
grep -IUrl --color "^M" . | grep "\.java\|\.scala\|\.xml\|\.py\|\.R" | grep -v "/target/" | grep -v "/build/" | grep -v "/dist/" | grep -v "dependency-reduced-pom.xml" | grep -v ".pyc"
```

(Please note you'll need to type CTRL+V -> CTRL+M in bash shell to get `^M` because it's representing CR/LF, not a combination of `^` and `M`.)

Before the patch, the result is:

```
./sql/core/src/main/java/org/apache/spark/sql/execution/columnar/ColumnDictionary.java
./sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/optimizer/complexTypesSuite.scala
./sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/ComplexTypes.scala
```

and after the patch, the result is None.

And git shows WARNING message if EOL of any of source files in given types are modified to CR/LF, like below:

```
warning: CRLF will be replaced by LF in sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala.
The file will have its original line endings in your working directory.
```

Closes #27365 from HeartSaVioR/MINOR-remove-CRLF-in-source-codes.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-01-27 10:20:51 -08:00
Yuchen Huo d0800fc8e2 [SPARK-30314] Add identifier and catalog information to DataSourceV2Relation
### What changes were proposed in this pull request?

Add identifier and catalog information in DataSourceV2Relation so it would be possible to do richer checks in checkAnalysis step.

### Why are the changes needed?

In data source v2, table implementations are all customized, so we may not be able to get the resolved identifier from the tables themselves. Therefore we encode the table and catalog information in DSV2Relation, so no external changes are needed to make sure this information is available.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Unit tests in the following suites:
CatalogManagerSuite.scala
CatalogV2UtilSuite.scala
SupportsCatalogOptionsSuite.scala
PlanResolutionSuite.scala

Closes #26957 from yuchenhuo/SPARK-30314.

Authored-by: Yuchen Huo <yuchen.huo@databricks.com>
Signed-off-by: Burak Yavuz <brkyvz@gmail.com>
2020-01-26 12:59:24 -08:00
Xiao Li d69ed9afdf Revert "[SPARK-25496][SQL] Deprecate from_utc_timestamp and to_utc_timestamp"
This reverts commit 1d20d13149.

Closes #27351 from gatorsmile/revertSPARK25496.

Authored-by: Xiao Li <gatorsmile@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-01-25 21:34:12 -08:00
Liang-Chi Hsieh a0e63b61e7 [SPARK-29721][SQL] Prune unnecessary nested fields from Generate without Project
### What changes were proposed in this pull request?

This patch proposes to prune unnecessary nested fields from Generate which has no Project on top of it.

### Why are the changes needed?

In the Optimizer, we can prune nested columns from Project(projectList, Generate). However, unnecessary columns could still be read in Generate if there is no Project on top of it. We should prune them too.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Unit test.

Closes #26978 from viirya/SPARK-29721.

Lead-authored-by: Liang-Chi Hsieh <liangchi@uber.com>
Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-01-24 22:17:28 -08:00
Gengliang Wang ed44926117 [SPARK-30627][SQL] Disable all the V2 file sources by default
### What changes were proposed in this pull request?

Disable all the V2 file sources in Spark 3.0 by default.

### Why are the changes needed?

There are still some missing parts in the file source V2 framework:
1. It doesn't support reporting file scan metrics such as "numOutputRows"/"numFiles"/"fileSize" like `FileSourceScanExec`. This requires another patch in the data source V2 framework. Tracked by [SPARK-30362](https://issues.apache.org/jira/browse/SPARK-30362)
2. It doesn't support partition pruning with subqueries(including dynamic partition pruning) for now. Tracked by [SPARK-30628](https://issues.apache.org/jira/browse/SPARK-30628)

As we are going to code freeze on Jan 31st, this PR proposes to disable all the V2 file sources in Spark 3.0 by default.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Existing tests.

Closes #27348 from gengliangwang/disableFileSourceV2.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-01-23 21:42:43 -08:00
Xiao Li 3f76bd4002 [SPARK-27083][SQL][FOLLOW-UP] Rename spark.sql.subquery.reuse to spark.sql.execution.subquery.reuse.enabled
### What changes were proposed in this pull request?
This PR is to rename spark.sql.subquery.reuse to spark.sql.execution.subquery.reuse.enabled

### Why are the changes needed?
Make it consistent and clear.

### Does this PR introduce any user-facing change?
N/A. This is a [new conf added in Spark 3.0](https://github.com/apache/spark/pull/23998)

### How was this patch tested?
N/A

Closes #27346 from gatorsmile/spark27083.

Authored-by: Xiao Li <gatorsmile@gmail.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
2020-01-23 15:34:54 -08:00
Kent Yao 3228d723a4 [SPARK-30603][SQL] Move RESERVED_PROPERTIES from SupportsNamespaces and TableCatalog to CatalogV2Util
### What changes were proposed in this pull request?
In this PR, I propose to move the `RESERVED_PROPERTIES` constants from `SupportsNamespaces` and `TableCatalog` to `CatalogV2Util`, which keeps `RESERVED_PROPERTIES` safe for internal usage only.

### Why are the changes needed?

The `RESERVED_PROPERTIES` should not be changed by subclasses.

### Does this PR introduce any user-facing change?

no

### How was this patch tested?

existing uts

Closes #27318 from yaooqinn/SPARK-30603.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-01-23 13:13:25 -08:00
Wenchen Fan 976946a910 [SPARK-29947][SQL][FOLLOWUP] Fix table lookup cache
### What changes were proposed in this pull request?

Fix a bug in https://github.com/apache/spark/pull/26589 , to make this feature work.

### Why are the changes needed?

This feature doesn't actually work.

### Does this PR introduce any user-facing change?

no

### How was this patch tested?

new test

Closes #27341 from cloud-fan/cache.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-01-23 13:02:10 -08:00
Wenchen Fan 3c8b3609a1 [SPARK-30620][SQL] avoid unnecessary serialization in AggregateExpression
### What changes were proposed in this pull request?

Expressions are very likely to be serialized and sent to executors, so we should avoid unnecessary serialization overhead as much as we can.

This PR fixes `AggregateExpression`.

### Why are the changes needed?

small improvement

### Does this PR introduce any user-facing change?

no

### How was this patch tested?

existing tests

Closes #27342 from cloud-fan/fix.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-01-24 01:15:57 +08:00
Yuanjian Li 3d7359ad42 [SPARK-29175][SQL][FOLLOW-UP] Rename the config name to spark.sql.maven.additionalRemoteRepositories
### What changes were proposed in this pull request?
Rename the config added in #25849 to `spark.sql.maven.additionalRemoteRepositories`.

### Why are the changes needed?
Following the advice in [SPARK-29175](https://issues.apache.org/jira/browse/SPARK-29175?focusedCommentId=17021586&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17021586), the new name is clearer.

### Does this PR introduce any user-facing change?
Yes, the config name changed.

### How was this patch tested?
Existing test.

Closes #27339 from xuanyuanking/SPARK-29175.

Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-01-23 08:35:32 -08:00
Wenchen Fan dbed4c72f3 [SPARK-30605][SQL] move defaultNamespace from SupportsNamespace to CatalogPlugin
### What changes were proposed in this pull request?

Move the `defaultNamespace` method from the interface `SupportsNamespace` to `CatalogPlugin`.

### Why are the changes needed?

While implementing JDBC V2, I realized that the default namespace is a very important piece of information. Even if you don't want to implement namespace manipulation functionality like CREATE/DROP/ALTER namespace, you still need to report the default namespace.

The default namespace is not about functionality but a matter of correctness. If you don't know the default namespace of a catalog, it's wrong to assume it's `[]`. You may get a table-not-found exception if you do so.

I think it's more reasonable to put the `defaultNamespace` method in the base class `CatalogPlugin`. It returns `[]` by default, so it won't bother implementations that really don't have a namespace concept.
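
A sketch of how a catalog implementation might use this (the JDBC-style catalog is hypothetical; the method shape follows the description above):

```scala
import org.apache.spark.sql.connector.catalog.CatalogPlugin
import org.apache.spark.sql.util.CaseInsensitiveStringMap

// Hypothetical catalog whose backing database resolves unqualified tables against "public".
class ExampleJdbcCatalog extends CatalogPlugin {
  private var catalogName: String = _

  override def initialize(name: String, options: CaseInsensitiveStringMap): Unit = {
    catalogName = name
  }

  override def name(): String = catalogName

  // Without this override the base interface reports an empty namespace, which would be
  // wrong here and could lead to table-not-found errors for unqualified identifiers.
  override def defaultNamespace(): Array[String] = Array("public")
}
```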

### Does this PR introduce any user-facing change?

yes, but for an unreleased API.

### How was this patch tested?

existing tests

Closes #27319 from cloud-fan/ns.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-01-23 21:56:54 +08:00
Xiao Li ffd435b57f [SPARK-27871][SQL][FOLLOW-UP] Remove the conf spark.sql.optimizer.reassignLambdaVariableID.enabled
### What changes were proposed in this pull request?
This PR is to remove the conf

### Why are the changes needed?
This rule can be excluded using spark.sql.optimizer.excludedRules without an extra conf

### Does this PR introduce any user-facing change?
Yes

### How was this patch tested?
N/A

Closes #27334 from gatorsmile/spark27871.

Authored-by: Xiao Li <gatorsmile@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-01-23 21:41:56 +08:00
Burak Yavuz db528e4fe1 [SPARK-30535][SQL] Revert "[] Migrate ALTER TABLE commands to the new framework
### What changes were proposed in this pull request?

This reverts commit b5cb9abdd5.

### Why are the changes needed?

The merged commit (#27243) was too risky for several reasons:
 1. It doesn't fix a bug
 2. It makes the resolution of the table that's going to be altered a child. We had avoided this on purpose, as having an arbitrary rule change the child of AlterTable seemed risky. This change alone is a big -1 for me.
 3. While the code may look cleaner, I think this approach makes certain things harder, e.g. differentiating between the Hive based Alter table CHANGE COLUMN and ALTER COLUMN syntax. Resolving and normalizing columns for ALTER COLUMN also becomes a bit harder, as we now have to check every single AlterTable command instead of just a single ALTER TABLE ALTER COLUMN statement

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Existing unit tests

This closes #27315

Closes #27327 from brkyvz/revAlter.

Authored-by: Burak Yavuz <brkyvz@gmail.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
2020-01-22 22:43:46 -08:00
Tathagata Das d2bca8ff70 [SPARK-30609] Allow default merge command resolution to be bypassed by DSv2 tables
### What changes were proposed in this pull request?
Skip resolving the merge expressions if the target is a DSv2 table with ACCEPT_ANY_SCHEMA capability.

### Why are the changes needed?
Some DSv2 sources may want to customize the merge resolution logic. For example, a table that can accept any schema (TableCapability.ACCEPT_ANY_SCHEMA) may want to allow certain merge queries that are blocked (that is, rejected with an analysis error) by the default resolution logic. So there should be a way to completely bypass the merge resolution logic in the Analyzer.

### Does this PR introduce any user-facing change?
No, since merge itself is an unreleased feature

### How was this patch tested?
added unit test to specifically test the skipping.

Closes #27326 from tdas/SPARK-30609.

Authored-by: Tathagata Das <tathagata.das1565@gmail.com>
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
2020-01-22 19:20:25 -08:00
Ajith bbab2bb961 [SPARK-30556][SQL] Copy sparkContext.localproperties to child thread inSubqueryExec.executionContext
### What changes were proposed in this pull request?
In `org.apache.spark.sql.execution.SubqueryExec#relationFuture` make a copy of `org.apache.spark.SparkContext#localProperties` and pass it to the sub-execution thread in `org.apache.spark.sql.execution.SubqueryExec#executionContext`

### Why are the changes needed?
Local properties set via the SparkContext are not available as TaskContext properties when executing jobs and the thread pools have idle threads which are reused.

Explanation:
When `SubqueryExec` runs, the relationFuture is evaluated on a separate thread. The threads inherit the `localProperties` from the `sparkContext` because they are its child threads.
These threads are created in the `executionContext` (thread pools). Each thread pool has a default keepAliveSeconds of 60 seconds for idle threads.
In scenarios where the thread pool has idle threads that are reused for a subsequent new query, the thread-local properties will not be inherited from the SparkContext (thread properties are inherited only on thread creation), so the threads end up with old or no properties set. This causes TaskSet properties to be missing when properties are transferred by the child thread via `sparkContext.runJob/submitJob`.
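
A self-contained sketch of the fix pattern described above (names are illustrative; the real change copies `SparkContext#localProperties` and hands the copy to the thread running the subquery):

```scala
import java.util.Properties
import scala.concurrent.{ExecutionContext, Future}

// Thread-local holder standing in for SparkContext#localProperties.
object LocalProps {
  val holder: InheritableThreadLocal[Properties] =
    new InheritableThreadLocal[Properties] { override def initialValue(): Properties = new Properties() }
}

def relationFuture[T](work: => T)(implicit ec: ExecutionContext): Future[T] = {
  // Cloned on the caller thread, where the properties are known to be correct.
  val captured = LocalProps.holder.get().clone().asInstanceOf[Properties]
  Future {
    // A reused pool thread may still carry properties from an earlier query (inheritance only
    // happens at thread creation), so overwrite them with the captured copy before running.
    LocalProps.holder.set(captured)
    work
  }
}
```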

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Added UT

Closes #27267 from ajithme/subquerylocalprop.

Authored-by: Ajith <ajith2489@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-01-22 18:21:11 -08:00
Maxim Gekk 4ca31b470f [SPARK-30606][SQL] Fix the like function with 2 parameters
### What changes were proposed in this pull request?
In the PR, I propose to add an additional constructor to the `Like` expression. The constructor is used when the `like` function is applied with 2 parameters.

### Why are the changes needed?
`FunctionRegistry` cannot find a constructor if the `like` function is applied to 2 parameters.

### Does this PR introduce any user-facing change?
Yes, before:
```sql
spark-sql> SELECT like('Spark', '_park');

Invalid arguments for function like; line 1 pos 7
org.apache.spark.sql.AnalysisException: Invalid arguments for function like; line 1 pos 7
	at org.apache.spark.sql.catalyst.analysis.FunctionRegistry$.$anonfun$expression$7(FunctionRegistry.scala:618)
	at scala.Option.getOrElse(Option.scala:189)
	at org.apache.spark.sql.catalyst.analysis.FunctionRegistry$.$anonfun$expression$4(FunctionRegistry.scala:602)
	at org.apache.spark.sql.catalyst.analysis.SimpleFunctionRegistry.lookupFunction(FunctionRegistry.scala:121)
	at org.apache.spark.sql.catalyst.catalog.SessionCatalog.lookupFunction(SessionCatalog.scala:1412)
```
After:
```sql
spark-sql> SELECT like('Spark', '_park');
true
```

### How was this patch tested?
By running `check outputs of expression examples` from `SQLQuerySuite`.

Closes #27323 from MaxGekk/fix-like-func.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-01-22 15:40:24 -08:00
Kent Yao 8e280cebf2 [SPARK-30592][SQL] Interval support for csv and json funtions
### What changes were proposed in this pull request?

In this PR, I'd propose to fully support interval for the CSV and JSON functions.

On one hand, CSV and JSON records consist of string values, and since the cast logic can now cast strings from/to intervals, we can make those functions support intervals easily.

Before this change we can only use this as a workaround.
```sql
SELECT cast(from_csv('1, 1 day', 'a INT, b string').b as interval)
struct<CAST(from_csv(1, 1 day).b AS INTERVAL):interval>
1 days
```

On the other hand, we ban reading or writing intervals from CSV and JSON files. To directly read and write with external JSON/CSV storage, you still need an explicit cast, e.g.
```scala
spark.read.schema("a string").json("a.json").selectExpr("cast(a as interval)").show
+------+
|     a|
+------+
|1 days|
+------+
```

### Why are the changes needed?

For future-proofing interval support.

### Does this PR introduce any user-facing change?

yes, the `to_json`/`from_json` function can deal with intervals now. e.g.
for `from_json` there is no such use case because we do not support `a interval`
for `to_json`, we can use interval values now

#### before

 ```sql

 SELECT to_json(map('a', interval 25 month 100 day 130 minute));
Error in query: cannot resolve 'to_json(map('a', INTERVAL '2 years 1 months 100 days 2 hours 10 minutes'))' due to data type mismatch: Unable to convert column a of type interval to JSON.; line 1 pos 7;
'Project [unresolvedalias(to_json(map(a, 2 years 1 months 100 days 2 hours 10 minutes), Some(Asia/Shanghai)), None)]
+- OneRowRelation
```
#### after
```sql
SELECT to_json(map('a', interval 25 month 100 day 130 minute))
{"a":"2 years 1 months 100 days 2 hours 10 minutes"}

```
### How was this patch tested?

add ut

Closes #27317 from yaooqinn/SPARK-30592.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-01-23 00:41:46 +08:00
Wenchen Fan b8cb52a8a7 [SPARK-30555][SQL] MERGE INTO insert action should only access columns from source table
### What changes were proposed in this pull request?

When resolving the `Assignment` of the insert action in MERGE INTO, only resolve against the source table, to avoid an ambiguous-attribute failure if there is a same-name column in the target table.

### Why are the changes needed?

The insert action is used when NOT MATCHED, so it can't access the row from the target table anyway.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

new tests

Closes #27265 from cloud-fan/merge.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-01-22 21:31:11 +08:00
Kent Yao f2d71f5838 [SPARK-30591][SQL] Remove the nonstandard SET OWNER syntax for namespaces
### What changes were proposed in this pull request?

This PR removes the nonstandard `SET OWNER` syntax for namespaces and changes the owner reserved properties from `ownerName` and `ownerType` to `owner`.

### Why are the changes needed?

The `SET OWNER` syntax for namespaces is Hive-specific and not SQL standard; we need a more future-proof design before we implement user-facing changes for SQL security issues.

### Does this PR introduce any user-facing change?

No, this just reverts a non-public syntax.

### How was this patch tested?

modified uts

Closes #27300 from yaooqinn/SPARK-30591.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-01-22 16:00:05 +08:00
fuwhu cfb1706eaa [SPARK-15616][SQL] Add optimizer rule PruneHiveTablePartitions
### What changes were proposed in this pull request?
Add an optimizer rule, PruneHiveTablePartitions, that prunes Hive table partitions based on filters on the partition columns.
With pruning applied, the total size of the remaining partitions may be small enough for a broadcast join in the JoinSelection strategy.

### Why are the changes needed?
In the JoinSelection strategy, Spark uses "plan.stats.sizeInBytes" to decide whether the plan is suitable for a broadcast join.
Currently, "plan.stats.sizeInBytes" does not take pruned partitions into account, so it may miss some broadcast joins and use a sort-merge join instead, which will definitely impact join performance.
This PR aims at taking pruned partitions into account for Hive tables in "plan.stats.sizeInBytes" and then improving performance by using a broadcast join when possible.

### Does this PR introduce any user-facing change?
no

### How was this patch tested?
Added unit tests.

This is based on #25919, credits should go to lianhuiwang and advancedxy.

Closes #26805 from fuwhu/SPARK-15616.

Authored-by: fuwhu <bestwwg@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-01-21 21:26:30 +08:00
yi.wu ff39c9271c [SPARK-30252][SQL] Disallow negative scale of Decimal
### What changes were proposed in this pull request?

This PR proposes to disallow a negative `scale` of `Decimal` in Spark, and it brings two behavior changes:

1) for literals like `1.23E4BD` or `1.23E4`(with `spark.sql.legacy.exponentLiteralAsDecimal.enabled`=true, see [SPARK-29956](https://issues.apache.org/jira/browse/SPARK-29956)), we set its `(precision, scale)` to (5, 0) rather than (3, -2);
2) add negative `scale` check inside the decimal method if it exposes to set `scale` explicitly. If check fails, `AnalysisException` throws.

And user could still use `spark.sql.legacy.allowNegativeScaleOfDecimal.enabled` to restore the previous behavior.

### Why are the changes needed?

According to SQL standard,
> 4.4.2 Characteristics of numbers
An exact numeric type has a precision P and a scale S. P is a positive integer that determines the number of significant digits in a particular radix R, where R is either 2 or 10. S is a non-negative integer.

The scale of Decimal should always be non-negative, and other mainstream databases, like Presto and PostgreSQL, also don't allow a negative scale.

Presto:
```
presto:default> create table t (i decimal(2, -1));
Query 20191213_081238_00017_i448h failed: line 1:30: mismatched input '-'. Expecting: <integer>, <type>
create table t (i decimal(2, -1))
```

PostgreSQL:
```
postgres=# create table t(i decimal(2, -1));
ERROR:  NUMERIC scale -1 must be between 0 and precision 2
LINE 1: create table t(i decimal(2, -1));
                         ^
```

In fact, Spark itself already disallows creating a table with a negative-scale decimal type using SQL:
```
scala> spark.sql("create table t(i decimal(2, -1))");
org.apache.spark.sql.catalyst.parser.ParseException:
no viable alternative at input 'create table t(i decimal(2, -'(line 1, pos 28)

== SQL ==
create table t(i decimal(2, -1))
----------------------------^^^

  at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:263)
  at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:130)
  at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:48)
  at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:76)
  at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:605)
  at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111)
  at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:605)
  ... 35 elided
```

However, it is still possible to create such a table or `DataFrame` using the Spark SQL programming API:
```
scala> import org.apache.spark.sql.catalyst.TableIdentifier
scala> import org.apache.spark.sql.catalyst.catalog._
scala> import org.apache.spark.sql.types._
scala> val tb = CatalogTable(
     |   TableIdentifier("test", None),
     |   CatalogTableType.MANAGED,
     |   CatalogStorageFormat.empty,
     |   StructType(StructField("i", DecimalType(2, -1)) :: Nil))
```
```
scala> spark.sql("SELECT 1.23E4BD")
res2: org.apache.spark.sql.DataFrame = [1.23E+4: decimal(3,-2)]
```
These two inconsistent behaviors could confuse users.

On the other hand, even if a user creates such a table or `DataFrame` with a negative-scale decimal type, the data cannot be written out with formats like `parquet` or `orc`, because those formats have their own checks for negative scale and fail on it.
```
scala> spark.sql("SELECT 1.23E4BD").write.saveAsTable("parquet")
19/12/13 17:37:04 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.IllegalArgumentException: Invalid DECIMAL scale: -2
	at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:53)
	at org.apache.parquet.schema.Types$BasePrimitiveBuilder.decimalMetadata(Types.java:495)
	at org.apache.parquet.schema.Types$BasePrimitiveBuilder.build(Types.java:403)
	at org.apache.parquet.schema.Types$BasePrimitiveBuilder.build(Types.java:309)
	at org.apache.parquet.schema.Types$Builder.named(Types.java:290)
	at org.apache.spark.sql.execution.datasources.parquet.SparkToParquetSchemaConverter.convertField(ParquetSchemaConverter.scala:428)
	at org.apache.spark.sql.execution.datasources.parquet.SparkToParquetSchemaConverter.convertField(ParquetSchemaConverter.scala:334)
	at org.apache.spark.sql.execution.datasources.parquet.SparkToParquetSchemaConverter.$anonfun$convert$2(ParquetSchemaConverter.scala:326)
	at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
	at scala.collection.Iterator.foreach(Iterator.scala:941)
	at scala.collection.Iterator.foreach$(Iterator.scala:941)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
	at scala.collection.IterableLike.foreach(IterableLike.scala:74)
	at scala.collection.IterableLike.foreach$(IterableLike.scala:73)
	at org.apache.spark.sql.types.StructType.foreach(StructType.scala:99)
	at scala.collection.TraversableLike.map(TraversableLike.scala:238)
	at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
	at org.apache.spark.sql.types.StructType.map(StructType.scala:99)
	at org.apache.spark.sql.execution.datasources.parquet.SparkToParquetSchemaConverter.convert(ParquetSchemaConverter.scala:326)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.init(ParquetWriteSupport.scala:97)
	at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:388)
	at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:349)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.<init>(ParquetOutputWriter.scala:37)
	at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anon$1.newInstance(ParquetFileFormat.scala:150)
	at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:124)
	at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.<init>(FileFormatDataWriter.scala:109)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:264)
	at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$15(FileFormatWriter.scala:205)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:127)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:441)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:444)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
```

So it would be better to disallow negative scale entirely and make the behaviors above consistent.

### Does this PR introduce any user-facing change?

Yes. If `spark.sql.legacy.allowNegativeScaleOfDecimal.enabled=false`, users can no longer create Decimal values with a negative scale.

### How was this patch tested?

Added new tests in `ExpressionParserSuite` and `DecimalSuite`;
Updated `SQLQueryTestSuite`.

Closes #26881 from Ngone51/nonnegative-scale.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-01-21 21:09:48 +08:00
Kent Yao af705421db [SPARK-30593][SQL] Revert interval ISO/ANSI SQL Standard output since we decide not to follow ANSI and no round trip
### What changes were proposed in this pull request?

This PR reverts https://github.com/apache/spark/pull/26418 and files a new ticket under https://issues.apache.org/jira/browse/SPARK-30546 to better track interval behavior.

### Why are the changes needed?

Revert the interval ISO/ANSI SQL standard output, since we have decided not to follow ANSI here and there is no round trip.

### Does this PR introduce any user-facing change?

No, this has not been released yet.

### How was this patch tested?

Existing UTs.

Closes #27304 from yaooqinn/SPARK-30593.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-01-21 20:51:10 +08:00
yi.wu 78df532556 [SPARK-30433][SQL][FOLLOW-UP] Optimize collect conflict plans
### What changes were proposed in this pull request?

For a LogicalPlan (e.g. `MultiInstanceRelation`, `Project`, `Aggregate`, etc.) whose output does not inherit directly from its children, we can simply stop the collection at that node, because we can always replace all the conflicting attributes below it with the new attributes from the new plan.

Otherwise, we should keep collecting conflicting plans recursively, as for `Generate` and `Window`.
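
A minimal sketch of the idea, using simplified stand-in classes rather than Spark's actual plan API (names and structure below are assumptions):

```scala
// Simplified model: a node that produces a fresh output (Project, Aggregate,
// MultiInstanceRelation, ...) is a stopping point, because rewriting its new
// attributes also covers any conflicting attributes below it. Nodes such as
// Generate or Window partially inherit their output, so we keep recursing.
case class Node(name: String, producesFreshOutput: Boolean,
                output: Set[String], children: Seq[Node] = Nil)

def collectConflictPlans(node: Node, conflicting: Set[String]): Seq[Node] = {
  val conflictsHere = node.output.exists(conflicting)
  if (conflictsHere && node.producesFreshOutput) {
    Seq(node)  // stop here: replacing this node's attributes is enough
  } else {
    node.children.flatMap(collectConflictPlans(_, conflicting)) ++
      (if (conflictsHere) Seq(node) else Nil)  // keep collecting recursively
  }
}
```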

### Why are the changes needed?

Performance improvement.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Passes existing tests.

Closes #27263 from Ngone51/spark_30433_followup.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-01-21 14:23:55 +08:00
Kent Yao 0388b7a3ec [SPARK-30568][SQL] Invalidate interval type as a field table schema
### What changes were proposed in this pull request?

After commit d67b98ea01, we are able to create or alter a table with interval column types as long as the external catalog accepts them, which deviates from the interval type's intended internal-only usage. Per d67b98ea01's original purpose, the interval type should only be reachable through cast logic.

Instead of adding an interval type checker command by command to make this work across catalogs, it is much simpler to treat interval as an invalid data type for table schemas while keeping it reachable through cast only.
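
A sketch of the intended behavior after this change (the exact error and accepted SQL are assumptions, not the PR's test code):

```scala
// Interval is rejected when used as a column type in a table schema...
spark.sql("CREATE TABLE t (i INTERVAL) USING parquet")  // expected to fail analysis
// ...but stays reachable through cast logic, its intended internal usage.
spark.sql("SELECT CAST('1 day 2 hours' AS INTERVAL)").show()
```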

### Why are the changes needed?

Enforce the interval type's internal-only usage.

### Does this PR introduce any user-facing change?

No.
Additionally, this PR restores the user-facing behavior when the interval type is used to create or alter a table schema, e.g. for the Hive catalog.

For 2.4:
```java
Caused by: org.apache.spark.sql.catalyst.parser.ParseException:
DataType calendarinterval is not supported.(line 1, pos 0)
```
For master after d67b98ea01:
```java
Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.IllegalArgumentException: Error: type expected at the position 0 of 'interval' but 'interval' is found.
  at org.apache.hadoop.hive.ql.metadata.Hive.createTable(Hive.java:862)
```
Now, with this PR, we restore the type check on the Spark side.

### How was this patch tested?

Added more UTs.

Closes #27277 from yaooqinn/SPARK-30568.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-01-21 11:14:26 +08:00
Kent Yao 24efa43826 [SPARK-30019][SQL] Add the owner property to v2 table
### What changes were proposed in this pull request?

Add the `owner` property to v2 tables. It is reserved by `TableCatalog` and indicates the table's owner.

### Why are the changes needed?

Enhance the ownership management of the catalog API.

### Does this PR introduce any user-facing change?

Yes, this adds one reserved property, `owner`, which is no longer allowed in OPTIONS/TBLPROPERTIES unless the legacy behavior is turned on.
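
A hedged sketch of what the reservation means in practice (catalog and table names are hypothetical):

```scala
// Setting the reserved `owner` property manually is expected to be rejected
// unless the legacy behavior is turned on.
spark.sql(
  """CREATE TABLE testcat.ns.t (id INT) USING parquet
    |TBLPROPERTIES ('owner' = 'alice')
    |""".stripMargin)  // expected to fail: `owner` is a reserved table property
```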

### How was this patch tested?

Added UTs.

Closes #27249 from yaooqinn/SPARK-30019.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-01-21 10:37:49 +08:00
HyukjinKwon 14bc2a2162 [SPARK-30530][SQL][FOLLOW-UP] Remove unnecessary codes and fix comments accordingly in UnivocityParser
### What changes were proposed in this pull request?

This PR proposes to clean up `UnivocityParser`.

### Why are the changes needed?

It slightly improves performance, since we avoid unnecessary computation for array concatenation and creation.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Manually ran the existing tests.

Closes #27287 from HyukjinKwon/SPARK-30530-followup.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-01-21 10:20:01 +09:00
Maxim Gekk fd69533593 [SPARK-30482][CORE][SQL][TESTS][FOLLOW-UP] Output caller info in log appenders while reaching the limit
### What changes were proposed in this pull request?
In this PR, I propose to output an additional message from the tests where a log appender is added. The message is printed as part of the `IllegalStateException` when the limit on the maximum number of logged events is reached.
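
A minimal sketch of the mechanism (class and parameter names here are assumptions, not necessarily the actual test helper's API):

```scala
import org.apache.log4j.AppenderSkeleton
import org.apache.log4j.spi.LoggingEvent

import scala.collection.mutable.ArrayBuffer

// Sketch: a test log appender that includes caller-supplied context in the
// failure message, so hitting the event limit points at the suite that
// forgot to remove the appender.
class TestLogAppender(callerInfo: String, maxEvents: Int = 1000)
    extends AppenderSkeleton {
  val events = new ArrayBuffer[LoggingEvent]()

  override def append(event: LoggingEvent): Unit = {
    if (events.size >= maxEvents) {
      throw new IllegalStateException(
        s"Number of events reached the limit of $maxEvents while $callerInfo")
    }
    events += event
  }
  override def close(): Unit = {}
  override def requiresLayout(): Boolean = false
}
```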

### Why are the changes needed?
If a log appender is not removed from the log4j appenders list, the caller message could help investigate the problem and find the test that fails to remove the log appender.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
By running the modified test suites, e.g. `AvroSuite`, `CSVSuite`, and `ResolveHintsSuite`.

Closes #27296 from MaxGekk/assign-name-to-log-appender.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-01-21 10:19:07 +09:00