Commit graph

10026 commits

ulysses a9077299d7 [SPARK-32743][SQL] Add distinct info at UnresolvedFunction toString
### What changes were proposed in this pull request?

Add distinct info at `UnresolvedFunction.toString`.

### Why are the changes needed?

Make `UnresolvedFunction` info complete.

```
create table test (c1 int, c2 int);
explain extended select sum(distinct c1) from test;

-- before this pr
== Parsed Logical Plan ==
'Project [unresolvedalias('sum('c1), None)]
+- 'UnresolvedRelation [test]

-- after this pr
== Parsed Logical Plan ==
'Project [unresolvedalias('sum(distinct 'c1), None)]
+- 'UnresolvedRelation [test]
```

### Does this PR introduce _any_ user-facing change?

Yes, the distinct info is now shown in the parsed plan.

### How was this patch tested?

Manual test.

Closes #29586 from ulysses-you/SPARK-32743.

Authored-by: ulysses <youxiduo@weidian.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-10-09 09:25:22 +09:00
Max Gekk c5f6af9f17 [SPARK-33094][SQL] Make ORC format propagate Hadoop config from DS options to underlying HDFS file system
### What changes were proposed in this pull request?
Propagate ORC options to Hadoop configs in Hive `OrcFileFormat` and in the regular ORC datasource.

### Why are the changes needed?
There is a bug that when running:
```scala
spark.read.format("orc").options(conf).load(path)
```
The underlying file system will not receive the conf options.
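
For illustration, a hedged usage sketch of what this fix enables (the option keys below are arbitrary Hadoop settings chosen as examples, not keys required by this PR; a `spark` session is assumed in scope):
```scala
// Hadoop-level settings passed as data source options should now reach the
// underlying file system when reading ORC. Keys and values here are
// illustrative assumptions only.
val conf = Map(
  "fs.defaultFS" -> "hdfs://namenode:9000",
  "dfs.client.use.datanode.hostname" -> "true"
)
val df = spark.read.format("orc").options(conf).load("/path/to/orc")
```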

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
Added UT to `OrcSourceSuite`.

Closes #29976 from MaxGekk/orc-option-propagation.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-10-08 11:59:30 -07:00
HyukjinKwon 5effa8ea26 [SPARK-33091][SQL] Avoid using map instead of foreach to avoid potential side effect at callers of OrcUtils.readCatalystSchema
### What changes were proposed in this pull request?

This is a kind of followup of SPARK-32646. A new JIRA was filed to control the fixed versions properly.

When you use `map`, it might be lazily evaluated and not executed. To avoid this, it is better to use `foreach`. See also SPARK-16694. The current code does not appear to cause any bug for now, but it is best to fix it to avoid potential issues.
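
As a self-contained illustration of the hazard (not the Spark code itself): `map` on a lazy collection such as an `Iterator` defers the side effect until the result is consumed, while `foreach` runs it eagerly.
```scala
// Nothing is printed here: Iterator.map is lazy, so the side effect is deferred.
val mapped = Iterator(1, 2, 3).map(x => println(s"side effect for $x"))
// The side effects only run once the mapped iterator is actually consumed.
mapped.foreach(_ => ())

// foreach runs the side effect immediately.
Iterator(1, 2, 3).foreach(x => println(s"side effect for $x"))
```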

### Why are the changes needed?

To avoid potential issues from `map` being lazy and not executed.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Ran related tests. CI in this PR should verify.

Closes #29974 from HyukjinKwon/SPARK-32646.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-10-08 16:29:15 +09:00
Max Gekk 7d6e3fb998 [SPARK-33074][SQL] Classify dialect exceptions in JDBC v2 Table Catalog
### What changes were proposed in this pull request?
1. Add a new method to the `JdbcDialect` class - `classifyException()`. It converts a dialect-specific exception to Spark's `AnalysisException` or one of its sub-classes.
2. Replace the H2 exception `org.h2.jdbc.JdbcSQLException` in `JDBCTableCatalogSuite` with `AnalysisException`.
3. Add `H2Dialect`.

### Why are the changes needed?
Currently the JDBC v2 Table Catalog implementation throws dialect-specific exceptions and ignores the exceptions defined in the `TableCatalog` interface. This PR adds a new method for converting dialect-specific exceptions, and assumes that follow-up PRs will implement `classifyException()`.
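
A self-contained sketch of the classification idea (the names and signature below are illustrative assumptions, not the exact Spark API):
```scala
// A dialect-agnostic error type that catalog callers can rely on.
class CatalogOperationException(message: String, cause: Throwable)
  extends RuntimeException(message, cause)

// Convert a driver-specific SQLException (e.g. from H2) into the generic type;
// anything else is passed through unchanged.
def classifyException(message: String, e: Throwable): Throwable = e match {
  case sqlEx: java.sql.SQLException => new CatalogOperationException(message, sqlEx)
  case other                        => other
}
```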

### Does this PR introduce _any_ user-facing change?
Yes.

### How was this patch tested?
By running existing test suites `JDBCTableCatalogSuite` and `JDBCV2Suite`.

Closes #29952 from MaxGekk/jdbcv2-classify-exception.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-10-08 05:28:33 +00:00
Terry Kim 1c781a4354 [SPARK-32282][SQL] Improve EnsureRequirements.reorderJoinKeys to handle more scenarios such as PartitioningCollection
### What changes were proposed in this pull request?

This PR proposes to improve `EnsureRequirements.reorderJoinKeys` to handle the following scenarios (a generic sketch of the reordering idea follows the list):
1. If the keys cannot be reordered to match the left-side `HashPartitioning`, consider the right-side `HashPartitioning`.
2. Handle `PartitioningCollection`, which may contain `HashPartitioning`.
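
A generic sketch of the key-reordering idea (simplified; the real code works on Catalyst expressions and `HashPartitioning`, and the helper below is an illustrative assumption):
```scala
// Try to reorder the join keys so they line up with an existing partitioning's
// expressions; if that succeeds, no extra shuffle is needed on that side.
def reorderKeys[T](
    leftKeys: Seq[T],
    rightKeys: Seq[T],
    partitionExprs: Seq[T]): Option[(Seq[T], Seq[T])] = {
  val matches = partitionExprs.length == leftKeys.length &&
    partitionExprs.forall(leftKeys.contains)
  if (matches) {
    // Reorder both sides consistently, following the partitioning order.
    val pairs = partitionExprs.map { e =>
      val idx = leftKeys.indexOf(e)
      (leftKeys(idx), rightKeys(idx))
    }
    Some((pairs.map(_._1), pairs.map(_._2)))
  } else {
    None
  }
}
```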

### Why are the changes needed?

1. For scenario 1, the current behavior matches either the left-side `HashPartitioning` or the right-side `HashPartitioning`. This means that if both sides are `HashPartitioning`, it will try to match only the left side.
The following will not consider the right-side `HashPartitioning`:
```
val df1 = (0 until 10).map(i => (i % 5, i % 13)).toDF("i1", "j1")
val df2 = (0 until 10).map(i => (i % 7, i % 11)).toDF("i2", "j2")
df1.write.format("parquet").bucketBy(4, "i1", "j1").saveAsTable("t1")
df2.write.format("parquet").bucketBy(4, "i2", "j2").saveAsTable("t2")
val t1 = spark.table("t1")
val t2 = spark.table("t2")
val join = t1.join(t2, t1("i1") === t2("j2") && t1("i1") === t2("i2"))
 join.explain

== Physical Plan ==
*(5) SortMergeJoin [i1#26, i1#26], [j2#31, i2#30], Inner
:- *(2) Sort [i1#26 ASC NULLS FIRST, i1#26 ASC NULLS FIRST], false, 0
:  +- Exchange hashpartitioning(i1#26, i1#26, 4), true, [id=#69]
:     +- *(1) Project [i1#26, j1#27]
:        +- *(1) Filter isnotnull(i1#26)
:           +- *(1) ColumnarToRow
:              +- FileScan parquet default.t1[i1#26,j1#27] Batched: true, DataFilters: [isnotnull(i1#26)], Format: Parquet, Location: InMemoryFileIndex[..., PartitionFilters: [], PushedFilters: [IsNotNull(i1)], ReadSchema: struct<i1:int,j1:int>, SelectedBucketsCount: 4 out of 4
+- *(4) Sort [j2#31 ASC NULLS FIRST, i2#30 ASC NULLS FIRST], false, 0.
   +- Exchange hashpartitioning(j2#31, i2#30, 4), true, [id=#79].       <===== This can be removed
      +- *(3) Project [i2#30, j2#31]
         +- *(3) Filter (((j2#31 = i2#30) AND isnotnull(j2#31)) AND isnotnull(i2#30))
            +- *(3) ColumnarToRow
               +- FileScan parquet default.t2[i2#30,j2#31] Batched: true, DataFilters: [(j2#31 = i2#30), isnotnull(j2#31), isnotnull(i2#30)], Format: Parquet, Location: InMemoryFileIndex[..., PartitionFilters: [], PushedFilters: [IsNotNull(j2), IsNotNull(i2)], ReadSchema: struct<i2:int,j2:int>, SelectedBucketsCount: 4 out of 4

```

2. For scenario 2, the current behavior does not handle `PartitioningCollection`:
```
val df1 = (0 until 100).map(i => (i % 5, i % 13)).toDF("i1", "j1")
val df2 = (0 until 100).map(i => (i % 7, i % 11)).toDF("i2", "j2")
val df3 = (0 until 100).map(i => (i % 5, i % 13)).toDF("i3", "j3")
val join = df1.join(df2, df1("i1") === df2("i2") && df1("j1") === df2("j2")) // PartitioningCollection
val join2 = join.join(df3, join("j1") === df3("j3") && join("i1") === df3("i3"))
join2.explain

== Physical Plan ==
*(9) SortMergeJoin [j1#8, i1#7], [j3#30, i3#29], Inner
:- *(6) Sort [j1#8 ASC NULLS FIRST, i1#7 ASC NULLS FIRST], false, 0.       <===== This can be removed
:  +- Exchange hashpartitioning(j1#8, i1#7, 5), true, [id=#58]             <===== This can be removed
:     +- *(5) SortMergeJoin [i1#7, j1#8], [i2#18, j2#19], Inner
:        :- *(2) Sort [i1#7 ASC NULLS FIRST, j1#8 ASC NULLS FIRST], false, 0
:        :  +- Exchange hashpartitioning(i1#7, j1#8, 5), true, [id=#45]
:        :     +- *(1) Project [_1#2 AS i1#7, _2#3 AS j1#8]
:        :        +- *(1) LocalTableScan [_1#2, _2#3]
:        +- *(4) Sort [i2#18 ASC NULLS FIRST, j2#19 ASC NULLS FIRST], false, 0
:           +- Exchange hashpartitioning(i2#18, j2#19, 5), true, [id=#51]
:              +- *(3) Project [_1#13 AS i2#18, _2#14 AS j2#19]
:                 +- *(3) LocalTableScan [_1#13, _2#14]
+- *(8) Sort [j3#30 ASC NULLS FIRST, i3#29 ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(j3#30, i3#29, 5), true, [id=#64]
      +- *(7) Project [_1#24 AS i3#29, _2#25 AS j3#30]
         +- *(7) LocalTableScan [_1#24, _2#25]
```
### Does this PR introduce _any_ user-facing change?

Yes, now from the above examples, the shuffle/sort nodes pointed by `This can be removed` are now removed:
1. Scenario 1):
```
== Physical Plan ==
*(4) SortMergeJoin [i1#26, i1#26], [i2#30, j2#31], Inner
:- *(2) Sort [i1#26 ASC NULLS FIRST, i1#26 ASC NULLS FIRST], false, 0
:  +- Exchange hashpartitioning(i1#26, i1#26, 4), true, [id=#67]
:     +- *(1) Project [i1#26, j1#27]
:        +- *(1) Filter isnotnull(i1#26)
:           +- *(1) ColumnarToRow
:              +- FileScan parquet default.t1[i1#26,j1#27] Batched: true, DataFilters: [isnotnull(i1#26)], Format: Parquet, Location: InMemoryFileIndex[..., PartitionFilters: [], PushedFilters: [IsNotNull(i1)], ReadSchema: struct<i1:int,j1:int>, SelectedBucketsCount: 4 out of 4
+- *(3) Sort [i2#30 ASC NULLS FIRST, j2#31 ASC NULLS FIRST], false, 0
   +- *(3) Project [i2#30, j2#31]
      +- *(3) Filter (((j2#31 = i2#30) AND isnotnull(j2#31)) AND isnotnull(i2#30))
         +- *(3) ColumnarToRow
            +- FileScan parquet default.t2[i2#30,j2#31] Batched: true, DataFilters: [(j2#31 = i2#30), isnotnull(j2#31), isnotnull(i2#30)], Format: Parquet, Location: InMemoryFileIndex[..., PartitionFilters: [], PushedFilters: [IsNotNull(j2), IsNotNull(i2)], ReadSchema: struct<i2:int,j2:int>, SelectedBucketsCount: 4 out of 4
```
2. Scenario 2):
```
== Physical Plan ==
*(8) SortMergeJoin [i1#7, j1#8], [i3#29, j3#30], Inner
:- *(5) SortMergeJoin [i1#7, j1#8], [i2#18, j2#19], Inner
:  :- *(2) Sort [i1#7 ASC NULLS FIRST, j1#8 ASC NULLS FIRST], false, 0
:  :  +- Exchange hashpartitioning(i1#7, j1#8, 5), true, [id=#43]
:  :     +- *(1) Project [_1#2 AS i1#7, _2#3 AS j1#8]
:  :        +- *(1) LocalTableScan [_1#2, _2#3]
:  +- *(4) Sort [i2#18 ASC NULLS FIRST, j2#19 ASC NULLS FIRST], false, 0
:     +- Exchange hashpartitioning(i2#18, j2#19, 5), true, [id=#49]
:        +- *(3) Project [_1#13 AS i2#18, _2#14 AS j2#19]
:           +- *(3) LocalTableScan [_1#13, _2#14]
+- *(7) Sort [i3#29 ASC NULLS FIRST, j3#30 ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(i3#29, j3#30, 5), true, [id=#58]
      +- *(6) Project [_1#24 AS i3#29, _2#25 AS j3#30]
         +- *(6) LocalTableScan [_1#24, _2#25]
```

### How was this patch tested?

Added tests.

Closes #29074 from imback82/reorder_keys.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-10-08 04:58:41 +00:00
Karen Feng 39510b0e9b [SPARK-32793][SQL] Add raise_error function, adds error message parameter to assert_true
### What changes were proposed in this pull request?

Adds a SQL function `raise_error` which underlies the refactored `assert_true` function. `assert_true` now also (optionally) accepts a custom error message field.
`raise_error` is exposed in SQL, Python, Scala, and R.
`assert_true` was previously only exposed in SQL; it is now also exposed in Python, Scala, and R.

### Why are the changes needed?

Improves usability of `assert_true` by clarifying error messaging, and adds the useful helper function `raise_error`.
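
A short usage sketch based on the description above (behavior as described in this PR; minor details may differ, and a `spark` session is assumed in scope):
```scala
// assert_true with a custom message: passes silently (returns NULL) when the
// condition holds, otherwise fails the query with the given message.
spark.sql("SELECT assert_true(1 < 2, 'expected 1 to be less than 2')").show()

// raise_error always aborts the query with the provided message.
spark.sql("SELECT raise_error('custom failure message')").show()
```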

### Does this PR introduce _any_ user-facing change?

Yes:
- Adds `raise_error` function to the SQL, Python, Scala, and R APIs.
- Adds `assert_true` function to the SQL, Python and R APIs.

### How was this patch tested?

Adds unit tests in SQL, Python, Scala, and R for `assert_true` and `raise_error`.

Closes #29947 from karenfeng/spark-32793.

Lead-authored-by: Karen Feng <karen.feng@databricks.com>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-10-08 12:05:39 +09:00
Max Gekk 23afc930ae [SPARK-26499][SQL][FOLLOWUP] Print the loading provider exception starting from the INFO level
### What changes were proposed in this pull request?
1. Don't print the exception in the error message while loading a built-in provider.
2. Print the exception starting from the INFO level.

With the log level above INFO, the output is only:
```
17:48:32.342 ERROR org.apache.spark.sql.execution.datasources.jdbc.connection.ConnectionProvider: Failed to load built in provider.
```
and starting from the INFO level:
```
17:48:32.342 ERROR org.apache.spark.sql.execution.datasources.jdbc.connection.ConnectionProvider: Failed to load built in provider.
17:48:32.342 INFO org.apache.spark.sql.execution.datasources.jdbc.connection.ConnectionProvider: Loading of the provider failed with the exception:
java.util.ServiceConfigurationError: org.apache.spark.sql.jdbc.JdbcConnectionProvider: Provider org.apache.spark.sql.execution.datasources.jdbc.connection.IntentionallyFaultyConnectionProvider could not be instantiated
	at java.util.ServiceLoader.fail(ServiceLoader.java:232)
	at java.util.ServiceLoader.access$100(ServiceLoader.java:185)
	at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:384)
	at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:404)
	at java.util.ServiceLoader$1.next(ServiceLoader.java:480)
	at org.apache.spark.sql.execution.datasources.jdbc.connection.ConnectionProvider$.loadProviders(ConnectionProvider.scala:41)
```
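
An illustrative stand-in for the logging pattern described above (using slf4j directly; the real change uses Spark's internal logging helpers, and the wrapper here is an assumption for illustration):
```scala
import org.slf4j.LoggerFactory

object ProviderLoaderSketch {
  private val log = LoggerFactory.getLogger(getClass)

  // Log the short message at ERROR, but the full stack trace only at INFO,
  // so test logs stay quiet unless INFO logging is enabled.
  def safeLoad(load: () => Unit): Unit =
    try load() catch {
      case t: Throwable =>
        log.error("Failed to load built in provider.")
        log.info("Loading of the provider failed with the exception:", t)
    }
}
```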

### Why are the changes needed?
To avoid "noise" in logs while running tests. Currently, logs are blown up:
```
org.apache.spark.sql.execution.datasources.jdbc.connection.ConnectionProvider: Loading of the provider failed with the exception:
java.util.ServiceConfigurationError: org.apache.spark.sql.jdbc.JdbcConnectionProvider: Provider org.apache.spark.sql.execution.datasources.jdbc.connection.IntentionallyFaultyConnectionProvider could not be instantiated
	at java.util.ServiceLoader.fail(ServiceLoader.java:232)
	at java.util.ServiceLoader.access$100(ServiceLoader.java:185)
	at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:384)
	at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:404)
	at java.util.ServiceLoader$1.next(ServiceLoader.java:480)
	at org.apache.spark.sql.execution.datasources.jdbc.connection.ConnectionProvider$.loadProviders(ConnectionProvider.scala:41)
...
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.IllegalArgumentException: Intentional Exception
	at org.apache.spark.sql.execution.datasources.jdbc.connection.IntentionallyFaultyConnectionProvider.<init>(IntentionallyFaultyConnectionProvider.scala:26)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
	at java.lang.Class.newInstance(Class.java:442)
	at java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:380)
```

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running:
```
$ build/sbt "sql/test:testOnly org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCTableCatalogSuite"
```

Closes #29968 from MaxGekk/gaborgsomogyi-SPARK-32001-followup.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-10-07 13:50:15 -07:00
Dongjoon Hyun a127387a53 [SPARK-33082][SQL] Remove hive-1.2 workaround code
### What changes were proposed in this pull request?

This PR removes old Hive-1.2 profile related workaround code.

### Why are the changes needed?

To simplify the code.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CI.

Closes #29961 from dongjoon-hyun/SPARK-HIVE12.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-10-07 12:27:23 -07:00
Takeshi Yamamuro 94d648dff5 [SPARK-33036][SQL] Refactor RewriteCorrelatedScalarSubquery code to replace exprIds in a bottom-up manner
### What changes were proposed in this pull request?

This PR intends to refactor the code in `RewriteCorrelatedScalarSubquery` to replace `ExprId`s in a bottom-up manner instead of a top-down one.

This PR comes from the talk with cloud-fan in https://github.com/apache/spark/pull/29585#discussion_r490371252.

### Why are the changes needed?

To improve the code.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #29913 from maropu/RefactorRewriteCorrelatedScalarSubquery.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-10-07 20:16:40 +09:00
Terry Kim 7e99fcd64e [SPARK-33004][SQL] Migrate DESCRIBE column to use UnresolvedTableOrView to resolve the identifier
### What changes were proposed in this pull request?

This PR proposes to migrate `DESCRIBE tbl colname` to use `UnresolvedTableOrView` to resolve the table/view identifier. This allows consistent resolution rules (temp view first, etc.) to be applied for both v1/v2 commands. More info about the consistent resolution rule proposal can be found in [JIRA](https://issues.apache.org/jira/browse/SPARK-29900) or [proposal doc](https://docs.google.com/document/d/1hvLjGA8y_W_hhilpngXVub1Ebv8RsMap986nENCFnrg/edit?usp=sharing).

### Why are the changes needed?

The current behavior is not consistent between v1 and v2 commands when resolving a temp view.
In v2, the `t` in the following example is resolved to a table:
```scala
sql("CREATE TABLE testcat.ns.t (id bigint) USING foo")
sql("CREATE TEMPORARY VIEW t AS SELECT 2 as i")
sql("USE testcat.ns")
sql("DESCRIBE t i") // 't' is resolved to testcat.ns.t

Describing columns is not supported for v2 tables.;
org.apache.spark.sql.AnalysisException: Describing columns is not supported for v2 tables.;
```
whereas in v1, the `t` is resolved to a temp view:
```scala
sql("CREATE DATABASE test")
sql("CREATE TABLE spark_catalog.test.t (id bigint) USING csv")
sql("CREATE TEMPORARY VIEW t AS SELECT 2 as i")
sql("USE spark_catalog.test")
sql("DESCRIBE t i").show // 't' is resolved to a temp view

+---------+----------+
|info_name|info_value|
+---------+----------+
| col_name|         i|
|data_type|       int|
|  comment|      NULL|
+---------+----------+
```

### Does this PR introduce _any_ user-facing change?

After this PR, `DESCRIBE t i` is resolved to a temp view `t` instead of `testcat.ns.t`.

### How was this patch tested?

Added a new test

Closes #29880 from imback82/describe_column_consistent.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-10-07 06:33:20 +00:00
Max Gekk aea78d2c8c [SPARK-33034][SQL] Support ALTER TABLE in JDBC v2 Table Catalog: add, update type and nullability of columns (Oracle dialect)
### What changes were proposed in this pull request?
1. Override the default SQL strings in the Oracle Dialect for:
    - ALTER TABLE ADD COLUMN
    - ALTER TABLE UPDATE COLUMN TYPE
    - ALTER TABLE UPDATE COLUMN NULLABILITY
2. Add new docker integration test suite `jdbc/v2/OracleIntegrationSuite.scala`

### Why are the changes needed?
In SPARK-24907, we implemented the JDBC v2 Table Catalog, but it doesn't support some `ALTER TABLE` operations at the moment. This PR supports Oracle-specific `ALTER TABLE`.
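
A hedged sketch of the kind of dialect-specific SQL strings involved (the exact strings Spark generates may differ; these reflect common Oracle syntax and the helper names are illustrative):
```scala
// Oracle accepts ALTER TABLE ... ADD without the COLUMN keyword and uses
// MODIFY for both type and nullability changes.
def alterAddColumn(table: String, col: String, dataType: String): String =
  s"ALTER TABLE $table ADD $col $dataType"

def alterUpdateColumnType(table: String, col: String, dataType: String): String =
  s"ALTER TABLE $table MODIFY $col $dataType"

def alterUpdateColumnNullability(table: String, col: String, nullable: Boolean): String = {
  val nullability = if (nullable) "NULL" else "NOT NULL"
  s"ALTER TABLE $table MODIFY $col $nullability"
}
```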

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
By running new integration test suite:
```
$ ./build/sbt -Pdocker-integration-tests "test-only *.OracleIntegrationSuite"
```

Closes #29912 from MaxGekk/jdbcv2-oracle-alter-table.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-10-07 04:48:57 +00:00
Max Gekk 584f90c82e [SPARK-33067][SQL][TESTS][FOLLOWUP] Check error messages in JDBCTableCatalogSuite
### What changes were proposed in this pull request?
Get the error messages from the expected exceptions, and check that they are reasonable.

### Why are the changes needed?
To improve tests by expecting particular error messages.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running `JDBCTableCatalogSuite`.

Closes #29957 from MaxGekk/jdbcv2-negative-tests-followup.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-10-07 09:29:30 +09:00
Liang-Chi Hsieh 57ed5a829b [SPARK-33007][SQL] Simplify named_struct + get struct field + from_json expression chain
### What changes were proposed in this pull request?

This PR proposes to simplify the named_struct + get struct field + from_json expression chain from `struct(from_json.col1, from_json.col2, from_json.col3...)` to `struct(from_json)`.

### Why are the changes needed?

Simplify complex expression trees that could be produced by query optimization or by users.
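
A hedged illustration of the pattern being simplified (assumes a `spark` session in scope; the column and field names are made up for the example):
```scala
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import spark.implicits._

val schema = new StructType().add("col1", IntegerType).add("col2", StringType)
val df = Seq("""{"col1": 1, "col2": "a"}""").toDF("json")

// Before simplification: every field of the parsed struct is extracted and
// reassembled into a new struct.
val verbose = df.select(
  struct(
    from_json($"json", schema).getField("col1").as("col1"),
    from_json($"json", schema).getField("col2").as("col2")
  ).as("s"))

// Semantically the same result: use the parsed struct directly.
val simple = df.select(from_json($"json", schema).as("s"))
```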

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit test.

Closes #29942 from viirya/SPARK-33007.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-10-06 16:59:23 -07:00
Kousuke Saruta 3b2a38d735 [SPARK-32511][SQL][FOLLOWUP] Fix the broken build for Scala 2.13 with Maven
### What changes were proposed in this pull request?

This PR fixes the broken build for Scala 2.13 with Maven.
https://github.com/apache/spark/pull/29913/checks?check_run_id=1187826966

#29795 was merged even though it doesn't successfully build for Scala 2.13.

### Why are the changes needed?

To fix the build.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

`build/mvn -Pscala-2.13 -Phive -Phive-thriftserver -DskipTests package`

Closes #29954 from sarutak/hotfix-seq.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-10-06 09:40:16 -07:00
Kent Yao 17d309dfac [SPARK-32963][SQL] empty string should be consistent for schema name in SparkGetSchemasOperation
### What changes were proposed in this pull request?
This PR makes an empty string for the schema name pattern match the global temp view, the same way it works for other databases.

This PR also adds new tests covering different kinds of wildcards to verify SparkGetSchemasOperation.

### Why are the changes needed?

When the schema name is an empty string, it is treated as ".*" and can match all databases in the catalog.
However, it cannot match the global temp view because it is not converted to ".*" in that case.

### Does this PR introduce _any_ user-facing change?

Yes, JDBC operations like `statement.getConnection.getMetaData.getSchemas(null, "")` now also provide the global temp view in the result set.

### How was this patch tested?

New tests.

Closes #29834 from yaooqinn/SPARK-32963.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-10-06 16:01:10 +00:00
Wenchen Fan ec6fccb922 [SPARK-32243][SQL][FOLLOWUP] Fix compilation in HiveSessionCatalog
Fix a mistake when merging https://github.com/apache/spark/pull/29054

Closes #29955 from cloud-fan/hot-fix.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-10-06 14:33:34 +00:00
angerszhu ddc7012b3d [SPARK-32243][SQL] HiveSessionCatalog call super.makeFunctionExpression should throw earlier when got Spark UDAF Invalid arguments number error
### What changes were proposed in this pull request?
When we create a UDAF using a class that extends `UserDefinedAggregateFunction` and then call the function with Hive support enabled, `HiveSessionCatalog` calls `super.makeFunctionExpression`.

If that call fails (for example, because the function needs 2 parameters and we only give 1), the error is swallowed and the exception that is ultimately thrown only shows:
```
No handler for UDF/UDAF/UDTF xxxxxxxx
```
This is confusing for developers; we should show the error thrown by the super method too.

For this PR's UT:
Before this change, it throws an exception like
```
No handler for UDF/UDAF/UDTF 'org.apache.spark.sql.hive.execution.LongProductSum'; line 1 pos 7
```
After this PR, it throws
```
Spark UDAF Error: Invalid number of arguments for function longProductSum. Expected: 2; Found: 1;
Hive UDF/UDAF/UDTF Error: No handler for UDF/UDAF/UDTF 'org.apache.spark.sql.hive.execution.LongProductSum'; line 1 pos 7
```
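
A self-contained sketch of the "surface both errors" idea (names are illustrative assumptions, not the exact Spark code):
```scala
// Try the Spark UDAF path first; if the Hive fallback also fails, throw an
// error that carries both messages instead of only the Hive one.
def makeFunctionExpression(makeSpark: () => AnyRef, makeHive: () => AnyRef): AnyRef =
  try makeSpark() catch {
    case sparkErr: Exception =>
      try makeHive() catch {
        case hiveErr: Exception =>
          throw new RuntimeException(
            s"Spark UDAF Error: ${sparkErr.getMessage}\n" +
              s"Hive UDF/UDAF/UDTF Error: ${hiveErr.getMessage}")
      }
  }
```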

### Why are the changes needed?
Show a more detailed error message when defining a UDAF.

### Does this PR introduce _any_ user-facing change?
Users will see a more detailed error message when using a Spark SQL UDAF in Hive support mode.

### How was this patch tested?
Added UT

Closes #29054 from AngersZhuuuu/SPARK-32243.

Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-10-06 09:09:19 +00:00
fqaiser94@gmail.com 2793347972 [SPARK-32511][SQL] Add dropFields method to Column class
### What changes were proposed in this pull request?

1. Refactored `WithFields` Expression to make it more extensible (now `UpdateFields`).
2. Added a new `dropFields` method to the `Column` class. This method should allow users to drop a `StructField` in a `StructType` column (with similar semantics to the `drop` method on `Dataset`).

### Why are the changes needed?

Often Spark users have to work with deeply nested data e.g. to fix a data quality issue with an existing `StructField`. To do this with the existing Spark APIs, users have to rebuild the entire struct column.

For example, let's say you have the following deeply nested data structure which has a data quality issue (`5` is missing):
```
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

val data = spark.createDataFrame(sc.parallelize(
      Seq(Row(Row(Row(1, 2, 3), Row(Row(4, null, 6), Row(7, 8, 9), Row(10, 11, 12)), Row(13, 14, 15))))),
      StructType(Seq(
        StructField("a", StructType(Seq(
          StructField("a", StructType(Seq(
            StructField("a", IntegerType),
            StructField("b", IntegerType),
            StructField("c", IntegerType)))),
          StructField("b", StructType(Seq(
            StructField("a", StructType(Seq(
              StructField("a", IntegerType),
              StructField("b", IntegerType),
              StructField("c", IntegerType)))),
            StructField("b", StructType(Seq(
              StructField("a", IntegerType),
              StructField("b", IntegerType),
              StructField("c", IntegerType)))),
            StructField("c", StructType(Seq(
              StructField("a", IntegerType),
              StructField("b", IntegerType),
              StructField("c", IntegerType))))
          ))),
          StructField("c", StructType(Seq(
            StructField("a", IntegerType),
            StructField("b", IntegerType),
            StructField("c", IntegerType))))
        )))))).cache

data.show(false)
+---------------------------------+
|a                                |
+---------------------------------+
|[[1, 2, 3], [[4,, 6], [7, 8, 9]]]|
+---------------------------------+
```
Currently, to drop the missing value users would have to do something like this:
```
val result = data.withColumn("a",
  struct(
    $"a.a",
    struct(
      struct(
        $"a.b.a.a",
        $"a.b.a.c"
      ).as("a"),
      $"a.b.b",
      $"a.b.c"
    ).as("b"),
    $"a.c"
  ))

result.show(false)
+---------------------------------------------------------------+
|a                                                              |
+---------------------------------------------------------------+
|[[1, 2, 3], [[4, 6], [7, 8, 9], [10, 11, 12]], [13, 14, 15]]|
+---------------------------------------------------------------+
```
As you can see above, with the existing methods users must call the `struct` function and list all fields, including fields they don't want to change. This is not ideal as:
>this leads to complex, fragile code that cannot survive schema evolution.
[SPARK-16483](https://issues.apache.org/jira/browse/SPARK-16483)

In contrast, with the method added in this PR, a user could simply do something like this to get the same result:
```
val result = data.withColumn("a", 'a.dropFields("b.a.b"))
result.show(false)
+---------------------------------------------------------------+
|a                                                              |
+---------------------------------------------------------------+
|[[1, 2, 3], [[4, 6], [7, 8, 9], [10, 11, 12]], [13, 14, 15]]|
+---------------------------------------------------------------+

```

This is the second of maybe 3 methods that could be added to the `Column` class to make it easier to manipulate nested data.
Other methods under discussion in [SPARK-22231](https://issues.apache.org/jira/browse/SPARK-22231) include `withFieldRenamed`.
However, this should be added in a separate PR.

### Does this PR introduce _any_ user-facing change?

The documentation for the `Column.withField` method has changed to include an additional note about how to write optimized queries when adding multiple nested Columns directly.

### How was this patch tested?

New unit tests were added. Jenkins must pass them.

### Related JIRAs:
More discussion on this topic can be found here:
- https://issues.apache.org/jira/browse/SPARK-22231
- https://issues.apache.org/jira/browse/SPARK-16483

Closes #29795 from fqaiser94/SPARK-32511-dropFields-second-try.

Authored-by: fqaiser94@gmail.com <fqaiser94@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-10-06 08:53:30 +00:00
Takeshi Yamamuro 4adc2822a3 [SPARK-33035][SQL] Updates the obsoleted entries of attribute mapping in QueryPlan#transformUpWithNewOutput
### What changes were proposed in this pull request?

This PR intends to fix corner-case bugs in the `QueryPlan#transformUpWithNewOutput` that is used to propagate updated `ExprId`s in a bottom-up way. Let's say we have a rule to simply assign new `ExprId`s in a projection list like this;
```
case class TestRule extends Rule[LogicalPlan] {
  override def apply(plan: LogicalPlan): LogicalPlan = plan.transformUpWithNewOutput {
    case p @ Project(projList, _) =>
      val newPlan = p.copy(projectList = projList.map { _.transform {
        // Assigns a new `ExprId` for references
        case a: AttributeReference => Alias(a, a.name)()
      }}.asInstanceOf[Seq[NamedExpression]])

      val attrMapping = p.output.zip(newPlan.output)
      newPlan -> attrMapping
  }
}
```
Then, this rule is applied into a plan below;
```
(3) Project [a#5, b#6]
+- (2) Project [a#5, b#6]
   +- (1) Project [a#5, b#6]
      +- LocalRelation <empty>, [a#5, b#6]
```
In the first transformation, the rule assigns new `ExprId`s in `(1) Project` (e.g., a#5 AS a#7, b#6 AS b#8). In the second transformation, the rule corrects the input references of `(2) Project`  first by using attribute mapping given from `(1) Project` (a#5->a#7 and b#6->b#8) and then assigns new `ExprId`s (e.g., a#7 AS a#9, b#8 AS b#10). But, in the third transformation, the rule fails because it tries to correct the references of `(3) Project` by using incorrect attribute mapping (a#7->a#9 and b#8->b#10) even though the correct one is a#5->a#9 and b#6->b#10. To fix this issue, this PR modified the code to update the attribute mapping entries that are obsoleted by generated entries in a given rule.
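
A minimal sketch of the mapping fix described above, using plain strings in place of attributes (the helper name is illustrative):
```scala
// An older entry whose target was itself remapped by a newer entry must be
// rewritten to point at the final attribute (a#5 -> a#7 plus a#7 -> a#9
// becomes a#5 -> a#9); otherwise later plans are rewritten with stale IDs.
def updateObsoleted(
    existing: Seq[(String, String)],
    generated: Seq[(String, String)]): Seq[(String, String)] = {
  val newMap = generated.toMap
  existing.map { case (from, to) => from -> newMap.getOrElse(to, to) } ++ generated
}

// updateObsoleted(Seq("a#5" -> "a#7"), Seq("a#7" -> "a#9"))
//   == Seq("a#5" -> "a#9", "a#7" -> "a#9")
```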

### Why are the changes needed?

bugfix.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added tests in `QueryPlanSuite`.

Closes #29911 from maropu/QueryPlanBug.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-10-06 08:32:55 +00:00
Max Gekk 9870cf9c08 [SPARK-33067][SQL][TESTS] Add negative checks to JDBC v2 Table Catalog tests
### What changes were proposed in this pull request?
Add checks for the cases when JDBC v2 Table Catalog commands fail.

### Why are the changes needed?
To improve test coverage.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running `JDBCTableCatalogSuite`.

Closes #29945 from MaxGekk/jdbcv2-negative-tests.

Lead-authored-by: Max Gekk <max.gekk@gmail.com>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-10-06 13:01:57 +09:00
Dongjoon Hyun 008a2ad1f8 [SPARK-20202][BUILD][SQL] Remove references to org.spark-project.hive (Hive 1.2.1)
### What changes were proposed in this pull request?

As of today,
- SPARK-30034 Apache Spark 3.0.0 switched its default Hive execution engine from Hive 1.2 to Hive 2.3. This removes the direct dependency to the forked Hive 1.2.1 in maven repository.
- SPARK-32981 Apache Spark 3.1.0(`master` branch) removed Hive 1.2 related artifacts from Apache Spark binary distributions.

This PR(SPARK-20202) aims to remove the following usage of unofficial Apache Hive fork completely from Apache Spark master for Apache Spark 3.1.0.
```
<hive.group>org.spark-project.hive</hive.group>
<hive.version>1.2.1.spark2</hive.version>
```

For the forked Hive 1.2.1.spark2 users, Apache Spark 2.4(LTS) and 3.0 (~ 2021.12) will provide it.

### Why are the changes needed?

- First, Apache Spark community should not use the unofficial forked release of another Apache project.
- Second, Apache Hive 1.2.1 was released on 2015-06-26 and the forked Hive `1.2.1.spark2` exposed many unfixable bugs in Apache Spark because the forked `1.2.1.spark2` is not maintained at all. Apache Hive 2.3.0 was released on 2017-07-19 and has been used with fewer bugs compared with `1.2.1.spark2`. Many bugs still exist in the `hive-1.2` profile, and new Apache Spark unit tests have been added with the `HiveUtils.isHive23` condition so far.

### Does this PR introduce _any_ user-facing change?

No. This is a dev-only change. PRBuilder will not accept `[test-hive1.2]` on master and `branch-3.1`.

### How was this patch tested?

1. SBT/Hadoop 3.2/Hive 2.3 (https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129366)
2. SBT/Hadoop 2.7/Hive 2.3 (https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129382)
3. SBT/Hadoop 3.2/Hive 1.2 (This is already unsupported because Hive 1.2 doesn't work with Hadoop 3.2.)
4. SBT/Hadoop 2.7/Hive 1.2 (https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/129383, This is rejected)

Closes #29936 from dongjoon-hyun/SPARK-REMOVE-HIVE1.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-10-05 15:29:56 -07:00
allisonwang-db 14aeab3b27 [SPARK-33038][SQL] Combine AQE initial and current plan string when two plans are the same
### What changes were proposed in this pull request?
This PR combines the current plan and the initial plan in the AQE query plan string when the two plans are the same. It also removes the `== Current Plan ==` and `== Initial Plan ==` headers:

Before
```scala
AdaptiveSparkPlan isFinalPlan=false
+- == Current Plan ==
   SortMergeJoin [key#13], [a#23], Inner
   :- Sort [key#13 ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(key#13, 5), true, [id=#94]
            ...
+- == Initial Plan ==
   SortMergeJoin [key#13], [a#23], Inner
   :- Sort [key#13 ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(key#13, 5), true, [id=#94]
            ...
```
After
```scala
AdaptiveSparkPlan isFinalPlan=false
+- SortMergeJoin [key#13], [a#23], Inner
   :- Sort [key#13 ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(key#13, 5), true, [id=#94]
            ...
```
For SQL `EXPLAIN` output:
Before
```scala
AdaptiveSparkPlan (8)
+- == Current Plan ==
   Sort (7)
   +- Exchange (6)
      ...
+- == Initial Plan ==
   Sort (7)
   +- Exchange (6)
      ...
```
After
```scala
AdaptiveSparkPlan (8)
+- Sort (7)
   +- Exchange (6)
      ...
```

### Why are the changes needed?
To simplify the AQE plan string by removing the redundant plan information.

### Does this PR introduce _any_ user-facing change?
Yes.

### How was this patch tested?
Modified the existing unit test.

Closes #29915 from allisonwang-db/aqe-explain.

Authored-by: allisonwang-db <66282705+allisonwang-db@users.noreply.github.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
2020-10-05 09:30:27 -07:00
Yuming Wang 023eb482b2 [SPARK-32914][SQL] Avoid constructing dataType multiple times
### What changes were proposed in this pull request?

Some expressions' data types are not static values; a new object is constructed every time the `dataType` method is called, e.g. `CaseWhen`.
We should avoid constructing the dataType multiple times because it may be used many times, e.g. in [`HyperLogLogPlusPlus.update`](10edeafc69/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/HyperLogLogPlusPlus.scala (L122)).
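
One common way to avoid the repeated construction, shown here as a hedged sketch (not necessarily the exact change made in this PR), is to compute the type once and reuse it:
```scala
trait ExprLike { def dataType: String }

// Before: the data type is rebuilt on every call.
class CaseWhenLike(branches: Seq[ExprLike]) {
  def dataType: String = branches.map(_.dataType).distinct.mkString("|")
}

// After: compute once, reuse on every subsequent call (e.g. per-row aggregate
// updates that consult the type repeatedly).
class CachedCaseWhenLike(branches: Seq[ExprLike]) {
  lazy val dataType: String = branches.map(_.dataType).distinct.mkString("|")
}
```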

### Why are the changes needed?

Improve query performance. for example:
```scala
spark.range(100000000L).selectExpr("approx_count_distinct(case when id % 400 > 20 then id else 0 end)").show
```

Profiling result:
```
-- Execution profile ---
Total samples       : 18365

Frame buffer usage  : 2.6688%

--- 58443254327 ns (31.82%), 5844 samples
  [ 0] GenericTaskQueueSet<OverflowTaskQueue<StarTask, (MemoryType)1, 131072u>, (MemoryType)1>::steal_best_of_2(unsigned int, int*, StarTask&)
  [ 1] StealTask::do_it(GCTaskManager*, unsigned int)
  [ 2] GCTaskThread::run()
  [ 3] java_start(Thread*)
  [ 4] start_thread

--- 6140668667 ns (3.34%), 614 samples
  [ 0] GenericTaskQueueSet<OverflowTaskQueue<StarTask, (MemoryType)1, 131072u>, (MemoryType)1>::peek()
  [ 1] ParallelTaskTerminator::offer_termination(TerminatorTerminator*)
  [ 2] StealTask::do_it(GCTaskManager*, unsigned int)
  [ 3] GCTaskThread::run()
  [ 4] java_start(Thread*)
  [ 5] start_thread

--- 5679994036 ns (3.09%), 568 samples
  [ 0] scala.collection.generic.Growable.$plus$plus$eq
  [ 1] scala.collection.generic.Growable.$plus$plus$eq$
  [ 2] scala.collection.mutable.ListBuffer.$plus$plus$eq
  [ 3] scala.collection.mutable.ListBuffer.$plus$plus$eq
  [ 4] scala.collection.generic.GenericTraversableTemplate.$anonfun$flatten$1
  [ 5] scala.collection.generic.GenericTraversableTemplate$$Lambda$107.411506101.apply
  [ 6] scala.collection.immutable.List.foreach
  [ 7] scala.collection.generic.GenericTraversableTemplate.flatten
  [ 8] scala.collection.generic.GenericTraversableTemplate.flatten$
  [ 9] scala.collection.AbstractTraversable.flatten
  [10] org.apache.spark.internal.config.ConfigEntry.readString
  [11] org.apache.spark.internal.config.ConfigEntryWithDefault.readFrom
  [12] org.apache.spark.sql.internal.SQLConf.getConf
  [13] org.apache.spark.sql.internal.SQLConf.caseSensitiveAnalysis
  [14] org.apache.spark.sql.types.DataType.sameType
  [15] org.apache.spark.sql.catalyst.analysis.TypeCoercion$.$anonfun$haveSameType$1
  [16] org.apache.spark.sql.catalyst.analysis.TypeCoercion$.$anonfun$haveSameType$1$adapted
  [17] org.apache.spark.sql.catalyst.analysis.TypeCoercion$$$Lambda$1527.1975399904.apply
  [18] scala.collection.IndexedSeqOptimized.prefixLengthImpl
  [19] scala.collection.IndexedSeqOptimized.forall
  [20] scala.collection.IndexedSeqOptimized.forall$
  [21] scala.collection.mutable.ArrayBuffer.forall
  [22] org.apache.spark.sql.catalyst.analysis.TypeCoercion$.haveSameType
  [23] org.apache.spark.sql.catalyst.expressions.ComplexTypeMergingExpression.dataTypeCheck
  [24] org.apache.spark.sql.catalyst.expressions.ComplexTypeMergingExpression.dataTypeCheck$
  [25] org.apache.spark.sql.catalyst.expressions.CaseWhen.dataTypeCheck
  [26] org.apache.spark.sql.catalyst.expressions.ComplexTypeMergingExpression.dataType
  [27] org.apache.spark.sql.catalyst.expressions.ComplexTypeMergingExpression.dataType$
  [28] org.apache.spark.sql.catalyst.expressions.CaseWhen.dataType
  [29] org.apache.spark.sql.catalyst.expressions.aggregate.HyperLogLogPlusPlus.update
  [30] org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$1.$anonfun$applyOrElse$2
  [31] org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$1.$anonfun$applyOrElse$2$adapted
  [32] org.apache.spark.sql.execution.aggregate.AggregationIterator$$anonfun$1$$Lambda$1534.1383512673.apply
  [33] org.apache.spark.sql.execution.aggregate.AggregationIterator.$anonfun$generateProcessRow$7
  [34] org.apache.spark.sql.execution.aggregate.AggregationIterator.$anonfun$generateProcessRow$7$adapted
  [35] org.apache.spark.sql.execution.aggregate.AggregationIterator$$Lambda$1555.725788712.apply
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual test and benchmark test:

Benchmark code | Before this PR (milliseconds) | After this PR (milliseconds)
--- | --- | ---
spark.range(100000000L).selectExpr("approx_count_distinct(case   when id % 400 > 20 then id else 0 end)").collect() | 56462 | 3794

Closes #29790 from wangyum/SPARK-32914.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-10-05 22:00:42 +09:00
Yuning Zhang 0fb2574d4e [SPARK-33042][SQL][TEST] Add a test case to ensure changes to spark.sql.optimizer.maxIterations take effect at runtime
### What changes were proposed in this pull request?

Add a test case to ensure changes to `spark.sql.optimizer.maxIterations` take effect at runtime.

### Why are the changes needed?

Currently, there is only one related test case: https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/internal/SQLConfSuite.scala#L156

However, this test case only checks that the value of the conf can be changed at runtime. It does not check that the updated value is actually used by the Optimizer.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

unit test

Closes #29919 from yuningzh-db/add_optimizer_test.

Authored-by: Yuning Zhang <yuning.zhang@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-10-05 20:25:57 +09:00
Liang-Chi Hsieh 37c806af2b [SPARK-32958][SQL] Prune unnecessary columns from JsonToStructs
### What changes were proposed in this pull request?

This patch proposes to do column pruning for `JsonToStructs` expression if we only require some fields from it.

### Why are the changes needed?

`JsonToStructs` takes a schema parameter that tells `JacksonParser` which fields need to be parsed. If `JsonToStructs` is followed by `GetStructField`, we can prune the schema to parse only the required field.
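
A hedged illustration of the pruning opportunity (assumes a `spark` session in scope; the field names are made up for the example):
```scala
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._
import spark.implicits._

val fullSchema = new StructType()
  .add("a", IntegerType).add("b", StringType).add("c", DoubleType)
val df = Seq("""{"a": 1, "b": "x", "c": 2.5}""").toDF("json")

// from_json is followed by GetStructField("a"); fields b and c never surface,
// so the schema handed to the parser can be pruned down to just `a`.
val onlyA = df.select(from_json($"json", fullSchema).getField("a").as("a"))
```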

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit test

Closes #29900 from viirya/SPARK-32958.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-10-03 14:55:02 -07:00
Takeshi Yamamuro 82721ce00b [SPARK-32741][SQL][FOLLOWUP] Run plan integrity check only for effective plan changes
### What changes were proposed in this pull request?

(This is a followup PR of #29585) The PR modified `RuleExecutor#isPlanIntegral` code for checking if a plan has globally-unique attribute IDs, but this check made Jenkins maven test jobs much longer (See [the Dongjoon comment](https://github.com/apache/spark/pull/29585#issuecomment-702461314) and thanks, dongjoon-hyun !). To recover running time for the Jenkins tests, this PR intends to update the code to run plan integrity check only for effective plans.

### Why are the changes needed?

To recover running time for Jenkins tests.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #29928 from maropu/PR29585-FOLLOWUP.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-10-02 22:16:19 +09:00
Yuming Wang 9996e252ad [SPARK-33026][SQL] Add numRows to metric of BroadcastExchangeExec
### What changes were proposed in this pull request?

This pr adds `numRows` to the metric and runtimeStatistics of `BroadcastExchangeExec`.

### Why are the changes needed?

[`JoinEstimation.estimateInnerOuterJoin`](d6a68e0b67/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/JoinEstimation.scala (L55-L156)) needs the row count. The [ShuffleExchangeExec](1c6dff7b5f/sql/core/src/main/scala/org/apache/spark/sql/execution/exchange/ShuffleExchangeExec.scala (L127)) has already added the row count, but `BroadcastExchangeExec` is missing it.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test.

Closes #29904 from wangyum/SPARK-33026.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-10-01 23:01:31 -07:00
Gabor Somogyi 991f7e81d4 [SPARK-32001][SQL] Create JDBC authentication provider developer API
### What changes were proposed in this pull request?
At the moment only the built-in JDBC connection providers can be used, but there is a need to support additional databases and use-cases. In this PR I'm proposing a new developer API named `JdbcConnectionProvider`. To show how an external JDBC connection provider can be implemented, I've created an example [here](https://github.com/gaborgsomogyi/spark-jdbc-connection-provider).
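
A hypothetical sketch of what implementing such a provider could look like (the method names and signatures here are assumptions for illustration; see the linked example project for the real API):
```scala
import java.sql.{Connection, Driver}

class MyCustomAuthConnectionProvider {
  // Declare which JDBC URLs/options this provider knows how to handle.
  def canHandle(driver: Driver, options: Map[String, String]): Boolean =
    options.get("url").exists(_.startsWith("jdbc:mydb:"))

  // Produce a connection, e.g. after performing a custom authentication step.
  def getConnection(driver: Driver, options: Map[String, String]): Connection = {
    val props = new java.util.Properties()
    options.foreach { case (k, v) => props.setProperty(k, v) }
    driver.connect(options("url"), props)
  }
}
```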

The PR contains the following changes:
* Added connection provider developer API
* Made the JDBC connection providers' constructors no-arg (needed to load them with the service loader)
* Connection providers are now loaded with the service loader
* Added tests to load providers independently
* Moved `SecurityConfigurationLock` into a central place because other areas will change global JVM security config

### Why are the changes needed?
Currently there is no possibility to plug in custom authentication.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
* Existing + additional unit tests
* Docker integration tests
* Tested manually the newly created external JDBC connection provider

Closes #29024 from gaborgsomogyi/SPARK-32001.

Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-10-02 13:04:40 +09:00
Cheng Su d6f3138352 [SPARK-32859][SQL] Introduce physical rule to decide bucketing dynamically
### What changes were proposed in this pull request?

This PR adds support for deciding on bucketed table scans dynamically based on the actual query plan. Currently bucketing is enabled by default (`spark.sql.sources.bucketing.enabled`=true), so for all bucketed tables in the query plan, we use a bucketed table scan (all input files for a bucket are read by the same task). This has the drawback that if the bucketed table scan provides no benefit at all (no join/groupby/etc. in the query), we don't need it, as it would restrict the # of tasks to the # of buckets and might hurt parallelism.

The feature is to add a physical plan rule right after `EnsureRequirements`:

The rule goes through plan nodes. For all operators which has "interesting partition" (i.e., require `ClusteredDistribution` or `HashClusteredDistribution`), check if the sub-plan for operator has `Exchange` and bucketed table scan (and only allow certain operators in plan (i.e. `Scan/Filter/Project/Sort/PartialAgg/etc`.), see details in `DisableUnnecessaryBucketedScan.disableBucketWithInterestingPartition`). If yes, disable the bucketed table scan in the sub-plan. In addition, disabling bucketed table scan if there's operator with interesting partition along the sub-plan.

Why the algorithm works is that if there's a shuffle between the bucketed table scan and operator with interesting partition, then bucketed table scan partitioning will be destroyed by the shuffle operator in the middle, and we don't need bucketed table scan for sure.

The idea of "interesting partition" is inspired from "interesting order" in "Access Path Selection in a Relational Database Management System"(http://www.inf.ed.ac.uk/teaching/courses/adbs/AccessPath.pdf), after discussion with cloud-fan .

### Why are the changes needed?

To avoid unnecessary bucketed scan in the query, and this is prerequisite for https://github.com/apache/spark/pull/29625 (decide bucketed sorted scan dynamically will be added later in that PR).

### Does this PR introduce _any_ user-facing change?

A new config `spark.sql.sources.bucketing.autoBucketedScan.enabled` is introduced which set to false by default (the rule is disabled by default as it can regress cached bucketed table query, see discussion in https://github.com/apache/spark/pull/29804#issuecomment-701151447). User can opt-in/opt-out by enabling/disabling the config, as we found in prod, some users rely on assumption of # of tasks == # of buckets when reading bucket table to precisely control # of tasks. This is a bad assumption but it does happen on our side, so leave a config here to allow them opt-out for the feature.

### How was this patch tested?

Added unit tests in `DisableUnnecessaryBucketedScanSuite.scala`

Closes #29804 from c21/bucket-rule.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-10-02 09:01:15 +09:00
ulysses e62d24717e [SPARK-32585][SQL] Support scala enumeration in ScalaReflection
### What changes were proposed in this pull request?

Add code in `ScalaReflection` to support scala enumeration and make enumeration type as string type in Spark.

### Why are the changes needed?

We support Java enums but fail with Scala enumerations; it's better to keep the behavior consistent.

Here is an example.

```
package test

object TestEnum extends Enumeration {
  type TestEnum = Value
  val E1, E2, E3 = Value
}
import TestEnum._
case class TestClass(i: Int,  e: TestEnum) {
}

import test._
Seq(TestClass(1, TestEnum.E1)).toDS
```

Before this PR
```
Exception in thread "main" java.lang.UnsupportedOperationException: No Encoder found for test.TestEnum.TestEnum
- field (class: "scala.Enumeration.Value", name: "e")
- root class: "test.TestClass"
  at org.apache.spark.sql.catalyst.ScalaReflection$.$anonfun$serializerFor$1(ScalaReflection.scala:567)
  at scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:69)
  at org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects(ScalaReflection.scala:882)
  at org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects$(ScalaReflection.scala:881)
```

After this PR
`org.apache.spark.sql.Dataset[test.TestClass] = [i: int, e: string]`

### Does this PR introduce _any_ user-facing change?

Yes, user can make case class which include scala enumeration field as dataset.

### How was this patch tested?

Add test.

Closes #29403 from ulysses-you/SPARK-32585.

Authored-by: ulysses <youxiduo@weidian.com>
Signed-off-by: Tathagata Das <tathagata.das1565@gmail.com>
2020-10-01 15:58:01 -04:00
yangjie01 0963fcd848 [SPARK-33024][SQL] Fix CodeGen fallback issue of UDFSuite in Scala 2.13
### What changes were proposed in this pull request?
After `SPARK-32851` set `CODEGEN_FACTORY_MODE` to `CODEGEN_ONLY` in the `sparkConf` of `SharedSparkSessionBase` used to construct the `SparkSession` in tests, the test case `SPARK-32459: UDF should not fail on WrappedArray` in s.sql.UDFSuite exposed a codegen fallback issue in Scala 2.13 as follows:

```
- SPARK-32459: UDF should not fail on WrappedArray *** FAILED ***
Caused by: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 47, Column 99: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 47, Column 99: No applicable constructor/method found for zero actual parameters; candidates are: "public scala.collection.mutable.Builder scala.collection.mutable.ArraySeq$.newBuilder(java.lang.Object)", "public scala.collection.mutable.Builder scala.collection.mutable.ArraySeq$.newBuilder(scala.reflect.ClassTag)", "public abstract scala.collection.mutable.Builder scala.collection.EvidenceIterableFactory.newBuilder(java.lang.Object)"
```

The root cause is that `WrappedArray` represents `mutable.ArraySeq` in Scala 2.13 and has a different `newBuilder` method signature.

The main change of this PR is to add a Scala 2.13-only code path to deal with the `WrappedArray` case match in Scala 2.13.

### Why are the changes needed?
We need to support a Scala 2.13 build

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
- Scala 2.12: Pass the Jenkins or GitHub Action

- Scala 2.13: All tests passed.

Do the following:

```
dev/change-scala-version.sh 2.13
mvn clean install -DskipTests  -pl sql/core -Pscala-2.13 -am
mvn test -pl sql/core -Pscala-2.13
```

**Before**
```
Tests: succeeded 8540, failed 1, canceled 1, ignored 52, pending 0
*** 1 TEST FAILED ***

```

**After**

```
Tests: succeeded 8541, failed 0, canceled 1, ignored 52, pending 0
All tests passed.
```

Closes #29903 from LuciferYang/fix-udfsuite.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-10-01 08:37:07 -05:00
Max Gekk 5651284c3b [SPARK-32992][SQL] Map Oracle's ROWID type to StringType in read via JDBC
### What changes were proposed in this pull request?
Convert the `ROWID` type in the Oracle JDBC dialect to Catalyst's `StringType`. The doc for Oracle 19c says explicitly that the type must be string: https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/Data-Types.html#GUID-AEF1FE4C-2DE5-4BE7-BB53-83AD8F1E34EF
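
A hedged sketch of the mapping (the hook shown here is simplified; treat the exact match condition and helper name as assumptions):
```scala
import java.sql.Types
import org.apache.spark.sql.types.{DataType, StringType}

// Map Oracle's ROWID JDBC type to Catalyst's StringType; return None for
// types this dialect does not override.
def oracleTypeToCatalyst(sqlType: Int, typeName: String): Option[DataType] =
  if (sqlType == Types.ROWID || typeName == "ROWID") Some(StringType) else None
```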

### Why are the changes needed?
To avoid the exception showed in https://stackoverflow.com/questions/52244492/spark-jdbc-dataframereader-fails-to-read-oracle-table-with-datatype-as-rowid

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
N/A

Closes #29884 from MaxGekk/jdbc-oracle-rowid-string.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-10-01 14:50:32 +09:00
Takeshi Yamamuro 3a299aa648 [SPARK-32741][SQL] Check if the same ExprId refers to the unique attribute in logical plans
### What changes were proposed in this pull request?

Some plan transformations (e.g., `RemoveNoopOperators`) implicitly assume the same `ExprId` refers to the unique attribute. But, `RuleExecutor` does not check this integrity between logical plan transformations. So, this PR intends to add this check in `isPlanIntegral` of `Analyzer`/`Optimizer`.

This PR comes from the talk with cloud-fan viirya in https://github.com/apache/spark/pull/29485#discussion_r475346278

### Why are the changes needed?

For better logical plan integrity checking.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #29585 from maropu/PlanIntegrityTest.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-09-30 21:37:29 +09:00
Yuming Wang 711d8dd28a [SPARK-33018][SQL] Fix estimate statistics issue if child has 0 bytes
### What changes were proposed in this pull request?

This pr fix estimate statistics issue if child has 0 bytes.

### Why are the changes needed?
The `sizeInBytes` can be `0` when AQE and CBO are enabled (`spark.sql.adaptive.enabled`=true, `spark.sql.cbo.enabled`=true and `spark.sql.cbo.planStats.enabled`=true). This can generate an incorrect BroadcastJoin, resulting in a driver OOM. For example:
![SPARK-33018](https://user-images.githubusercontent.com/5399861/94457606-647e3d00-01e7-11eb-85ee-812ae6efe7bb.jpg)

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual test.

Closes #29894 from wangyum/SPARK-33018.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-09-29 16:46:04 +00:00
tanel.kiis@gmail.com 90e86f6fac [SPARK-32970][SPARK-32019][SQL][TEST] Reduce the runtime of an UT for
### What changes were proposed in this pull request?

The UT for SPARK-32019 (#28853) tries to write about 16GB of data to disk. We must change `spark.sql.files.maxPartitionBytes` to a smaller value to check the correct behavior with less data. By default it is `128MB`.
The other parameters in this UT are also changed to smaller values to keep the behavior the same.

### Why are the changes needed?

The runtime of this one UT can be over 7 minutes on Jenkins. After the change it is few seconds.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing UT

Closes #29842 from tanelk/SPARK-32970.

Authored-by: tanel.kiis@gmail.com <tanel.kiis@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-09-29 16:51:44 +09:00
Liang-Chi Hsieh 202115e7cd [SPARK-32948][SQL] Optimize to_json and from_json expression chain
### What changes were proposed in this pull request?

This patch proposes to optimize from_json + to_json expression chain.

### Why are the changes needed?

To optimize json expression chain that could be manually generated or generated automatically during query optimization.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit test.

Closes #29828 from viirya/SPARK-32948.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-09-28 22:22:47 -07:00
Max Gekk 1b60ff5afe [MINOR][DOCS] Document when current_date and current_timestamp are evaluated
### What changes were proposed in this pull request?
Explicitly document that `current_date` and `current_timestamp` are executed at the start of query evaluation. And all calls of `current_date`/`current_timestamp` within the same query return the same value

### Why are the changes needed?
Users could expect that `current_date` and `current_timestamp` return the current date/timestamp at the moment of query execution, but in fact the functions are folded by the optimizer at the start of query evaluation:
0df8dd6073/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/finishAnalysis.scala (L71-L91)
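
For example (illustrative query), both calls below are replaced with the same literal, so the comparison is always true:

```scala
// All invocations of current_timestamp() inside a single query are folded to
// the same value at the start of query evaluation.
spark.sql("SELECT current_timestamp() = current_timestamp() AS same").show()
```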

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
by running `./dev/scalastyle`.

Closes #29892 from MaxGekk/doc-current_date.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-09-29 05:20:12 +00:00
Max Gekk 68cd5677ae [SPARK-33015][SQL] Compute the current date only once
### What changes were proposed in this pull request?
Compute the current date at the specified time zone using the timestamp taken at the start of query evaluation.

### Why are the changes needed?
According to the doc for [current_date()](http://spark.apache.org/docs/latest/api/sql/#current_date), the current date should be computed at the start of query evaluation, but it can currently be computed multiple times. As a consequence, the function can return different values if the query is executed at the border of two dates.

### Does this PR introduce _any_ user-facing change?
Yes

### How was this patch tested?
By existing test suites `ComputeCurrentTimeSuite` and `DateExpressionsSuite`.

Closes #29889 from MaxGekk/fix-current_date.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-09-29 05:13:01 +00:00
gengjiaan a53fc9b7ae [SPARK-27951][SQL][FOLLOWUP] Improve the window function nth_value
### What changes were proposed in this pull request?
https://github.com/apache/spark/pull/29604 supports the ANSI SQL NTH_VALUE.
We should override the `prettyName` and `sql`.
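
As a toy illustration of what such an override buys (this is not the Catalyst `Expression` API; all names below are made up):

```scala
// With the override, the generated SQL string shows the user-facing function
// name and the ignoreNulls flag instead of a class-derived default name.
trait Expr { def prettyName: String; def sql: String }

case class NthValueSketch(input: String, offset: Int, ignoreNulls: Boolean) extends Expr {
  override def prettyName: String = "nth_value"
  override def sql: String = {
    val suffix = if (ignoreNulls) " IGNORE NULLS" else ""
    s"$prettyName($input, $offset)$suffix"
  }
}

NthValueSketch("salary", 2, ignoreNulls = true).sql  // "nth_value(salary, 2) IGNORE NULLS"
```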

### Why are the changes needed?
Make the name of nth_value correct and show the ignoreNulls parameter correctly.

### Does this PR introduce _any_ user-facing change?
'No'.

### How was this patch tested?
Jenkins test.

Closes #29886 from beliefer/improve-nth_value.

Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: beliefer <beliefer@163.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-09-29 09:54:43 +09:00
tanel.kiis@gmail.com f41ba2a2f3 [SPARK-32927][SQL] Bitwise OR, AND and XOR should have similar canonicalization rules to boolean OR and AND
### What changes were proposed in this pull request?

Add canonicalization rules for commutative bitwise operations.
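
A toy sketch of the underlying idea, ordering the operands of a commutative operator by a deterministic key (the real rule works on Catalyst expression trees):

```scala
// Sort the operands of a commutative operator so that semantically equal
// expressions end up with the same canonical form.
case class BitOr(left: String, right: String)

def canonicalize(e: BitOr): BitOr = {
  val Seq(l, r) = Seq(e.left, e.right).sortBy(_.hashCode)
  BitOr(l, r)
}

canonicalize(BitOr("a", "b")) == canonicalize(BitOr("b", "a"))  // true
```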

### Why are the changes needed?

Canonical form is used in many other optimization rules. This reduces the number of cases where plans with identical results are considered to be distinct.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

UT

Closes #29794 from tanelk/SPARK-32927.

Lead-authored-by: tanel.kiis@gmail.com <tanel.kiis@gmail.com>
Co-authored-by: Tanel Kiis <tanel.kiis@reach-u.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-09-28 12:22:15 +09:00
Kris Mok 9a155d42a3 [SPARK-32999][SQL] Use Utils.getSimpleName to avoid hitting Malformed class name in TreeNode
### What changes were proposed in this pull request?

Use `Utils.getSimpleName` to avoid hitting `Malformed class name` error in `TreeNode`.

### Why are the changes needed?

On older JDK versions (e.g. JDK8u), nested Scala classes may cause `java.lang.Class.getSimpleName` to throw a `java.lang.InternalError: Malformed class name` error.

Similar to https://github.com/apache/spark/pull/29050, we should use Spark's `Utils.getSimpleName` utility function in place of `Class.getSimpleName` to avoid hitting the issue.
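
A sketch of the workaround pattern (the real helper is `org.apache.spark.util.Utils.getSimpleName`; the fallback logic below is simplified):

```scala
// Try the JDK's getSimpleName first; on older JDKs it can throw InternalError
// for some nested Scala classes, in which case derive a name from getName.
def safeSimpleName(cls: Class[_]): String =
  try {
    cls.getSimpleName
  } catch {
    case _: InternalError =>
      // Strip the package and outer-class prefixes from the full name.
      cls.getName.split('.').last.split('$').filter(_.nonEmpty).lastOption.getOrElse(cls.getName)
  }
```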

### Does this PR introduce _any_ user-facing change?

Fixes a bug that throws an error when invoking `TreeNode.nodeName`, otherwise no changes.

### How was this patch tested?

Added new unit test case in `TreeNodeSuite`. Note that the test case assumes the test code can trigger the expected error, otherwise it'll skip the test safely, for compatibility with newer JDKs.

Manually tested on JDK8u and JDK11u and observed expected behavior:
- JDK8u: the test case triggers the "Malformed class name" issue and the fix works;
- JDK11u: the test case does not trigger the "Malformed class name" issue, and the test case is safely skipped.

Closes #29875 from rednaxelafx/spark-32999-getsimplename.

Authored-by: Kris Mok <kris.mok@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-09-26 16:03:59 -07:00
gatorsmile e887c639a7 [SPARK-32931][SQL] Unevaluable Expressions are not Foldable
### What changes were proposed in this pull request?
Unevaluable expressions are not foldable because we don't have an eval implementation for them. This PR is to clean up the code and enforce it.
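
A toy sketch of the invariant being enforced (not the actual Catalyst trait):

```scala
// Anything that cannot be evaluated must never report itself as foldable,
// otherwise ConstantFolding would try to evaluate it at optimization time.
trait ExprSketch {
  def foldable: Boolean
  def eval(): Any
}

trait UnevaluableSketch extends ExprSketch {
  final override def foldable: Boolean = false
  final override def eval(): Any =
    throw new UnsupportedOperationException(s"Cannot evaluate expression: $this")
}
```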

### Why are the changes needed?
Ensure that we will not hit the weird cases that trigger ConstantFolding.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
The existing tests.

Closes #29798 from gatorsmile/refactorUneval.

Lead-authored-by: gatorsmile <gatorsmile@gmail.com>
Co-authored-by: Xiao Li <gatorsmile@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-09-25 07:27:29 +00:00
Yuanjian Li 9e6882feca [SPARK-32885][SS] Add DataStreamReader.table API
### What changes were proposed in this pull request?
This PR aims to add a new `table` API in DataStreamReader, which is similar to the table API in DataFrameReader.

### Why are the changes needed?
Users can directly use this API to get a Streaming DataFrame on a table. Below is a simple example:

Application 1 for initializing and starting the streaming job:

```
val path = "/home/yuanjian.li/runtime/to_be_deleted"
val tblName = "my_table"

// Write some data to `my_table`
spark.range(3).write.format("parquet").option("path", path).saveAsTable(tblName)

// Read the table as a streaming source, write result to destination directory
val table = spark.readStream.table(tblName)
table.writeStream.format("parquet").option("checkpointLocation", "/home/yuanjian.li/runtime/to_be_deleted_ck").start("/home/yuanjian.li/runtime/to_be_deleted_2")
```

Application 2 for appending new data:

```
// Append new data into the path
spark.range(5).write.format("parquet").option("path", "/home/yuanjian.li/runtime/to_be_deleted").mode("append").save()
```

Check result:
```
// The destination directory should contain all written data
spark.read.parquet("/home/yuanjian.li/runtime/to_be_deleted_2").show()
```

### Does this PR introduce _any_ user-facing change?
Yes, a new API added.

### How was this patch tested?
New UT added and integrated testing.

Closes #29756 from xuanyuanking/SPARK-32885.

Authored-by: Yuanjian Li <yuanjian.li@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-09-25 06:50:24 +00:00
ulysses f2fc966674 [SPARK-32877][SQL][TEST] Add test for Hive UDF complex decimal type
### What changes were proposed in this pull request?

Add test to cover Hive UDF whose input contains complex decimal type.
Add comment to explain why we can't make `HiveSimpleUDF` extend `ImplicitTypeCasts`.

### Why are the changes needed?

For better test coverage of the Hive behaviors that we are (or are not) compatible with.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Add test.

Closes #29863 from ulysses-you/SPARK-32877-test.

Authored-by: ulysses <youxiduo@weidian.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-09-24 22:16:05 -07:00
Terry Kim e9c98c910a [SPARK-32990][SQL] Migrate REFRESH TABLE to use UnresolvedTableOrView to resolve the identifier
### What changes were proposed in this pull request?

This PR proposes to migrate `REFRESH TABLE` to use `UnresolvedTableOrView` to resolve the table/view identifier. This allows consistent resolution rules (temp view first, etc.) to be applied for both v1/v2 commands. More info about the consistent resolution rule proposal can be found in [JIRA](https://issues.apache.org/jira/browse/SPARK-29900) or [proposal doc](https://docs.google.com/document/d/1hvLjGA8y_W_hhilpngXVub1Ebv8RsMap986nENCFnrg/edit?usp=sharing).

### Why are the changes needed?

The current behavior is not consistent between v1 and v2 commands when resolving a temp view.
In v2, the `t` in the following example is resolved to a table:
```scala
sql("CREATE TABLE testcat.ns.t (id bigint) USING foo")
sql("CREATE TEMPORARY VIEW t AS SELECT 2")
sql("USE testcat.ns")
sql("REFRESH TABLE t") // 't' is resolved to testcat.ns.t
```
whereas in v1, the `t` is resolved to a temp view:
```scala
sql("CREATE DATABASE test")
sql("CREATE TABLE spark_catalog.test.t (id bigint) USING csv")
sql("CREATE TEMPORARY VIEW t AS SELECT 2")
sql("USE spark_catalog.test")
sql("REFRESH TABLE t") // 't' is resolved to a temp view
```

### Does this PR introduce _any_ user-facing change?

After this PR, `REFRESH TABLE t` is resolved to a temp view `t` instead of `testcat.ns.t`.

### How was this patch tested?

Added a new test

Closes #29866 from imback82/refresh_table_consistent.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-09-25 04:29:09 +00:00
Dongjoon Hyun d7aa3b56e8 [SPARK-32889][SQL][TESTS][FOLLOWUP] Skip special column names test in Hive 1.2
### What changes were proposed in this pull request?

This PR is a followup of SPARK-32889 in order to ignore the special column names test in `hive-1.2` profile.

### Why are the changes needed?

Hive 1.2 is too old to support special column names because it doesn't use Apache ORC. This will recover our `hive-1.2` Jenkins job.
- https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-2.7-hive-1.2/
- https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7-hive-1.2/

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the test with Hive 1.2 profile.

Closes #29867 from dongjoon-hyun/SPARK-32889-2.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-09-24 16:22:08 -07:00
Chao Sun 8ccfbc114e [SPARK-32381][CORE][SQL] Move and refactor parallel listing & non-location sensitive listing to core

### What changes were proposed in this pull request?

This moves and refactors the parallel listing utilities from `InMemoryFileIndex` to Spark core so they can be reused by modules besides SQL. Along the way, this also does some cleanups/refactorings:

- Created a `HadoopFSUtils` class under core
- Moved `InMemoryFileIndex.bulkListLeafFiles` into `HadoopFSUtils.parallelListLeafFiles`. It now depends on a `SparkContext` instead of `SparkSession` in SQL. Also added a few parameters which used to be read from `SparkSession.conf`: `ignoreMissingFiles`, `ignoreLocality`, `parallelismThreshold`, `parallelismMax` and `filterFun` (for additional filtering support, but we may be able to merge this with the `filter` parameter in the future). A simplified sketch of this kind of listing is shown after this list.
- Moved `InMemoryFileIndex.listLeafFiles` into `HadoopFSUtils.listLeafFiles` with similar changes above.
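
A simplified sketch of this kind of listing, assuming Hadoop's `FileSystem` API and local-thread parallelism (the actual `HadoopFSUtils` signatures differ, and the real implementation distributes the work via a `SparkContext` and also handles missing files, locality and extra filtering):

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileStatus, Path}

// Recurse into directories and list children in parallel once the number of
// paths to list crosses a threshold.
def listLeafFiles(
    paths: Seq[Path],
    hadoopConf: Configuration,
    parallelismThreshold: Int): Seq[FileStatus] = {
  def listOne(path: Path): Seq[FileStatus] = {
    val fs = path.getFileSystem(hadoopConf)
    val (dirs, files) = fs.listStatus(path).toSeq.partition(_.isDirectory)
    // Leaf files at this level, plus the leaf files of all subdirectories.
    files ++ listLeafFiles(dirs.map(_.getPath), hadoopConf, parallelismThreshold)
  }
  if (paths.length > parallelismThreshold) paths.par.flatMap(listOne).seq
  else paths.flatMap(listOne)
}
```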

### Why are the changes needed?

Currently the locality-aware parallel listing mechanism only applies to `InMemoryFileIndex`. By moving this to core, we can potentially reuse the same mechanism for other code paths as well.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Since this is mostly a refactoring, it relies on existing unit tests such as those for `InMemoryFileIndex`.

Closes #29471 from sunchao/SPARK-32381.

Lead-authored-by: Chao Sun <sunchao@apache.org>
Co-authored-by: Holden Karau <hkarau@apple.com>
Co-authored-by: Chao Sun <sunchao@uber.com>
Signed-off-by: Holden Karau <hkarau@apple.com>
2020-09-24 10:58:52 -07:00
Russell Spitzer b3f0087e39 [SPARK-32977][SQL][DOCS] Fix JavaDoc on Default Save Mode
### What changes were proposed in this pull request?

The default is always `ErrorIfExists`, regardless of the DataSource version. Fix the JavaDoc to reflect this.
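
For example (assuming an existing DataFrame `df` and an output `path`), leaving the mode unset behaves the same as spelling out the default:

```scala
import org.apache.spark.sql.SaveMode

// With no explicit mode, the write fails if `path` already contains data...
df.write.format("parquet").save(path)

// ...because it is equivalent to the explicit default:
df.write.format("parquet").mode(SaveMode.ErrorIfExists).save(path)
```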

### Why are the changes needed?

To fix documentation

### Does this PR introduce _any_ user-facing change?

Doc change.

### How was this patch tested?

Manual.

Closes #29853 from RussellSpitzer/SPARK-32977.

Authored-by: Russell Spitzer <russell.spitzer@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2020-09-23 20:02:20 -07:00
Michael Munday faeb71b39d [SPARK-32950][SQL] Remove unnecessary big-endian code paths
### What changes were proposed in this pull request?
Remove unnecessary code.

### Why are the changes needed?

General housekeeping. Might be a slight performance improvement, especially on big-endian systems.

There is no need for separate code paths for big- and little-endian
platforms in putDoubles and putFloats anymore (since PR #24861). On
all platforms values are encoded in native byte order and can just
be copied directly.
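
A sketch of the idea with plain NIO (illustrative, not the actual column-writer code):

```scala
import java.nio.{ByteBuffer, ByteOrder}

// When values are already serialized in the platform's native byte order,
// a writer can copy the raw bytes directly instead of branching on endianness
// and swapping bytes per element.
val doubles = Array(1.0, 2.0, 3.0)
val src = new Array[Byte](doubles.length * java.lang.Double.BYTES)
ByteBuffer.wrap(src).order(ByteOrder.nativeOrder()).asDoubleBuffer().put(doubles)

// Direct, endianness-agnostic copy of the encoded bytes.
val dst = new Array[Byte](src.length)
System.arraycopy(src, 0, dst, 0, src.length)
```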

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Existing tests.

Closes #29815 from mundaym/clean-putfloats.

Authored-by: Michael Munday <mike.munday@ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-09-23 12:38:06 -05:00
Michael Munday 383bb4af00 [SPARK-32892][CORE][SQL] Fix hash functions on big-endian platforms
MurmurHash3 and xxHash64 interpret sequences of bytes as integers
encoded in little-endian byte order. This requires a byte reversal
on big endian platforms.

I've left the hashInt and hashLong functions as-is for now. My
interpretation of these functions is that they perform the hash on
the integer value as if it were serialized in little-endian byte
order. Therefore no byte reversal is necessary.
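
A sketch of the intended read semantics (illustrative helper, not the actual hash implementation):

```scala
import java.nio.{ByteBuffer, ByteOrder}

// The hash algorithms consume 4-byte words as little-endian integers, so the
// load must be little-endian on every platform; on big-endian machines this
// implies a byte swap after a native load.
def getIntLittleEndian(bytes: Array[Byte], offset: Int): Int =
  ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN).getInt(offset)
```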

### What changes were proposed in this pull request?
Modify hash functions to produce correct results on big-endian platforms.

### Why are the changes needed?
Hash functions produce incorrect results on big-endian platforms which, amongst other potential issues, causes test failures.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing tests run on the IBM Z (s390x) platform which uses a big-endian byte order.

Closes #29762 from mundaym/fix-hashes.

Authored-by: Michael Munday <mike.munday@ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-09-23 12:36:46 -05:00