ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
HyukjinKwon	7cc0f0e9a7	[SPARK-28894][SQL][TESTS] Add a clue to make it easier to debug via Jenkins's test results ### What changes were proposed in this pull request? See https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/109834/testReport/junit/org.apache.spark.sql/SQLQueryTestSuite/ ![Screen Shot 2019-08-28 at 4 08 58 PM](https://user-images.githubusercontent.com/6477701/63833484-2a23ea00-c9ae-11e9-91a1-0859cb183fea.png) ```xml <?xml version="1.0" encoding="UTF-8"?> <testsuite hostname="C02Y52ZLJGH5" name="org.apache.spark.sql.SQLQueryTestSuite" tests="3" errors="0" failures="0" skipped="0" time="14.475"> ... <testcase classname="org.apache.spark.sql.SQLQueryTestSuite" name="sql - Scala UDF" time="6.703"> </testcase> <testcase classname="org.apache.spark.sql.SQLQueryTestSuite" name="sql - Regular Python UDF" time="4.442"> </testcase> <testcase classname="org.apache.spark.sql.SQLQueryTestSuite" name="sql - Scalar Pandas UDF" time="3.33"> </testcase> <system-out/> <system-err/> </testsuite> ``` The root cause seems to be a bug in SBT: it truncates the test name based on the last dot. https://github.com/sbt/sbt/issues/2949 https://github.com/sbt/sbt/blob/v0.13.18/testing/src/main/scala/sbt/JUnitXmlTestsListener.scala#L71-L79 I tried to find a better way but couldn't find one. Therefore, this PR proposes a workaround by appending the test file name into the assert log: ```diff [info] - inner-join.sql * FAILED * (4 seconds, 306 milliseconds) + [info] inner-join.sql [info] Expected "1 a [info] 1 a [info] 1 b [info] 1[]", but got "1 a [info] 1 a [info] 1 b [info] 1[ b]" Result did not match for query #6 [info] SELECT tb.* FROM ta INNER JOIN tb ON ta.a = tb.a AND ta.tag = tb.tag (SQLQueryTestSuite.scala:377) [info] org.scalatest.exceptions.TestFailedException: [info] at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:528) ``` It will at least spare us from searching the full logs to identify which test file failed after clicking a failed test. 
Note that this PR does not fully fix the issue but only fixes the logs of its failed tests. ### Why are the changes needed? To make Jenkins logs easier to debug. Otherwise, we have to open the full logs and search for which test failed. ### Does this PR introduce any user-facing change? It will print out the file name of failed tests in Jenkins' test reports. ### How was this patch tested? Manually tested but Jenkins tests are required in this PR. Now it at least shows which file it is: ![Screen Shot 2019-08-30 at 10 16 32 PM](https://user-images.githubusercontent.com/6477701/64023705-de22a200-cb73-11e9-8806-2e98ad35adef.png) Closes #25630 from HyukjinKwon/SPARK-28894-1. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-08-30 15:10:40 -07:00
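The truncation described in SPARK-28894 can be illustrated with a small sketch. This is not SBT's code: `truncate_at_last_dot` and the sample test name are hypothetical, chosen so that the result matches the `name` attribute seen in the XML report quoted in the commit message.

```python
def truncate_at_last_dot(test_name: str) -> str:
    """Keep only the text after the last dot, mimicking the reported truncation."""
    return test_name.rsplit(".", 1)[-1]


# Hypothetical full test name: the file name before the last dot is lost.
print(truncate_at_last_dot("udf/udf-inner-join.sql - Scala UDF"))  # sql - Scala UDF
```

Because every test name ends in a `.sql` file name followed by a suffix, all reports collapse to names starting with `sql - `, which is why the commit adds the file name to the assertion message instead.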
younggyu chun	3b07a4eb28	[SPARK-27931][SQL] Accept "true", "yes", "1", "false", "no", "0", and unique prefixes as input and trim input for the boolean data type ## What changes were proposed in this pull request? This PR aims to add "true", "yes", "1", "false", "no", "0", and unique prefixes as input for the boolean data type and ignore input whitespace. Please see the following links for the string representations that other databases use for the boolean type. https://www.postgresql.org/docs/devel/datatype-boolean.html https://docs.aws.amazon.com/redshift/latest/dg/r_Boolean_type.html ## How was this patch tested? Added new tests to CastSuite. Closes #25458 from younggyuchun/SPARK-27931. Authored-by: younggyu chun <younggyuchun@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-08-30 14:18:13 -07:00
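A rough sketch of the parsing rule in SPARK-27931 (accepted words, unique prefixes, trimmed input) might look like the following. It is not Spark's implementation; `parse_boolean` is a hypothetical helper, and case-insensitive matching and returning `None` for unrecognized input are assumptions made for illustration.

```python
from typing import Optional

TRUE_WORDS = ("true", "yes", "1")
FALSE_WORDS = ("false", "no", "0")


def parse_boolean(text: str) -> Optional[bool]:
    """Return True/False for an accepted word or unique prefix, else None."""
    s = text.strip().lower()  # trim whitespace; lowercasing is an assumption
    if not s:
        return None
    # Collect the boolean values of every accepted word that starts with the input.
    matches = {
        value
        for words, value in ((TRUE_WORDS, True), (FALSE_WORDS, False))
        for word in words
        if word.startswith(s)
    }
    # A prefix is usable only if it points at exactly one boolean value.
    return matches.pop() if len(matches) == 1 else None


print(parse_boolean("  YES "))  # True
print(parse_boolean("f"))       # False
print(parse_boolean("maybe"))   # None
```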
Burak Yavuz	827969399b	[SPARK-28668][SQL] Support V2SessionCatalog for ALTER TABLE ### What changes were proposed in this pull request? Adds support for the V2SessionCatalog for ALTER TABLE statements. Implementation changes are ~50 loc. The rest is just test refactoring. ### Why are the changes needed? To allow V2 DataSources to plug in through a configurable plugin interface without requiring the explicit use of catalog identifiers, and leverage ALTER TABLE statements. ### How was this patch tested? By re-using existing tests in DataSourceV2SQLSuite. Closes #25502 from brkyvz/alterV3. Authored-by: Burak Yavuz <brkyvz@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-08-30 14:16:47 +08:00
Wenchen Fan	f8f7c52f12	[SPARK-28899][SQL][TEST] merge the testing in-memory v2 catalogs from catalyst and core ### What changes were proposed in this pull request? There are 2 in-memory `TableCatalog` and `Table` implementations for testing, in sql/catalyst and sql/core. This PR merges them. After merging, there are 3 classes: 1. `InMemoryTable` 2. `InMemoryTableCatalog` 3. `StagingInMemoryTableCatalog` For better maintainability, these 3 classes are put in 3 different files. ### Why are the changes needed? reduce duplicated code ### Does this PR introduce any user-facing change? no ### How was this patch tested? N/A Closes #25610 from cloud-fan/dsv2-test. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Ryan Blue <blue@apache.org>	2019-08-29 12:56:19 -07:00
Gengliang Wang	24655583f1	[SPARK-28495][SQL][FOLLOW-UP] Disallow conversions between timestamp and long in ANSI mode ### What changes were proposed in this pull request? Disallow conversions between `timestamp` type and `long` type in table insertion with ANSI store assignment policy. ### Why are the changes needed? In the PR https://github.com/apache/spark/pull/25581, timestamp type is allowed to be converted to long type, since timestamp type is represented by long type internally, and both legacy mode and strict mode allow the conversion. After reconsideration, I think we should disallow it. As per ANSI SQL section "4.4.2 Characteristics of numbers": > A number is assignable only to sites of numeric type. In PostgreSQL, the conversion between timestamp and long is also disallowed. ### Does this PR introduce any user-facing change? Conversion between timestamp and long is disallowed in table insertion with ANSI store assignment policy. ### How was this patch tested? Unit test Closes #25615 from gengliangwang/disallowTimeStampToLong. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-08-29 19:59:24 +08:00
Matt Hawes	137b20b964	[SPARK-28818][SQL] Respect source column nullability in the arrays created by `freqItems()` ### What changes were proposed in this pull request? This PR replaces the hard-coded non-nullability of the array elements returned by `freqItems()` with a nullability that reflects the original schema. Essentially [the functional change](https://github.com/apache/spark/pull/25575/files#diff-bf59bb9f3dc351f5bf6624e5edd2dcf4R122) to the schema generation is: ``` StructField(name + "_freqItems", ArrayType(dataType, false)) ``` Becomes: ``` StructField(name + "_freqItems", ArrayType(dataType, originalField.nullable)) ``` Respecting the original nullability prevents issues when Spark depends on `ArrayType`'s `containsNull` being accurate. The example that uncovered this is calling `collect()` on the dataframe (see [ticket](https://issues.apache.org/jira/browse/SPARK-28818) for full repro), though it's likely that there are several places where this could cause a problem. I've also refactored a small amount of the surrounding code to remove some unnecessary steps and group together related operations. ### Why are the changes needed? It fixes a bug that currently prevents users from calling `df.freqItems.collect()`, along with potentially causing other, as yet unknown, issues. ### Does this PR introduce any user-facing change? Nullability of columns when calling freqItems on them is now respected after the change. ### How was this patch tested? I added a test that specifically tests the carry-through of the nullability as well as explicitly calling `collect()` to catch the exact regression that was observed. I also ran the test against the old version of the code and it fails as expected. Closes #25575 from MGHawes/mhawes/SPARK-28818. Authored-by: Matt Hawes <mhawes@palantir.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-08-29 10:49:10 +09:00
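The schema change in SPARK-28818 can be mimicked without Spark as a minimal sketch. The `ArrayType` and `StructField` classes below are simplified stand-ins written for this illustration (they are not the Spark classes), and `freq_items_field` is a hypothetical helper name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ArrayType:  # stand-in, not org.apache.spark.sql.types.ArrayType
    element_type: str
    contains_null: bool


@dataclass(frozen=True)
class StructField:  # stand-in, not org.apache.spark.sql.types.StructField
    name: str
    data_type: object
    nullable: bool = True


def freq_items_field(original: StructField) -> StructField:
    """Build the `<name>_freqItems` field, carrying over the source nullability."""
    # Before the fix the second argument was hard-coded to False.
    return StructField(
        original.name + "_freqItems",
        ArrayType(original.data_type, original.nullable),
    )


print(freq_items_field(StructField("a", "int", nullable=True)).data_type.contains_null)  # True
```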
Yuming Wang	1b404b9b99	[SPARK-28890][SQL] Upgrade Hive Metastore Client to the 3.1.2 for Hive 3.1 ### What changes were proposed in this pull request? Hive 3.1.2 has been released. This PR upgrades the Hive Metastore Client to 3.1.2 for Hive 3.1. Hive 3.1.2 release notes: https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12344397&styleName=Html&projectId=12310843 ### Why are the changes needed? This is an improvement to support the newly released 3.1.2. Otherwise, Spark throws `UnsupportedOperationException` if the user sets `spark.sql.hive.metastore.version=3.1.2`: ```scala Exception in thread "main" java.lang.UnsupportedOperationException: Unsupported Hive Metastore version (3.1.2). Please set spark.sql.hive.metastore.version with a valid version. at org.apache.spark.sql.hive.client.IsolatedClientLoader$.hiveVersion(IsolatedClientLoader.scala:109) ``` ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Existing UT Closes #25604 from wangyum/SPARK-28890. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-08-28 09:16:54 -07:00
Gengliang Wang	9d6bec183c	[SPARK-28730][SPARK-28495][SQL][FOLLOW-UP] Revise the doc of option spark.sql.storeAssignmentPolicy ### What changes were proposed in this pull request? Revise the documentation of SQL option `spark.sql.storeAssignmentPolicy`. ### Why are the changes needed? 1. Need to point out that the ANSI mode is mostly the same as PostgreSQL 2. Need to point out that Legacy mode allows type coercion as long as it is a valid cast 3. Better examples. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Unit test Closes #25605 from gengliangwang/reviseDoc. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-08-28 19:59:53 +08:00
Yuming Wang	e3b32da027	[SPARK-25474][SQL][DOCS] Update the docs for spark.sql.statistics.fallBackToHdfs ## What changes were proposed in this pull request? This PR updates `spark.sql.statistics.fallBackToHdfs`'s doc: 1. This flag is effective only for Hive tables. 2. For a non-partitioned data source table, the size is automatically recalculated if table statistics are not available. 3. For a partitioned data source table, the size is 'spark.sql.defaultSizeInBytes' if table statistics are not available. Related code: - Non-partitioned data source table: [SizeInBytesOnlyStatsPlanVisitor.default()](`98be8953c7/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/SizeInBytesOnlyStatsPlanVisitor.scala (L54-L57)`) -> [LogicalRelation.computeStats()](`a1c1dd3484/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/LogicalRelation.scala (L42-L46)`) -> [HadoopFsRelation.sizeInBytes()](`c0632cec04/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/HadoopFsRelation.scala (L72-L75)`) -> [PartitioningAwareFileIndex.sizeInBytes()](`b276788d57/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileIndex.scala (L103)`) `PartitioningAwareFileIndex.sizeInBytes()` is calculated by [`allFiles().map(_.getLen).sum`](`b276788d57/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PartitioningAwareFileIndex.scala (L103)`) if table statistics are not available. 
- Partitioned data source table: [SizeInBytesOnlyStatsPlanVisitor.default()](`98be8953c7/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/statsEstimation/SizeInBytesOnlyStatsPlanVisitor.scala (L54-L57)`) -> [LogicalRelation.computeStats()](`a1c1dd3484/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/LogicalRelation.scala (L42-L46)`) -> [CatalogFileIndex.sizeInBytes](`5d672b7f3e/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/CatalogFileIndex.scala (L41)`) `CatalogFileIndex.sizeInBytes` is [spark.sql.defaultSizeInBytes](`c30b5297bc/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala (L387)`) if table statistics are not available. ## How was this patch tested? N/A Closes #24715 from wangyum/SPARK-25474. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-08-28 19:15:26 +08:00
hemanth meka	6252c54e39	[SPARK-23519][SQL] create view should work from query with duplicate output columns What changes were proposed in this pull request? Moving the call for checkColumnNameDuplication out of generateViewProperties. This way we can choose if checkColumnNameDuplication will be performed on the analyzed or the aliased plan without having to pass an additional argument (aliasedPlan) to generateViewProperties. Before this PR, the column name duplication check was performed on the query output of the SQL below (c1, c1); this PR makes it check the user-provided schema of the view definition (c1, c2) instead. Why are the changes needed? Changes are to fix the SPARK-23519 bug. The queries below would cause an exception. This PR fixes them and also adds a test case. `CREATE TABLE t23519 AS SELECT 1 AS c1 CREATE VIEW v23519 (c1, c2) AS SELECT c1, c1 FROM t23519` Does this PR introduce any user-facing change? No How was this patch tested? new unit test added in SQLViewSuite Closes #25570 from hem1891/SPARK-23519. Lead-authored-by: hemanth meka <hmeka@tibco.com> Co-authored-by: hem1891 <hem1891@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-08-28 12:11:10 +08:00
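The effect of moving the duplication check in SPARK-23519 can be sketched in a few lines. This is an illustration only, not the Spark code: `check_column_name_duplication` here is a hypothetical Python helper that does an exact, case-sensitive comparison, which is a simplifying assumption.

```python
from collections import Counter
from typing import Sequence


def check_column_name_duplication(names: Sequence[str]) -> None:
    """Raise ValueError if any column name appears more than once."""
    duplicates = sorted(name for name, count in Counter(names).items() if count > 1)
    if duplicates:
        raise ValueError("Found duplicate column(s): " + ", ".join(duplicates))


query_output = ["c1", "c1"]  # output of SELECT c1, c1 FROM t23519
view_schema = ["c1", "c2"]   # user-provided columns of v23519

check_column_name_duplication(view_schema)  # passes: the view's own names are unique
try:
    check_column_name_duplication(query_output)  # the check as it was applied before the fix
except ValueError as err:
    print(err)  # Found duplicate column(s): c1
```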
Wenchen Fan	90b10b4f7a	[HOT-FIX] fix compilation This is caused by 2 PRs that were merged at the same time: `cb06209fc9` `2b24a71fec` Closes #25597 from cloud-fan/hot-fix. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-08-27 23:30:44 +08:00
Gengliang Wang	2b24a71fec	[SPARK-28495][SQL] Introduce ANSI store assignment policy for table insertion ### What changes were proposed in this pull request? Introduce ANSI store assignment policy for table insertion. With ANSI policy, Spark performs the type coercion of table insertion as per ANSI SQL. ### Why are the changes needed? In Spark version 2.4 and earlier, when inserting into a table, Spark will cast the data type of input query to the data type of target table by coercion. This can be super confusing, e.g. users make a mistake and write string values to an int column. In data source V2, by default, only upcasting is allowed when inserting data into a table. E.g. int -> long and int -> string are allowed, while decimal -> double or long -> int are not allowed. The rules of UpCast were originally created for Dataset type coercion. They are quite strict and different from the behavior of all existing popular DBMS. This is a breaking change. It is possible that existing queries are broken after the 3.0 release. Following the ANSI SQL standard makes Spark consistent with the table insertion behaviors of popular DBMS like PostgreSQL/Oracle/Mysql. ### Does this PR introduce any user-facing change? A new optional mode for table insertion. ### How was this patch tested? Unit test Closes #25581 from gengliangwang/ANSImode. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-08-27 22:13:23 +08:00
WeichenXu	7f605f5559	[SPARK-28621][SQL] Make spark.sql.crossJoin.enabled default value true ### What changes were proposed in this pull request? Make `spark.sql.crossJoin.enabled` default value true ### Why are the changes needed? For implicit cross join, we can set up a watchdog to cancel it if running for a long time. When "spark.sql.crossJoin.enabled" is false, because `CheckCartesianProducts` is implemented in the logical plan stage, it may raise mismatched errors that confuse end users: * it's done in the logical phase, so we may fail queries that can be executed via broadcast join, which is very fast. * if we move the check to the physical phase, then a query may succeed at the beginning, and begin to fail when the table size gets larger (other people insert data to the table). This can be quite confusing. * the CROSS JOIN syntax doesn't work well if join reorder happens. * some non-equi-joins generate a plan using cartesian product, but `CheckCartesianProducts` does not detect it and raise an error. So, in order to address this in a simpler way, we can turn off showing this cross-join error by default. For reference, I list some cases raising mismatched errors here: Providing: ``` spark.range(2).createOrReplaceTempView("sm1") // can be broadcast spark.range(50000000).createOrReplaceTempView("bg1") // cannot be broadcast spark.range(60000000).createOrReplaceTempView("bg2") // cannot be broadcast ``` 1) Some joins could be converted to broadcast nested loop join, but CheckCartesianProducts raises an error. e.g. ``` select sm1.id, bg1.id from bg1 join sm1 where sm1.id < bg1.id ``` 2) Some joins will run as CartesianJoin but CheckCartesianProducts DOES NOT raise an error. e.g. ``` select bg1.id, bg2.id from bg1 join bg2 where bg1.id < bg2.id ``` ### Does this PR introduce any user-facing change? ### How was this patch tested? Closes #25520 from WeichenXu123/SPARK-28621. 
Authored-by: WeichenXu <weichen.xu@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-08-27 21:53:37 +08:00
Yuming Wang	e12da8b957	[SPARK-28876][SQL] fallBackToHdfs should not support Hive partitioned table ### What changes were proposed in this pull request? This PR makes `spark.sql.statistics.fallBackToHdfs` not support Hive partitioned tables. ### Why are the changes needed? The current implementation is incorrect for external partitions and it is expensive to support partitioned tables with external partitions. ### Does this PR introduce any user-facing change? Yes. But I think it will not change the join strategy because partitioned tables are usually very large. ### How was this patch tested? unit test Closes #25584 from wangyum/SPARK-28876. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-08-27 21:37:18 +08:00
Yuming Wang	96179732aa	[SPARK-27592][SQL][TEST][FOLLOW-UP] Test set the partitioned bucketed data source table SerDe correctly ### What changes were proposed in this pull request? This PR adds a test verifying that the SerDe of a partitioned bucketed data source table is set correctly. ### Why are the changes needed? Improve test. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? N/A Closes #25591 from wangyum/SPARK-27592-f1. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-08-27 21:10:58 +08:00
Wenchen Fan	cb06209fc9	[SPARK-28747][SQL] merge the two data source v2 fallback configs ## What changes were proposed in this pull request? Currently we have 2 configs to specify which v2 sources should fall back to the v1 code path. One config for the read path, and one config for the write path. However, I found it's awkward to work with these 2 configs: 1. for `CREATE TABLE USING format`, should this be read path or write path? 2. for `V2SessionCatalog.loadTable`, we need to return `UnresolvedTable` if it's a DS v1 or we need to fall back to the v1 code path. However, at that time, we don't know if the returned table will be used for read or write. We don't have any new features or perf improvement in file source v2. The fallback API is just a safeguard if we have bugs in v2 implementations. There are not many benefits to supporting falling back to v1 for the read and write paths separately. This PR proposes to merge these 2 configs into one. ## How was this patch tested? existing tests Closes #25465 from cloud-fan/merge-conf. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-08-27 20:47:24 +08:00
Yuming Wang	ab1819d38a	[SPARK-28527][SQL][TEST][FOLLOW-UP] Ignores Thrift server ThriftServerQueryTestSuite ### What changes were proposed in this pull request? This PR ignores Thrift server `ThriftServerQueryTestSuite`. ### Why are the changes needed? The `ThriftServerQueryTestSuite` test cases led to frequent Jenkins build failures. ### Does this PR introduce any user-facing change? Yes. ### How was this patch tested? N/A Closes #25592 from wangyum/SPARK-28527-f1. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-08-27 15:41:22 +09:00
Burak Yavuz	e31aec9be4	[SPARK-28667][SQL] Support InsertInto through the V2SessionCatalog ### What changes were proposed in this pull request? This PR adds support for INSERT INTO through both the SQL and DataFrameWriter APIs through the V2SessionCatalog. ### Why are the changes needed? This will allow V2 tables to be plugged in through the V2SessionCatalog, and be used seamlessly with existing APIs. ### Does this PR introduce any user-facing change? No behavior changes. ### How was this patch tested? Pulled out a lot of tests so that they can be shared across the DataFrameWriter and SQL code paths. Closes #25507 from brkyvz/insertSesh. Lead-authored-by: Burak Yavuz <brkyvz@gmail.com> Co-authored-by: Burak Yavuz <burak@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-08-27 12:59:53 +08:00
Yuming Wang	6e12b585a9	[SPARK-28527][SQL][TEST] Re-run all the tests in SQLQueryTestSuite via Thrift Server ### What changes were proposed in this pull request? This PR builds a test framework that directly re-runs all the tests in `SQLQueryTestSuite` via Thrift Server. But it's a little different from `SQLQueryTestSuite`: 1. Cannot support [UDF testing](`44e607e921/sql/core/src/test/scala/org/apache/spark/sql/SQLQueryTestSuite.scala (L293-L297)`). 2. Cannot support `DESC` command and `SHOW` command because `SQLQueryTestSuite` [formatted the output](`1882912cca/sql/core/src/main/scala/org/apache/spark/sql/execution/HiveResult.scala (L38-L50)`.). When building this framework, we found two bugs: [SPARK-28624](https://issues.apache.org/jira/browse/SPARK-28624): `make_date` is inconsistent when reading from table [SPARK-28611](https://issues.apache.org/jira/browse/SPARK-28611): Histogram's height is different We also found two features that ThriftServer cannot support: [SPARK-28636](https://issues.apache.org/jira/browse/SPARK-28636): ThriftServer can not support decimal type with negative scale [SPARK-28637](https://issues.apache.org/jira/browse/SPARK-28637): ThriftServer can not support interval type Also, we found two inconsistent behaviors: [SPARK-28620](https://issues.apache.org/jira/browse/SPARK-28620): Double type returned for float type in Beeline/JDBC [SPARK-28619](https://issues.apache.org/jira/browse/SPARK-28619): The golden result file is different when tested by `bin/spark-sql` ### Why are the changes needed? Improve the overall test coverage for Thrift Server. ### Does this PR introduce any user-facing change? No ### How was this patch tested? N/A Closes #25567 from wangyum/SPARK-28527. Lead-authored-by: Yuming Wang <yumwang@ebay.com> Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-08-26 22:39:57 +09:00
Dilip Biswal	c61270fd74	[SPARK-27395][SQL] Improve EXPLAIN command ## What changes were proposed in this pull request? This PR aims at improving the way physical plans are explained in Spark. Currently, the explain output for a physical plan may look very cluttered, and each operator's string representation can be very wide and wrap around in the display, making it a little hard to follow. This especially happens when explaining a query that 1) operates on wide tables or 2) has complex expressions. This PR attempts to split the output into two sections. In the header section, we display the basic operator tree with a number associated with each operator. In this section, we strictly control what we output for each operator. In the footer section, each operator is verbosely displayed. Based on the feedback from Maryann, the uncorrelated subqueries (SubqueryExecs) are not included in the main plan. They are printed separately after the main plan and can be correlated by the originating expression id from its parent plan. To illustrate, here is a simple plan displayed in the old vs. the new way. 
Example query1 : ``` EXPLAIN SELECT key, Max(val) FROM explain_temp1 WHERE key > 0 GROUP BY key HAVING max(val) > 0 ``` Old : ``` (2) Project [key#2, max(val)#15] +- (2) Filter (isnotnull(max(val#3)#18) AND (max(val#3)#18 > 0)) +- (2) HashAggregate(keys=[key#2], functions=[max(val#3)], output=[key#2, max(val)#15, max(val#3)#18]) +- Exchange hashpartitioning(key#2, 200) +- (1) HashAggregate(keys=[key#2], functions=[partial_max(val#3)], output=[key#2, max#21]) +- (1) Project [key#2, val#3] +- (1) Filter (isnotnull(key#2) AND (key#2 > 0)) +- (1) FileScan parquet default.explain_temp1[key#2,val#3] Batched: true, DataFilters: [isnotnull(key#2), (key#2 > 0)], Format: Parquet, Location: InMemoryFileIndex[file:/user/hive/warehouse/explain_temp1], PartitionFilters: [], PushedFilters: [IsNotNull(key), GreaterThan(key,0)], ReadSchema: struct<key:int,val:int> ``` New : ``` Project (8) +- Filter (7) +- HashAggregate (6) +- Exchange (5) +- HashAggregate (4) +- Project (3) +- Filter (2) +- Scan parquet default.explain_temp1 (1) (1) Scan parquet default.explain_temp1 [codegen id : 1] Output: [key#2, val#3] (2) Filter [codegen id : 1] Input : [key#2, val#3] Condition : (isnotnull(key#2) AND (key#2 > 0)) (3) Project [codegen id : 1] Output : [key#2, val#3] Input : [key#2, val#3] (4) HashAggregate [codegen id : 1] Input: [key#2, val#3] (5) Exchange Input: [key#2, max#11] (6) HashAggregate [codegen id : 2] Input: [key#2, max#11] (7) Filter [codegen id : 2] Input : [key#2, max(val)#5, max(val#3)#8] Condition : (isnotnull(max(val#3)#8) AND (max(val#3)#8 > 0)) (8) Project [codegen id : 2] Output : [key#2, max(val)#5] Input : [key#2, max(val)#5, max(val#3)#8] ``` Example Query2 (subquery): ``` SELECT FROM explain_temp1 WHERE KEY = (SELECT Max(KEY) FROM explain_temp2 WHERE KEY = (SELECT Max(KEY) FROM explain_temp3 WHERE val > 0) AND val = 2) AND val > 3 ``` Old: ``` (1) Project [key#2, val#3] +- (1) Filter (((isnotnull(KEY#2) AND isnotnull(val#3)) AND (KEY#2 = Subquery scalar-subquery#39)) 
AND (val#3 > 3)) : +- Subquery scalar-subquery#39 : +- (2) HashAggregate(keys=[], functions=[max(KEY#26)], output=[max(KEY)#45]) : +- Exchange SinglePartition : +- (1) HashAggregate(keys=[], functions=[partial_max(KEY#26)], output=[max#47]) : +- (1) Project [key#26] : +- (1) Filter (((isnotnull(KEY#26) AND isnotnull(val#27)) AND (KEY#26 = Subquery scalar-subquery#38)) AND (val#27 = 2)) : : +- Subquery scalar-subquery#38 : : +- (2) HashAggregate(keys=[], functions=[max(KEY#28)], output=[max(KEY)#43]) : : +- Exchange SinglePartition : : +- (1) HashAggregate(keys=[], functions=[partial_max(KEY#28)], output=[max#49]) : : +- (1) Project [key#28] : : +- (1) Filter (isnotnull(val#29) AND (val#29 > 0)) : : +- (1) FileScan parquet default.explain_temp3[key#28,val#29] Batched: true, DataFilters: [isnotnull(val#29), (val#29 > 0)], Format: Parquet, Location: InMemoryFileIndex[file:/user/hive/warehouse/explain_temp3], PartitionFilters: [], PushedFilters: [IsNotNull(val), GreaterThan(val,0)], ReadSchema: struct<key:int,val:int> : +- (1) FileScan parquet default.explain_temp2[key#26,val#27] Batched: true, DataFilters: [isnotnull(key#26), isnotnull(val#27), (val#27 = 2)], Format: Parquet, Location: InMemoryFileIndex[file:/user/hive/warehouse/explain_temp2], PartitionFilters: [], PushedFilters: [IsNotNull(key), IsNotNull(val), EqualTo(val,2)], ReadSchema: struct<key:int,val:int> +- *(1) FileScan parquet default.explain_temp1[key#2,val#3] Batched: true, DataFilters: [isnotnull(key#2), isnotnull(val#3), (val#3 > 3)], Format: Parquet, Location: InMemoryFileIndex[file:/user/hive/warehouse/explain_temp1], PartitionFilters: [], PushedFilters: [IsNotNull(key), IsNotNull(val), GreaterThan(val,3)], ReadSchema: struct<key:int,val:int> ``` New: ``` Project (3) +- Filter (2) +- Scan parquet default.explain_temp1 (1) (1) Scan parquet default.explain_temp1 [codegen id : 1] Output: [key#2, val#3] (2) Filter [codegen id : 1] Input : [key#2, val#3] Condition : (((isnotnull(KEY#2) AND 
isnotnull(val#3)) AND (KEY#2 = Subquery scalar-subquery#23)) AND (val#3 > 3)) (3) Project [codegen id : 1] Output : [key#2, val#3] Input : [key#2, val#3] ===== Subqueries ===== Subquery:1 Hosting operator id = 2 Hosting Expression = Subquery scalar-subquery#23 HashAggregate (9) +- Exchange (8) +- HashAggregate (7) +- Project (6) +- Filter (5) +- Scan parquet default.explain_temp2 (4) (4) Scan parquet default.explain_temp2 [codegen id : 1] Output: [key#26, val#27] (5) Filter [codegen id : 1] Input : [key#26, val#27] Condition : (((isnotnull(KEY#26) AND isnotnull(val#27)) AND (KEY#26 = Subquery scalar-subquery#22)) AND (val#27 = 2)) (6) Project [codegen id : 1] Output : [key#26] Input : [key#26, val#27] (7) HashAggregate [codegen id : 1] Input: [key#26] (8) Exchange Input: [max#35] (9) HashAggregate [codegen id : 2] Input: [max#35] Subquery:2 Hosting operator id = 5 Hosting Expression = Subquery scalar-subquery#22 HashAggregate (15) +- Exchange (14) +- HashAggregate (13) +- Project (12) +- Filter (11) +- Scan parquet default.explain_temp3 (10) (10) Scan parquet default.explain_temp3 [codegen id : 1] Output: [key#28, val#29] (11) Filter [codegen id : 1] Input : [key#28, val#29] Condition : (isnotnull(val#29) AND (val#29 > 0)) (12) Project [codegen id : 1] Output : [key#28] Input : [key#28, val#29] (13) HashAggregate [codegen id : 1] Input: [key#28] (14) Exchange Input: [max#37] (15) HashAggregate [codegen id : 2] Input: [max#37] ``` Note: I opened this PR as a WIP to start getting feedback. I will be on vacation starting tomorrow would not be able to immediately incorporate the feedback. I will start to work on them as soon as i can. Also, currently this PR provides a basic infrastructure for explain enhancement. The details about individual operators will be implemented in follow-up prs ## How was this patch tested? Added a new test `explain.sql` that tests basic scenarios. Need to add more tests. Closes #24759 from dilipbiswal/explain_feature. 
Authored-by: Dilip Biswal <dbiswal@us.ibm.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-08-26 20:37:13 +08:00
Yuming Wang	c353a84d1a	[SPARK-28642][SQL][TEST][FOLLOW-UP] Test spark.sql.redaction.options.regex with and without default values ### What changes were proposed in this pull request? Test `spark.sql.redaction.options.regex` with and without default values. ### Why are the changes needed? Normally, we do not rely on the default value of `spark.sql.redaction.options.regex`. ### Does this PR introduce any user-facing change? No ### How was this patch tested? N/A Closes #25579 from wangyum/SPARK-28642-f1. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Xiao Li <gatorsmile@gmail.com>	2019-08-25 23:12:16 -07:00
Yuming Wang	adb506afd7	[SPARK-28852][SQL] Implement SparkGetCatalogsOperation for Thrift Server ### What changes were proposed in this pull request? This PR implements `SparkGetCatalogsOperation` for Thrift Server metadata completeness. ### Why are the changes needed? Thrift Server metadata completeness. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Unit test Closes #25555 from wangyum/SPARK-28852. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Xiao Li <gatorsmile@gmail.com>	2019-08-25 22:42:50 -07:00
Terry Kim	a3328cdc0a	[SPARK-28238][SQL][FOLLOW-UP] Clean up attributes for Datasource v2 DESCRIBE TABLE ### What changes were proposed in this pull request? 1. Fix the physical plan (`DescribeTableExec`) to have the same output attributes as the corresponding logical plan. 2. Remove `output` in statements since they are unresolved plans. ### Why are the changes needed? Correctness of how output attributes should work. ### Does this PR introduce any user-facing change? NO ### How was this patch tested? Existing tests Closes #25568 from imback82/describe_table. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-08-26 13:39:36 +08:00
Yuming Wang	4b16cf11b3	[SPARK-27988][SQL][TEST] Port AGGREGATES.sql [Part 3] ## What changes were proposed in this pull request? This PR is to port AGGREGATES.sql from PostgreSQL regression tests. https://github.com/postgres/postgres/blob/REL_12_BETA2/src/test/regress/sql/aggregates.sql#L352-L605 The expected results can be found in the link: https://github.com/postgres/postgres/blob/REL_12_BETA2/src/test/regress/expected/aggregates.out#L986-L1613 When porting the test cases, found seven PostgreSQL specific features that do not exist in Spark SQL: [SPARK-27974](https://issues.apache.org/jira/browse/SPARK-27974): Add built-in Aggregate Function: array_agg [SPARK-27978](https://issues.apache.org/jira/browse/SPARK-27978): Add built-in Aggregate Functions: string_agg [SPARK-27986](https://issues.apache.org/jira/browse/SPARK-27986): Support Aggregate Expressions with filter [SPARK-27987](https://issues.apache.org/jira/browse/SPARK-27987): Support POSIX Regular Expressions [SPARK-28682](https://issues.apache.org/jira/browse/SPARK-28682): ANSI SQL: Collation Support [SPARK-28768](https://issues.apache.org/jira/browse/SPARK-28768): Implement more text pattern operators [SPARK-28865](https://issues.apache.org/jira/browse/SPARK-28865): Table inheritance ## How was this patch tested? N/A Closes #24829 from wangyum/SPARK-27988. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-08-25 23:34:59 +09:00
Yuming Wang	02a0cdea13	[SPARK-28723][SQL] Upgrade to Hive 2.3.6 for HiveMetastore Client and Hadoop-3.2 profile ### What changes were proposed in this pull request? This PR upgrade the built-in Hive to 2.3.6 for `hadoop-3.2`. Hive 2.3.6 release notes: - [HIVE-22096](https://issues.apache.org/jira/browse/HIVE-22096): Backport [HIVE-21584](https://issues.apache.org/jira/browse/HIVE-21584) (Java 11 preparation: system class loader is not URLClassLoader) - [HIVE-21859](https://issues.apache.org/jira/browse/HIVE-21859): Backport [HIVE-17466](https://issues.apache.org/jira/browse/HIVE-17466) (Metastore API to list unique partition-key-value combinations) - [HIVE-21786](https://issues.apache.org/jira/browse/HIVE-21786): Update repo URLs in poms branch 2.3 version ### Why are the changes needed? Make Spark support JDK 11. ### Does this PR introduce any user-facing change? Yes. Please see [SPARK-28684](https://issues.apache.org/jira/browse/SPARK-28684) and [SPARK-24417](https://issues.apache.org/jira/browse/SPARK-24417) for more details. ### How was this patch tested? Existing unit test and manual test. Closes #25443 from wangyum/test-on-jenkins. Lead-authored-by: Yuming Wang <yumwang@ebay.com> Co-authored-by: HyukjinKwon <gurwls223@apache.org> Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-08-23 21:34:30 -07:00
Xiao Li	07c4b9bd1f	Revert "[SPARK-25474][SQL] Support `spark.sql.statistics.fallBackToHdfs` in data source tables" This reverts commit `485ae6d181`. Closes #25563 from gatorsmile/revert. Authored-by: Xiao Li <gatorsmile@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-08-23 07:41:39 -07:00
Gengliang Wang	8258660f67	[SPARK-28741][SQL] Optional mode: throw exceptions when casting to integers causes overflow ## What changes were proposed in this pull request? To follow ANSI SQL, we should support a configurable mode that throws exceptions when casting to integers causes overflow. The behavior is similar to https://issues.apache.org/jira/browse/SPARK-26218, which throws exceptions on arithmetical operation overflow. To unify it, the configuration is renamed from "spark.sql.arithmeticOperations.failOnOverFlow" to "spark.sql.failOnIntegerOverFlow" ## How was this patch tested? Unit test Closes #25461 from gengliangwang/AnsiCastIntegral. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-08-23 21:49:45 +08:00
Ali Afroozeh	1472e664ba	[SPARK-28716][SQL] Add id to Exchange and Subquery's stringArgs method for easier identifying their reuses in query plans ## What changes were proposed in this pull request? Add id to Exchange and Subquery's stringArgs method for easier identifying their reuses in query plans, for example: ``` ReusedExchange d_date_sk#827, BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint))) [id=#2710] ``` Where `2710` is the id of the reused exchange. ## How was this patch tested? Passes existing tests Closes #25434 from dbaliafroozeh/ImplementStringArgsExchangeSubqueryExec. Authored-by: Ali Afroozeh <ali.afroozeh@databricks.com> Signed-off-by: herman <herman@databricks.com>	2019-08-23 13:29:32 +02:00
Ali Afroozeh	aef7ca1f0b	[SPARK-28836][SQL] Remove the canonicalize(attributes) method from PlanExpression ### What changes were proposed in this pull request? This PR removes the `canonicalize(attrs: AttributeSeq)` from `PlanExpression` and taking care of normalizing expressions in `QueryPlan`. ### Why are the changes needed? `Expression` has already a `canonicalized` method and having the `canonicalize` method in `PlanExpression` is confusing. ### Does this PR introduce any user-facing change? Removes the `canonicalize` plan from `PlanExpression`. Also renames the `normalizeExprId` to `normalizeExpressions` in query plan. ### How was this patch tested? This PR is a refactoring and passes the existing tests Closes #25534 from dbaliafroozeh/ImproveCanonicalizeAPI. Authored-by: Ali Afroozeh <ali.afroozeh@databricks.com> Signed-off-by: herman <herman@databricks.com>	2019-08-23 13:26:58 +02:00
terryk	98e1a4cea4	[SPARK-28319][SQL] Implement SHOW TABLES for Data Source V2 Tables ## What changes were proposed in this pull request? Implements the SHOW TABLES logical and physical plans for data source v2 tables. ## How was this patch tested? Added unit tests to `DataSourceV2SQLSuite`. Closes #25247 from imback82/dsv2_show_tables. Lead-authored-by: terryk <yuminkim@gmail.com> Co-authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-08-23 14:20:25 +08:00
Ali Afroozeh	9976b876f1	[SPARK-28835][SQL][TEST] Add TPCDSSchema trait ### What changes were proposed in this pull request? This PR extracts the schema information of TPCDS tables into a separate class called `TPCDSSchema` which can be reused for other testing purposes ### How was this patch tested? This PR is only a refactoring for tests and passes existing tests Closes #25535 from dbaliafroozeh/IntroduceTPCDSSchema. Authored-by: Ali Afroozeh <ali.afroozeh@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-08-22 23:18:46 -07:00
Jungtaek Lim (HeartSaVioR)	406c5331ff	[SPARK-28025][SS] Fix FileContextBasedCheckpointFileManager leaking crc files ### What changes were proposed in this pull request? This PR fixes the leak of crc files from CheckpointFileManager when FileContextBasedCheckpointFileManager is being used. Spark hits the Hadoop bug [HADOOP-16255](https://issues.apache.org/jira/browse/HADOOP-16255), which seems to be a long-standing issue. The cause is that there are two `renameInternal` methods: ``` public void renameInternal(Path src, Path dst) public void renameInternal(final Path src, final Path dst, boolean overwrite) ``` Both should be overridden to handle all cases, but ChecksumFs only overrides the method with 2 params, so when the latter is called, FilterFs.renameInternal(...) is called instead, and it does the rename with RawLocalFs as the underlying filesystem. The bug is related to FileContext, so FileSystemBasedCheckpointFileManager is not affected. [SPARK-17475](https://issues.apache.org/jira/browse/SPARK-17475) took a workaround for this bug, but [SPARK-23966](https://issues.apache.org/jira/browse/SPARK-23966) seemed to bring a regression. This PR deletes the crc file as "best-effort" when renaming, as failing to delete a crc file is not critical enough to fail the task. ### Why are the changes needed? This PR prevents crc files from being left behind even after batches are purged. Too many files in the same directory often hurt performance, and each crc file occupies more space than its own size, so they can occupy a nontrivial amount of space when batches go up to 100000+. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Some unit tests are modified to check leakage of crc files. Closes #25488 from HeartSaVioR/SPARK-28025. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com> Signed-off-by: Shixiong Zhu <zsxwing@gmail.com>	2019-08-22 23:10:16 -07:00
Gengliang Wang	895c90b582	[SPARK-28730][SQL] Configurable type coercion policy for table insertion ## What changes were proposed in this pull request? After all the discussions in the dev list: http://apache-spark-developers-list.1001551.n3.nabble.com/Discuss-Follow-ANSI-SQL-on-table-insertion-td27531.html#a27562. Here I propose that we can make the store assignment rules in the analyzer configurable, and the behavior of V1 and V2 should be consistent. When inserting a value into a column with a different data type, Spark will perform type coercion. After this PR, we support 2 policies for the type coercion rules: legacy and strict. 1. With legacy policy, Spark allows casting any value to any data type. The legacy policy is the only behavior in Spark 2.x and it is compatible with Hive. 2. With strict policy, Spark doesn't allow any possible precision loss or data truncation in type coercion, e.g. `int` and `long`, `float` -> `double` are not allowed. Eventually, the "legacy" mode will be removed, so it is disallowed in data source V2. To ensure backward compatibility with existing queries, the default store assignment policy for data source V1 is "legacy". ## How was this patch tested? Unit test Closes #25453 from gengliangwang/tableInsertRule. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-08-23 13:50:26 +08:00
shivusondur	23bed0d3c0	[SPARK-28702][SQL] Display useful error message (instead of NPE) for invalid Dataset operations ### What changes were proposed in this pull request? Added proper message instead of NPE for invalid Dataset operations (e.g. calling actions inside of transformations) similar to SPARK-5063 for RDD ### Why are the changes needed? To report the user about the exact issue instead of NPE ### Does this PR introduce any user-facing change? No ### How was this patch tested? Manually tested ```scala test code snap "import spark.implicits._ val ds1 = spark.sparkContext.parallelize(1 to 100, 100).toDS() val ds2 = spark.sparkContext.parallelize(1 to 100, 100).toDS() ds1.map(x => { // scalastyle:off println(ds2.count + x) x }).collect()" ``` Closes #25503 from shivusondur/jira28702. Authored-by: shivusondur <shivusondur@gmail.com> Signed-off-by: Josh Rosen <rosenville@gmail.com>	2019-08-22 22:15:37 -07:00
Dongjoon Hyun	36da2e3384	[SPARK-28847][TEST] Annotate HiveExternalCatalogVersionsSuite with ExtendedHiveTest ### What changes were proposed in this pull request? This PR aims to annotate `HiveExternalCatalogVersionsSuite` with `ExtendedHiveTest`. ### Why are the changes needed? `HiveExternalCatalogVersionsSuite` is an outstanding test in terms of testing time. This PR aims to allow skipping this test suite when we use `ExtendedHiveTest`. ![time](https://user-images.githubusercontent.com/9700541/63489184-4c75af00-c466-11e9-9e12-d250d4a23292.png) ### Does this PR introduce any user-facing change? No ### How was this patch tested? Since Jenkins doesn't exclude `ExtendedHiveTest`, there is no difference in Jenkins testing. This PR should be tested by manually by the following. BEFORE ``` $ cd sql/hive $ mvn package -Dtest=none -DwildcardSuites=org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite -Dtest.exclude.tags=org.apache.spark.tags.ExtendedHiveTest ... Run starting. Expected test count is: 1 HiveExternalCatalogVersionsSuite: 22:32:16.218 WARN org.apache.hadoop.util.NativeCodeLoader: Unable to load ... ``` AFTER ``` $ cd sql/hive $ mvn package -Dtest=none -DwildcardSuites=org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite -Dtest.exclude.tags=org.apache.spark.tags.ExtendedHiveTest ... Run starting. Expected test count is: 0 HiveExternalCatalogVersionsSuite: Run completed in 772 milliseconds. Total number of tests run: 0 Suites: completed 2, aborted 0 Tests: succeeded 0, failed 0, canceled 0, ignored 0, pending 0 No tests were executed. ... ``` Closes #25550 from dongjoon-hyun/SPARK-28847. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-08-22 00:25:56 -07:00
triplesheep	48578a41b5	[SPARK-28844][SQL] Fix typo in SQLConf FILE_COMRESSION_FACTOR ### What changes were proposed in this pull request? Fix minor typo in SQLConf. `FILE_COMRESSION_FACTOR` -> `FILE_COMPRESSION_FACTOR` ### Why are the changes needed? Make conf more understandable. ### Does this PR introduce any user-facing change? No. (`spark.sql.sources.fileCompressionFactor` is unchanged.) ### How was this patch tested? Pass the Jenkins with the existing tests. Closes #25538 from triplesheep/TYPO-FIX. Authored-by: triplesheep <triplesheep0419@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-08-22 00:07:40 -07:00
maryannxue	aefb2e70e7	[SPARK-28739][SQL] Add a simple cost check for Adaptive Query Execution ### What changes were proposed in this pull request? This PR adds a simple cost model and a mechanism to compare the costs of the before and after plans of each re-optimization in Adaptive Query Execution. Now the workflow of AQE re-optimization is changed to: If the cost of the plan after re-optimization is lower than or equal to that of the plan before re-optimization and the plan has been changed after re-optimization (if equal), the current physical plan will be updated to the plan after re-optimization, otherwise it will remain unchanged until the next re-optimization. ### Why are the changes needed? This new mechanism is to prevent regressions in Adaptive Query Execution caused by change of the plan introducing extra cost, in this PR specifically, change of SMJ to BHJ leading to extra `ShuffleExchangeExec`s. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Added UT. Closes #25456 from maryannxue/aqe-cost. Authored-by: maryannxue <maryannxue@apache.org> Signed-off-by: gatorsmile <gatorsmile@gmail.com>	2019-08-21 19:33:56 -07:00
Wenchen Fan	ed3ea6734c	[SPARK-28837][SQL] CTAS/RTAS should use nullable schema ### What changes were proposed in this pull request? When running CTAS/RTAS, use the nullable schema of the input query to create the table. ### Why are the changes needed? It is very common to run CTAS/RTAS with a non-nullable input query, e.g. `CREATE TABLE t AS SELECT 1`. However, it is surprising to users if they can't write null to this table later. Non-nullable is a kind of column constraint and should be specified by users explicitly.
For reference, Postgres also uses a nullable schema for CTAS: ``` > create table t1(i int not null); > insert into t1 values (1); > create table t2 as select i from t1; > \d+ t1; Column \| Type \| Collation \| Nullable \| Default \| Storage \| Stats target \| Description --------+---------+-----------+----------+---------+---------+--------------+------------- i \| integer \| \| not null \| \| plain \| \| > \d+ t2; Column \| Type \| Collation \| Nullable \| Default \| Storage \| Stats target \| Description --------+---------+-----------+----------+---------+---------+--------------+------------- i \| integer \| \| \| \| plain \| \| ``` File source V1 has the same behavior. ### Does this PR introduce any user-facing change? Yes, after this PR CTAS/RTAS creates tables with a nullable schema, so users can insert null values later. ### How was this patch tested? new test Closes #25536 from cloud-fan/ctas. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-08-22 09:49:18 +08:00
Wenchen Fan	97b046f06f	[SPARK-28635][SQL][FOLLOWUP] CatalogManager should reflect the changes of default catalog ### What changes were proposed in this pull request? The current namespace/catalog should be set to None at the beginning, so that we can read the new configs when reporting the current namespace/catalog later. ### Why are the changes needed? Fix a bug in CatalogManager, to reflect the change of default catalog config when reporting the current catalog. ### Does this PR introduce any user-facing change?
No. The current namespace/catalog stuff is still internal right now. ### How was this patch tested? a new test suite Closes #25521 from cloud-fan/fix. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Burak Yavuz <brkyvz@gmail.com>	2019-08-21 12:23:42 -07:00
Yuanjian Li	2d9cc42aa8	[SPARK-28699][SQL] Disable using radix sort for ShuffleExchangeExec in repartition case ## What changes were proposed in this pull request? Disable using radix sort in ShuffleExchangeExec when we do repartition. In #20393, we fixed the indeterministic result in the shuffle repartition case by performing a local sort before repartitioning. But for the newly added sort operation, we use radix sort which is wrong because binary data can't be compared by only the prefix. This makes the sort unstable and fails to solve the indeterminate shuffle output problem. ### Why are the changes needed? Fix the correctness bug caused by repartition after a shuffle. ### Does this PR introduce any user-facing change? Yes, user will get the right result in the case of repartition stage rerun. ## How was this patch tested? Test with `local-cluster[5, 2, 5120]`, use the integrated test below, it can return a right answer 100000000. ``` import scala.sys.process._ import org.apache.spark.TaskContext val res = spark.range(0, 10000 * 10000, 1).map{ x => (x % 1000, x)} // kill an executor in the stage that performs repartition(239) val df = res.repartition(113).map{ x => (x._1 + 1, x._2)}.repartition(239).map { x => if (TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId < 1 && TaskContext.get.stageAttemptNumber == 0) { throw new Exception("pkill -f -n java".!!) } x } val r2 = df.distinct.count() ``` Closes #25491 from xuanyuanking/SPARK-28699-fix. Authored-by: Yuanjian Li <xyliyuanjian@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-08-21 10:56:50 -07:00
Ali Afroozeh	4dc3093513	[SPARK-28715][SQL] Introduce collectInPlanAndSubqueries and subqueriesAll in QueryPlan ## What changes were proposed in this pull request? Introduces the collectInPlanAndSubqueries and subqueriesAll methods in QueryPlan that consider all the plans in the query plan, including the ones in nested subqueries. ## How was this patch tested? Unit test added Closes #25433 from dbaliafroozeh/IntroduceCollectInPlanAndSubqueries. Authored-by: Ali Afroozeh <ali.afroozeh@databricks.com> Signed-off-by: herman <herman@databricks.com>	2019-08-21 18:05:18 +02:00
Robert (Bobby) Evans	fac469e2e0	[SPARK-28774][SQL] Fix exchange reuse for columnar data ### What changes were proposed in this pull request? The ReuseExchange optimization rule looks for instances of Exchange that have the same plan and deduplicates them by converting them to a ReuseExchangeExec instance. In the current Spark codebase all Exchange instances are row based, but if we use the spark.sql.extensions config to put in our own columnar based exchange implementation, reuse will throw an exception saying that there was a columnar mismatch. ### Why are the changes needed? Without it, reused columnar exchanges throw an exception. ### Does this PR introduce any user-facing change? No ### How was this patch tested? I tested this patch by running it against a query that was showing this exact issue and it fixed it. I also added a very simple unit test that shows the issue. Closes #25499 from revans2/reused-columnar-exchange. Authored-by: Robert (Bobby) Evans <bobby@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-08-21 18:10:26 +08:00
Burak Yavuz	4855bfe16b	[SPARK-28554][SQL] Adds a v1 fallback writer implementation for v2 data source codepaths ## What changes were proposed in this pull request? This PR adds a V1 fallback interface for writing to V2 Tables using V1 Writer interfaces. The only supported SaveMode that will be called on the target table will be an Append. The target table must use V2 interfaces such as `SupportsOverwrite` or `SupportsTruncate` to support Overwrite operations. It is up to the target DataSource implementation if this operation can be atomic or not. We do not support dynamicPartitionOverwrite, as we cannot call a `commit` method that actually cleans up the data in the partitions that were touched through this fallback. ## How was this patch tested? Will add tests and example implementation after comments + feedback. This is a proposal at this point. Closes #25348 from brkyvz/v1WriteFallback. Lead-authored-by: Burak Yavuz <brkyvz@gmail.com> Co-authored-by: Burak Yavuz <burak@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-08-21 17:25:25 +08:00
Marco Gaido	0bfcf9c210	[SPARK-28322][SQL] Add support to Decimal type for integral divide ## What changes were proposed in this pull request? The expression `IntegralDivide`, which corresponds to the `div` operator, supports only integral types. Postgres, though, allows it to work with decimals as well. This PR adds support for decimal operands to this operation in order to have feature parity with Postgres. ## How was this patch tested? Added UTs Closes #25136 from mgaido91/SPARK-28322. Authored-by: Marco Gaido <marcogaido91@gmail.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2019-08-21 08:43:00 +09:00
maryannxue	39c11273e0	[SPARK-28753][SQL] Dynamically reuse subqueries in AQE ### What changes were proposed in this pull request? This PR changes subquery reuse in Adaptive Query Execution from compile-time static reuse to execution-time dynamic reuse. This PR adds a `ReuseAdaptiveSubquery` rule that applies to a query stage after it is created and before it is executed. The new dynamic reuse enables subqueries to be reused across all different subquery levels. ### Why are the changes needed? This is an improvement to the current subquery reuse in Adaptive Query Execution, which allows subquery reuse to happen in a lazy fashion as well as at different subquery levels. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Passed existing tests. Closes #25471 from maryannxue/aqe-dynamic-sub-reuse. Authored-by: maryannxue <maryannxue@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-08-20 19:58:29 +08:00
Wenchen Fan	d04522187a	[SPARK-28635][SQL] create CatalogManager to track registered v2 catalogs ## What changes were proposed in this pull request? This is a pure refactor PR, which creates a new class `CatalogManager` to track the registered v2 catalogs and provide the catalog lookup functionality. `CatalogManager` also tracks the current catalog/namespace. We will implement corresponding commands in other PRs, like `USE CATALOG my_catalog` ## How was this patch tested? existing tests Closes #25368 from cloud-fan/refactor. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-08-20 19:40:21 +08:00
Jungtaek Lim (HeartSaVioR)	b37c8d5cea	[SPARK-28650][SS][DOC] Correct explanation of guarantee for ForeachWriter ## What changes were proposed in this pull request? This patch modifies the explanation of the guarantee for ForeachWriter, as it doesn't guarantee the same output for `(partitionId, epochId)`. Refer to the description of [SPARK-28650](https://issues.apache.org/jira/browse/SPARK-28650) for more details. Spark itself still guarantees the same output for the same epochId (batch) if the preconditions are met: 1) the source always provides the same input records for the same offset request. 2) the query is idempotent overall (nondeterministic calculations like now(), random() can break this). Treating broken preconditions as an exceptional case (the preconditions were implicitly required even before), we can still describe the guarantee with `epochId`, though it will be harder to leverage the guarantee: 1) ForeachWriter should implement a feature to track whether all the partitions are written successfully for a given `epochId` 2) There is little chance to leverage the fact, as the chance for Spark to successfully write all partitions and then fail to checkpoint the batch is small. Credit to zsxwing on discovering the broken guarantee. ## How was this patch tested? This is just a documentation change, both on javadoc and guide doc. Closes #25407 from HeartSaVioR/SPARK-28650. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com> Signed-off-by: Shixiong Zhu <zsxwing@gmail.com>	2019-08-20 00:56:53 -07:00
lihao	79464bed2f	[SPARK-28662][SQL] Create Hive Partitioned Table DDL should fail when partition column type missed ## What changes were proposed in this pull request? Creating a Hive partitioned table without specifying the data type of a partition column succeeds unexpectedly. ```HiveQL // create a hive table partition by b, but the data type of b isn't specified. CREATE TABLE tbl(a int) PARTITIONED BY (b) STORED AS parquet ``` In https://issues.apache.org/jira/browse/SPARK-26435 , the PARTITIONED BY clause was extended to support Hive CTAS as follows: ```ANTLR // Before (PARTITIONED BY '(' partitionColumns=colTypeList ')' // After (PARTITIONED BY '(' partitionColumns=colTypeList ')'\| PARTITIONED BY partitionColumnNames=identifierList) \| ``` A CREATE TABLE statement like the one above passes the syntax check and is recognized as (PARTITIONED BY partitionColumnNames=identifierList). This PR checks for this case in visitCreateHiveTable and throws an exception with an explicit error message for the user. ## How was this patch tested? Added tests. Closes #25390 from lidinghao/hive-ddl-fix. Authored-by: lihao <lihaowhu@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-08-20 14:37:04 +08:00
Sean Owen	3b4e345fa1	[SPARK-28775][CORE][TESTS] Skip date 8633 in Kwajalein due to changes in tzdata2018i that only some JDK 8s use ### What changes were proposed in this pull request? Some newer JDKs use the tzdata2018i database, which changes how certain (obscure) historical dates and timezones are handled. As previously, we can pretty much safely ignore these in tests, as the value may vary by JDK. ### Why are the changes needed? Test otherwise fails using, for example, JDK 1.8.0_222. https://bugs.openjdk.java.net/browse/JDK-8215982 has a full list of JDKs which has this. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Existing tests Closes #25504 from srowen/SPARK-28775. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-08-19 17:54:25 -07:00
Mick Jermsurawong	b79cf0d143	[SPARK-28224][SQL] Check overflow in decimal Sum aggregate ## What changes were proposed in this pull request? - Currently `sum` in aggregates for decimal type can overflow and return null. - `Sum` expression codegens arithmetic on `sql.Decimal` and the output which preserves scale and precision goes into `UnsafeRowWriter`. Here overflowing will be converted to null when writing out. - It also does not go through this branch in `DecimalAggregates` because it's expecting precision of the sum (not the elements to be summed) to be less than 5. `4ebff5b6d6/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/optimizer/Optimizer.scala (L1400-L1403)` - This PR adds the check at the final result of the sum operator itself. `4ebff5b6d6/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregate/interfaces.scala (L372-L376)` https://issues.apache.org/jira/browse/SPARK-28224 ## How was this patch tested? - Added an integration test on dataframe suite cc mgaido91 JoshRosen Closes #25033 from mickjermsurawong-stripe/SPARK-28224. Authored-by: Mick Jermsurawong <mickjermsurawong@stripe.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2019-08-20 09:47:04 +09:00
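As a rough illustration of the SPARK-28224 entry above, the following Python sketch shows what it means for a decimal sum to overflow its declared precision, and where a check on the final result would sit. It is not Spark's code: `checked_decimal_sum`, its parameters, and the choice to return `None` on overflow are assumptions made for this example only.

```python
from decimal import Decimal, localcontext


def checked_decimal_sum(values, precision, scale):
    """Sum Decimals exactly, then check the total against DECIMAL(precision, scale).

    Hypothetical helper for illustration; returns None when the total
    does not fit, mimicking an overflow that surfaces as null.
    """
    with localcontext() as ctx:
        # Use a wide working precision so the additions themselves never round.
        ctx.prec = 100
        total = sum(values, Decimal(0))
        # Bring the total to the declared scale (digits after the decimal point).
        total = total.quantize(Decimal(1).scaleb(-scale))
    # DECIMAL(precision, scale) leaves (precision - scale) digits before the point.
    limit = Decimal(10) ** (precision - scale)
    # The check happens once, on the final result, not on each element.
    return total if abs(total) < limit else None


if __name__ == "__main__":
    # Fits in DECIMAL(7, 2): prints 3.30
    print(checked_decimal_sum([Decimal("1.10"), Decimal("2.20")], precision=7, scale=2))
    # 199999.98 needs 8 digits, so it does not fit DECIMAL(7, 2): prints None
    print(checked_decimal_sum([Decimal("99999.99")] * 2, precision=7, scale=2))
```

Each input in the second call fits `DECIMAL(7, 2)` on its own; only the total does not, which is why the sketch validates the final sum rather than the individual elements.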


8282 commits