ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
yangjie01	43f355b5f2	[SPARK-34597][SQL] Replaces `ParquetFileReader.readFooter` with `ParquetFileReader.open and getFooter` ### What changes were proposed in this pull request? `ParquetFileReader.readFooter` related methods has been identified as `Deprecated` and `Apache Parquet` suggests replace it with the combination of `ParquetFileReader.open() and getFooter()` methods. This PR introduces the `ParquetFooterReader` utility class due to some repetitive code patterns when read parquet file footer. ### Why are the changes needed? Cleanup deprecated API usage. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass the Jenkins or GitHub Action Closes #31711 from LuciferYang/parquet-read-footer. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2021-03-07 23:38:40 -08:00
Angerszhuuuu	401e270c17	[SPARK-34567][SQL] CreateTableAsSelect should update metrics too ### What changes were proposed in this pull request? For command `CreateTableAsSelect` we use `InsertIntoHiveTable`, `InsertIntoHadoopFsRelationCommand` to insert data. We will update metrics of `InsertIntoHiveTable`, `InsertIntoHadoopFsRelationCommand` in `FileFormatWriter.write()`, but we only show CreateTableAsSelectCommand in WebUI SQL Tab. We need to update `CreateTableAsSelectCommand`'s metrics too. Before this PR: ![image](https://user-images.githubusercontent.com/46485123/109411226-81f44480-79db-11eb-99cb-b9686b15bf61.png) After this PR: ![image](https://user-images.githubusercontent.com/46485123/109411232-8ae51600-79db-11eb-9111-3bea0bc2d475.png) ![image](https://user-images.githubusercontent.com/46485123/109905192-62aa2f80-7cd9-11eb-91f9-04b16c9238ae.png) ### Why are the changes needed? Complete SQL Metrics ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? <!-- MT Closes #31679 from AngersZhuuuu/SPARK-34567. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-03-04 20:42:47 +08:00
Angerszhuuuu	db627107b7	[SPARK-34577][SQL] Fix drop/add columns to a dataset of `DESCRIBE NAMESPACE` ### What changes were proposed in this pull request? In the PR, I propose to generate "stable" output attributes per the logical node of the DESCRIBE NAMESPACE command. ### Why are the changes needed? This fixes the issue demonstrated by the example: ``` sql(s"CREATE NAMESPACE ns") val description = sql(s"DESCRIBE NAMESPACE ns") description.drop("name") ``` ``` [info] org.apache.spark.sql.AnalysisException: Resolved attribute(s) name#74 missing from name#25,value#26 in operator !Project [name#74]. Attribute(s) with the same name appear in the operation: name. Please check if the right attribute(s) are used.; [info] !Project [name#74] [info] +- LocalRelation [name#25, value#26] ``` ### Does this PR introduce _any_ user-facing change? After this change user `drop()/add()` works well. ### How was this patch tested? Added UT Closes #31705 from AngersZhuuuu/SPARK-34577. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-03-04 13:22:10 +08:00
Kent Yao	6093a78dbd	[SPARK-34558][SQL] warehouse path should be qualified ahead of populating and use ### What changes were proposed in this pull request? Currently, the warehouse path gets fully qualified in the caller side for creating a database, table, partition, etc. An unqualified path is populated into Spark and Hadoop confs, which leads to inconsistent API behaviors. We should make it qualified ahead. When the value is a relative path `spark.sql.warehouse.dir=lakehouse`, some behaviors become inconsistent, for example. If the default database is absent at runtime, the app fails with ```java Caused by: java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: file:./lakehouse at org.apache.hadoop.fs.Path.initialize(Path.java:263) at org.apache.hadoop.fs.Path.<init>(Path.java:254) at org.apache.hadoop.hive.metastore.Warehouse.getDnsPath(Warehouse.java:133) at org.apache.hadoop.hive.metastore.Warehouse.getDnsPath(Warehouse.java:137) at org.apache.hadoop.hive.metastore.Warehouse.getWhRoot(Warehouse.java:150) at org.apache.hadoop.hive.metastore.Warehouse.getDefaultDatabasePath(Warehouse.java:163) at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createDefaultDB_core(HiveMetaStore.java:636) at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createDefaultDB(HiveMetaStore.java:655) at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.init(HiveMetaStore.java:431) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.hadoop.hive.metastore.RetryingHMSHandler.invokeInternal(RetryingHMSHandler.java:148) at org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:107) at org.apache.hadoop.hive.metastore.RetryingHMSHandler.<init>(RetryingHMSHandler.java:79) ... 73 more ``` If the default database is present at runtime, the app can work with it, and if we create a database, it gets fully qualified, for example ```sql spark-sql> create database test; Time taken: 0.052 seconds spark-sql> desc database test; Database Name test Comment Location file:/Users/kentyao/Downloads/spark/spark-3.2.0-SNAPSHOT-bin-20210226/lakehouse/test.db Owner kentyao Time taken: 0.023 seconds, Fetched 4 row(s) ``` Another thing is that the log becomes nubilous, for example. ```logtalk 21/02/27 13:54:17 INFO SharedState: Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('datalake'). 21/02/27 13:54:17 INFO SharedState: Warehouse path is 'lakehouse'. ``` ### Why are the changes needed? fix bug and ambiguity ### Does this PR introduce _any_ user-facing change? yes, the path now resolved with proper order - `warehouse->database->table->partition` ### How was this patch tested? w/ ut added Closes #31671 from yaooqinn/SPARK-34558. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-03-02 15:14:19 +00:00
Kent Yao	1afe284ed8	[SPARK-34570][SQL] Remove dead code from constructors of [Hive]SessionStateBuilder ### What changes were proposed in this pull request? the parameter - `options` is never used. The changes here was part of https://github.com/apache/spark/pull/30642, It got reverted for easier backporting #30642 as a hotfix by `dad24543aa`, this PR brings it back to master. ### Why are the changes needed? remove unless dead code ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? Passing CI is enough. Closes #31683 from yaooqinn/SPARK-34570. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2021-03-01 09:30:18 +09:00
Angerszhuuuu	d574308864	[SPARK-34579][SQL][TEST] Fix wrong UT in SQLQuerySuite ### What changes were proposed in this pull request? Some UT in SQLQuerySuite is not incorrect, it have wrong table name in `withTable`, this pr to make it correct. ### Why are the changes needed? Fix UT ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existed UT Closes #31681 from AngersZhuuuu/SPARK-34569. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2021-02-28 16:21:42 -08:00
Shardul Mahadik	0216051aca	[SPARK-34506][CORE] ADD JAR with ivy coordinates should be compatible with Hive transitive behavior ### What changes were proposed in this pull request? SPARK-33084 added the ability to use ivy coordinates with `SparkContext.addJar`. PR #29966 claims to mimic Hive behavior although I found a few cases where it doesn't 1) The default value of the transitive parameter is false, both in case of parameter not being specified in coordinate or parameter value being invalid. The Hive behavior is that transitive is [true if not specified](`cb2ac3dcc6/ql/src/java/org/apache/hadoop/hive/ql/util/DependencyResolver.java (L169)`) in the coordinate and [false for invalid values](`cb2ac3dcc6/ql/src/java/org/apache/hadoop/hive/ql/util/DependencyResolver.java (L124)`). Also, regardless of Hive, I think a default of true for the transitive parameter also matches [ivy's own defaults](https://ant.apache.org/ivy/history/2.5.0/ivyfile/dependency.html#_attributes). 2) The parameter value for transitive parameter is regarded as case-sensitive [based on the understanding](https://github.com/apache/spark/pull/29966#discussion_r547752259) that Hive behavior is case-sensitive. However, this is not correct, Hive [treats the parameter value case-insensitively](`cb2ac3dcc6/ql/src/java/org/apache/hadoop/hive/ql/util/DependencyResolver.java (L122)`). I propose that we be compatible with Hive for these behaviors ### Why are the changes needed? To make `ADD JAR` with ivy coordinates compatible with Hive's transitive behavior ### Does this PR introduce _any_ user-facing change? The user-facing changes here are within master as the feature introduced in SPARK-33084 has not been released yet 1. Previously an ivy coordinate without `transitive` parameter specified did not resolve transitive dependency, now it does. 2. Previously an `transitive` parameter value was treated case-sensitively. e.g. `transitive=TRUE` would be treated as false as it did not match exactly `true`. Now it will be treated case-insensitively. ### How was this patch tested? Modified existing unit tests to test new behavior Add new unit test to cover usage of `exclude` with unspecified `transitive` Closes #31623 from shardulm94/spark-34506. Authored-by: Shardul Mahadik <smahadik@linkedin.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2021-03-01 09:10:20 +09:00
ulysses-you	82267acfe8	[SPARK-34550][SQL] Skip InSet null value during push filter to Hive metastore ### What changes were proposed in this pull request? Skip `InSet` null value during push filter to Hive metastore. ### Why are the changes needed? If `InSet` contains a null value, we should skip it and push other values to metastore. To keep same behavior with `In`. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Add test. Closes #31659 from ulysses-you/SPARK-34550. Authored-by: ulysses-you <ulyssesyou18@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2021-02-26 21:29:14 +09:00
ulysses-you	999d3b89b6	[SPARK-34515][SQL] Fix NPE if InSet contains null value during getPartitionsByFilter ### What changes were proposed in this pull request? Skip null value during rewrite `InSet` to `>= and <=` at getPartitionsByFilter. ### Why are the changes needed? Spark will convert `InSet` to `>= and <=` if it's values size over `spark.sql.hive.metastorePartitionPruningInSetThreshold` during pruning partition . At this case, if values contain a null, we will get such exception ``` java.lang.NullPointerException at org.apache.spark.unsafe.types.UTF8String.compareTo(UTF8String.java:1389) at org.apache.spark.unsafe.types.UTF8String.compareTo(UTF8String.java:50) at scala.math.LowPriorityOrderingImplicits$$anon$3.compare(Ordering.scala:153) at java.util.TimSort.countRunAndMakeAscending(TimSort.java:355) at java.util.TimSort.sort(TimSort.java:220) at java.util.Arrays.sort(Arrays.java:1438) at scala.collection.SeqLike.sorted(SeqLike.scala:659) at scala.collection.SeqLike.sorted$(SeqLike.scala:647) at scala.collection.AbstractSeq.sorted(Seq.scala:45) at org.apache.spark.sql.hive.client.Shim_v0_13.convert$1(HiveShim.scala:772) at org.apache.spark.sql.hive.client.Shim_v0_13.$anonfun$convertFilters$4(HiveShim.scala:826) at scala.collection.immutable.Stream.flatMap(Stream.scala:489) at org.apache.spark.sql.hive.client.Shim_v0_13.convertFilters(HiveShim.scala:826) at org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:848) at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$getPartitionsByFilter$1(HiveClientImpl.scala:750) ``` ### Does this PR introduce _any_ user-facing change? Yes, bug fix. ### How was this patch tested? Add test. Closes #31632 from ulysses-you/SPARK-34515. Authored-by: ulysses-you <ulyssesyou18@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-02-24 21:32:19 +08:00
Max Gekk	7f27d33a3c	[SPARK-31891][SQL] Support `MSCK REPAIR TABLE .. [{ADD\|DROP\|SYNC} PARTITIONS]` ### What changes were proposed in this pull request? In the PR, I propose to extend the `MSCK REPAIR TABLE` command, and support new options `{ADD\|DROP\|SYNC} PARTITIONS`. In particular: 1. Extend the logical node `RepairTable`, and add two new flags `enableAddPartitions` and `enableDropPartitions`. 2. Add similar flags to the v1 execution node `AlterTableRecoverPartitionsCommand` 3. Add new method `dropPartitions()` to `AlterTableRecoverPartitionsCommand` which drops partitions from the catalog if their locations in the file system don't exist. 4. Updated public docs about the `MSCK REPAIR TABLE` command: <img width="1037" alt="Screenshot 2021-02-16 at 13 46 39" src="https://user-images.githubusercontent.com/1580697/108052607-7446d280-705d-11eb-8e25-7398254787a4.png"> Closes #31097 ### Why are the changes needed? - The changes allow to recover tables with removed partitions. The example below portraits the problem: ```sql spark-sql> create table tbl2 (col int, part int) partitioned by (part); spark-sql> insert into tbl2 partition (part=1) select 1; spark-sql> insert into tbl2 partition (part=0) select 0; spark-sql> show table extended like 'tbl2' partition (part = 0); default tbl2 false Partition Values: [part=0] Location: file:/Users/maximgekk/proj/apache-spark/spark-warehouse/tbl2/part=0 ... ``` Remove the partition (part = 0) from the filesystem: ``` $ rm -rf /Users/maximgekk/proj/apache-spark/spark-warehouse/tbl2/part=0 ``` Even after recovering, we cannot query the table: ```sql spark-sql> msck repair table tbl2; spark-sql> select * from tbl2; 21/01/08 22:49:13 ERROR SparkSQLDriver: Failed in [select * from tbl2] org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/Users/maximgekk/proj/apache-spark/spark-warehouse/tbl2/part=0 ``` - To have feature parity with Hive: https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-RecoverPartitions(MSCKREPAIRTABLE) ### Does this PR introduce _any_ user-facing change? Yes. After the changes, we can query recovered table: ```sql spark-sql> msck repair table tbl2 sync partitions; spark-sql> select * from tbl2; 1 1 spark-sql> show partitions tbl2; part=1 ``` ### How was this patch tested? - By running the modified test suite: ``` $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly MsckRepairTableParserSuite" $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly PlanResolutionSuite" $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly AlterTableRecoverPartitionsSuite" $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly AlterTableRecoverPartitionsParallelSuite" ``` - Added unified v1 and v2 tests for `MSCK REPAIR TABLE`: ``` $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *MsckRepairTableSuite" ``` Closes #31499 from MaxGekk/repair-table-drop-partitions. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2021-02-23 13:45:15 -08:00
Wenchen Fan	0d5d248bdc	[SPARK-34508][SQL][TEST] Skip HiveExternalCatalogVersionsSuite if network is down ### What changes were proposed in this pull request? It's possible that the network is down when running Spark tests, and it's annoying to see `HiveExternalCatalogVersionsSuite` keep failing. This PR proposes to skip this test suite if we can't get the latest Spark version from the Apache website. ### Why are the changes needed? Make the Spark tests more robust. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? N/A Closes #31627 from cloud-fan/test. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2021-02-23 13:35:29 -08:00
Max Gekk	23a5996a46	[SPARK-34450][SQL][TESTS] Unify v1 and v2 ALTER TABLE .. RENAME tests ### What changes were proposed in this pull request? 1. Move parser tests from `DDLParserSuite` to `AlterTableRenameParserSuite`. 2. Port DS v1 tests from `DDLSuite` and other test suites to `v1.AlterTableRenameBase` and to `v1.AlterTableRenameSuite`. 3. Add a test for DSv2 `ALTER TABLE .. RENAME` to `v2.AlterTableRenameSuite`. ### Why are the changes needed? To improve test coverage. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running new test suites: ``` $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly AlterTableRenameSuite" $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly AlterTableRenameParserSuite" ``` Closes #31575 from MaxGekk/unify-rename-table-tests. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-02-22 08:36:16 +00:00
Max Gekk	5957bc18a1	[SPARK-34451][SQL] Add alternatives for datetime rebasing SQL configs and deprecate legacy configs ### What changes were proposed in this pull request? Move the datetime rebase SQL configs from the `legacy` namespace by: 1. Renaming of the existing rebase configs like `spark.sql.legacy.parquet.datetimeRebaseModeInRead` -> `spark.sql.parquet.datetimeRebaseModeInRead`. 2. Add the legacy configs as alternatives 3. Deprecate the legacy rebase configs. ### Why are the changes needed? The rebasing SQL configs like `spark.sql.legacy.parquet.datetimeRebaseModeInRead` can be used not only for migration from previous Spark versions but also to read/write datatime columns saved by other systems/frameworks/libs. So, the configs shouldn't be considered as legacy configs. ### Does this PR introduce _any_ user-facing change? Should not. Users will see a warning if they still use one of the legacy configs. ### How was this patch tested? 1. Manually checking new configs: ```scala scala> spark.conf.get("spark.sql.parquet.datetimeRebaseModeInRead") res0: String = EXCEPTION scala> spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "LEGACY") 21/02/17 14:57:10 WARN SQLConf: The SQL config 'spark.sql.legacy.parquet.datetimeRebaseModeInRead' has been deprecated in Spark v3.2 and may be removed in the future. Use 'spark.sql.parquet.datetimeRebaseModeInRead' instead. scala> spark.conf.get("spark.sql.parquet.datetimeRebaseModeInRead") res2: String = LEGACY ``` 2. By running a datetime rebasing test suite: ``` $ build/sbt "test:testOnly *ParquetRebaseDatetimeV1Suite" ``` Closes #31576 from MaxGekk/rebase-confs-alternatives. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-02-17 14:04:47 +00:00
Max Gekk	03161055de	[SPARK-34424][SQL][TESTS] Fix failures of HiveOrcHadoopFsRelationSuite ### What changes were proposed in this pull request? Modify `RandomDataGenerator.forType()` to allow generation of dates/timestamps that are valid in both Julian and Proleptic Gregorian calendars. Currently, the function can produce a date (for example `1582-10-06`) which is valid in the Proleptic Gregorian calendar. Though it cannot be saved to ORC files AS IS since ORC format (ORC libs in fact) assumes Julian calendar. So, Spark shifts `1582-10-06` to the next valid date `1582-10-15` while saving it to ORC files. And as a consequence of that, the test fails because it compares original date `1582-10-06` and the date `1582-10-15` loaded back from the ORC files. In this PR, I propose to generate valid dates/timestamps in both calendars for ORC datasource till SPARK-34440 is resolved. ### Why are the changes needed? The changes fix failures of `HiveOrcHadoopFsRelationSuite`. For instance, the test "test all data types" fails with the seed 610710213676: ``` == Results == !== Correct Answer - 20 == == Spark Answer - 20 == struct<index:int,col:date> struct<index:int,col:date> ... ![9,1582-10-06] [9,1582-10-15] ``` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running the modified test suite: ``` $ build/sbt -Phive -Phive-thriftserver "test:testOnly *HiveOrcHadoopFsRelationSuite" ``` Closes #31552 from MaxGekk/fix-HiveOrcHadoopFsRelationSuite. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2021-02-16 11:53:26 +09:00
Angerszhuuuu	123365e05c	[SPARK-34240][SQL] Unify output of `SHOW TBLPROPERTIES` clause's output attribute's schema and ExprID ### What changes were proposed in this pull request? Passing around the output attributes should have more benefits like keeping the exprID unchanged to avoid bugs when we apply more operators above the command output DataFrame. This PR did 2 things ： 1. After this pr, a `SHOW TBLPROPERTIES` clause's output shows `key` and `value` columns whether you specify the table property `key`. Before this pr, a `SHOW TBLPROPERTIES` clause's output only show a `value` column when you specify the table property `key`.. 2. Keep `SHOW TBLPROPERTIES` command's output attribute exprId unchanged. ### Why are the changes needed? 1. Keep `SHOW TBLPROPERTIES`'s output schema consistence 2. Keep `SHOW TBLPROPERTIES` command's output attribute exprId unchanged. ### Does this PR introduce _any_ user-facing change? After this pr, a `SHOW TBLPROPERTIES` clause's output shows `key` and `value` columns whether you specify the table property `key`. Before this pr, a `SHOW TBLPROPERTIES` clause's output only show a `value` column when you specify the table property `key`. Before this PR: ``` sql > SHOW TBLPROPERTIES tabe_name('key') value value_of_key ``` After this PR ``` sql > SHOW TBLPROPERTIES tabe_name('key') key value key value_of_key ``` ### How was this patch tested? Added UT Closes #31378 from AngersZhuuuu/SPARK-34240. Lead-authored-by: Angerszhuuuu <angers.zhu@gmail.com> Co-authored-by: AngersZhuuuu <angers.zhu@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-02-10 03:19:52 +00:00
“attilapiros”	cc508d17c7	[SPARK-34370][SQL] Support Avro schema evolution for partitioned Hive tables using "avro.schema.url" ### What changes were proposed in this pull request? With https://github.com/apache/spark/pull/31133 Avro schema evolution is introduce for partitioned hive tables where the schema is given by `avro.schema.literal`. Here that functionality is extended to support schema evolution where the schema is defined via `avro.schema.url`. ### Why are the changes needed? Without this PR the problem described in https://github.com/apache/spark/pull/31133 can be reproduced by tables where `avro.schema.url` is used. As in this case always the property value given at partition level is used for the `avro.schema.url`. So for example when a new column (with a default value) is added to the table then one the following problem happens: - when the new field is added after the last one the cell values will be null values instead of the default value - when the schema is extended somewhere before the last field then values will be listed for the wrong column positions Similar error will happen when one of the field is removed from the schema. For details please check the attached unit tests where both cases are checked. ### Does this PR introduce _any_ user-facing change? Fixes the potential value error. ### How was this patch tested? The existing unit tests for schema evolution is generalized and reused. New tests: - `SPARK-34370: support Avro schema evolution (add column with avro.schema.url)` - `SPARK-34370: support Avro schema evolution (remove column with avro.schema.url)` Closes #31501 from attilapiros/SPARK-34370. Authored-by: “attilapiros” <piros.attila.zsolt@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2021-02-06 17:25:39 -08:00
“attilapiros”	e614f34c7a	[SPARK-26836][SQL] Supporting Avro schema evolution for partitioned Hive tables with "avro.schema.literal" ### What changes were proposed in this pull request? Before this PR for a partitioned Avro Hive table when the SerDe is configured to read the partition data the table level properties were overwritten by the partition level properties. This PR changes this ordering by giving table level properties higher precedence thus when a new evolved schema is set for the table this new schema will be used to read the partition data and not the original schema which was used for writing the data. This new behavior is consistent with Apache Hive. See the example used in the unit test `SPARK-26836: support Avro schema evolution`, in Hive this results in: ``` 0: jdbc:hive2://<IP>:10000> select * from t; INFO : Compiling command(queryId=hive_20210111141102_7a6349d0-f9ed-4aad-ac07-b94b44de2394): select * from t INFO : Semantic Analysis Completed INFO : Returning Hive schema: Schema(fieldSchemas:[FieldSchema(name:t.col1, type:string, comment:null), FieldSchema(name:t.col2, type:string, comment:null), FieldSchema(name:t.ds, type:string, comment:null)], properties:null) INFO : Completed compiling command(queryId=hive_20210111141102_7a6349d0-f9ed-4aad-ac07-b94b44de2394); Time taken: 0.098 seconds INFO : Executing command(queryId=hive_20210111141102_7a6349d0-f9ed-4aad-ac07-b94b44de2394): select * from t INFO : Completed executing command(queryId=hive_20210111141102_7a6349d0-f9ed-4aad-ac07-b94b44de2394); Time taken: 0.013 seconds INFO : OK +---------------+-------------+-------------+ \| t.col1 \| t.col2 \| t.ds \| +---------------+-------------+-------------+ \| col1_default \| col2_value \| 1981-01-07 \| \| col1_value \| col2_value \| 1983-04-27 \| +---------------+-------------+-------------+ 2 rows selected (0.159 seconds) ``` ### Why are the changes needed? Without this change the old schema would be used. This can use a correctness issue when the new schema introduces a new field with a default value (following the rules of schema evolution) before an existing field. In this case the rows coming from the partition where the old schema was used will contain values in wrong column positions. For example check the attached unit test `SPARK-26836: support Avro schema evolution` Without this fix the result of the select on the table would be: ``` +----------+----------+----------+ \| col1\| col2\| ds\| +----------+----------+----------+ \|col2_value\| null\|1981-01-07\| \|col1_value\|col2_value\|1983-04-27\| +----------+----------+----------+ ``` With this fix: ``` +------------+----------+----------+ \| col1\| col2\| ds\| +------------+----------+----------+ \|col1_default\|col2_value\|1981-01-07\| \| col1_value\|col2_value\|1983-04-27\| +------------+----------+----------+ ``` ### Does this PR introduce _any_ user-facing change? Just fixes the value errors. When a new column is introduced even to the last position then instead of 'null' the given default will be used. ### How was this patch tested? This was tested with the unit tested included to the PR. And manually on Apache Spark / Hive. Closes #31133 from attilapiros/SPARK-26836. Authored-by: “attilapiros” <piros.attila.zsolt@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2021-02-05 10:56:25 -08:00
Terry Kim	a1d4bb3300	[SPARK-34313][SQL] Migrate ALTER TABLE SET/UNSET TBLPROPERTIES commands to use UnresolvedTable to resolve the identifier ### What changes were proposed in this pull request? This PR proposes to migrate `ALTER TABLE ... SET/UNSET TBLPROPERTIES` to use `UnresolvedTable` to resolve the table identifier. This allows consistent resolution rules (temp view first, etc.) to be applied for both v1/v2 commands. More info about the consistent resolution rule proposal can be found in [JIRA](https://issues.apache.org/jira/browse/SPARK-29900) or [proposal doc](https://docs.google.com/document/d/1hvLjGA8y_W_hhilpngXVub1Ebv8RsMap986nENCFnrg/edit?usp=sharing). ### Why are the changes needed? This is a part of effort to make the relation lookup behavior consistent: [SPARK-29900](https://issues.apache.org/jira/browse/SPARK-29900). ### Does this PR introduce _any_ user-facing change? After this PR, `ALTER TABLE SET/UNSET TBLPROPERTIES` will have a consistent resolution behavior. ### How was this patch tested? Updated existing tests / added new tests. Closes #31422 from imback82/v2_alter_table_set_unset_properties. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-02-03 05:44:58 +00:00
Max Gekk	79515b82f1	[SPARK-34282][SQL][TESTS] Unify v1 and v2 TRUNCATE TABLE tests ### What changes were proposed in this pull request? 1. Move parser tests from `DDLParserSuite` to `TruncateTableParserSuite`. 2. Port DS v1 tests from `DDLSuite` and other test suites to `v1.TruncateTableSuiteBase` and to `v1.TruncateTableSuite`. 3. Add a test for DSv2 `TRUNCATE TABLE` to `v2.TruncateTableSuite`. ### Why are the changes needed? To improve test coverage. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running new test suites: ``` $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly TruncateTableSuite" $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly CatalogedDDLSuite" ``` Closes #31387 from MaxGekk/unify-truncate-table-tests. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-02-02 14:32:35 +00:00
Terry Kim	f024d3051c	[SPARK-34317][SQL] Introduce relationTypeMismatchHint to UnresolvedTable for a better error message ### What changes were proposed in this pull request? This PR proposes to add `relationTypeMismatchHint` to `UnresolvedTable` so that if a relation is resolved to a view when a table is expected, a hint message can be included as a part of the analysis exception message. Note that the same feature is already introduced to `UnresolvedView` in #30636. This mostly affects `ALTER TABLE` commands where the analysis exception message will now contain `Please use ALTER VIEW as instead`. ### Why are the changes needed? To give a better error message. (The hint used to exist but got removed for commands that migrated to the new resolution framework) ### Does this PR introduce _any_ user-facing change? Yes, now `ALTER TABLE` commands include a hint to use `ALTER VIEW` instead. ``` sql("ALTER TABLE v SET SERDE 'whatever'") ``` Before: ``` "v is a view. 'ALTER TABLE ... SET [SERDE\|SERDEPROPERTIES]' expects a table. ``` After this PR: ``` "v is a view. 'ALTER TABLE ... SET [SERDE\|SERDEPROPERTIES]' expects a table. Please use ALTER VIEW instead. ``` ### How was this patch tested? Updated existing test cases to include the hint. Closes #31424 from imback82/better_error. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-02-02 08:24:44 +00:00
HyukjinKwon	30468a9015	[SPARK-34306][SQL][PYTHON][R] Use Snake naming rule across the function APIs ### What changes were proposed in this pull request? This PR completes snake_case rule at functions APIs across the languages, see also SPARK-10621. In more details, this PR: - Adds `count_distinct` in Scala Python, and R, and document that `count_distinct` is encouraged. This was not deprecated because `countDistinct` is pretty commonly used. We could deprecate in the future releases. - (Scala-specific) adds `typedlit` but doesn't deprecate `typedLit` which is arguably commonly used. Likewise, we could deprecate in the future releases. - Deprecates and renames: - `sumDistinct` -> `sum_distinct` - `bitwiseNOT` -> `bitwise_not` - `shiftLeft` -> `shiftleft` (matched with SQL name in `FunctionRegistry`) - `shiftRight` -> `shiftright` (matched with SQL name in `FunctionRegistry`) - `shiftRightUnsigned` -> `shiftrightunsigned` (matched with SQL name in `FunctionRegistry`) - (Scala-specific) `callUDF` -> `call_udf` ### Why are the changes needed? To keep the consistent naming in APIs. ### Does this PR introduce _any_ user-facing change? Yes, it deprecates some APIs and add new renamed APIs as described above. ### How was this patch tested? Unittests were added. Closes #31408 from HyukjinKwon/SPARK-34306. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2021-02-02 09:29:40 +09:00
Angerszhuuuu	74116b6b25	[SPARK-34239][SQL] Unify output of SHOW COLUMNS pass output attributes properly ### What changes were proposed in this pull request? Passing around the output attributes should have more benefits like keeping the expr ID unchanged to avoid bugs when we apply more operators above the command output dataframe. This PR keep SHOW COLUMNS command's output attribute exprId unchanged. ### Why are the changes needed? Keep SHOW PARTITIONS command's output attribute exprid unchanged. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added UT Closes #31377 from AngersZhuuuu/SPARK-34239. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-02-01 14:16:03 +00:00
Max Gekk	0837c1aa3d	[SPARK-34303][SQL] Migrate ALTER TABLE .. SET LOCATION to new resolution framework ### What changes were proposed in this pull request? 1. Remove old statement `AlterTableSetLocationStatement` 2. Introduce new command `AlterTableSetLocation` for `ALTER TABLE .. SET LOCATION`. ### Why are the changes needed? This is a part of effort to make the relation lookup behavior consistent: SPARK-29900. ### Does this PR introduce _any_ user-facing change? It can change the error message for views. ### How was this patch tested? By running `ALTER TABLE .. SET LOCATION` tests: ``` $ build/sbt -Phive -Phive-thriftserver "test:testOnly DataSourceV2SQLSuite" $ build/sbt -Phive -Phive-thriftserver "test:testOnly CatalogedDDLSuite" ``` Closes #31414 from MaxGekk/migrate-set-location-resolv-table. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-02-01 13:41:15 +00:00
Terry Kim	a8eb443bf8	[SPARK-34299][SQL] Clean up ResolveSessionCatalog's isTempView and isTempFunction ### What changes were proposed in this pull request? `ResolveSessionCatalog`'s `isTempView` and `isTempFunction` are not being used anymore since the resolution of temp view/function has moved to `Analyzer`. This PR proposes to remove `isTempView` and `isTempFunction` from `ResolveSessionCatalog`. ### Why are the changes needed? To clean up unused variables. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing tests should cover as this PR just removes the unused variables. Closes #31400 from imback82/cleanup_resolve_session_catalog. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2021-01-31 13:03:30 +09:00
Bo Zhang	3f350dbd78	[SPARK-33212][FOLLOW-UP][BUILD] Fix test "built-in Hadoop version should support shaded client" for hadoop-2.7 ### What changes were proposed in this pull request? We added test "built-in Hadoop version should support shaded client" in https://github.com/apache/spark/pull/31203, but it fails when profile hadoop-2.7 is activated. This change fixes the test by skipping the assertion when Hadoop version is 2. ### Why are the changes needed? The test fails in master branch when profile hadoop-2.7 is activated. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Ran the test with hadoop-2.7 profile. Closes #31391 from bozhang2820/fix-hadoop-2-version-test. Authored-by: Bo Zhang <bo.zhang@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2021-01-29 15:47:02 +09:00
ulysses-you	72b7f8abfb	[SPARK-34261][SQL] Avoid side effect if create exists temporary function ### What changes were proposed in this pull request? Add function exists check before load resource. ### Why are the changes needed? We should not add jar into classpath if the create temporary function is already exists. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Add test. Closes #31358 from ulysses-you/SPARK-34261. Authored-by: ulysses-you <ulyssesyou18@gmail.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2021-01-29 10:39:02 +09:00
Yuming Wang	a7683afdf4	[SPARK-26346][BUILD][SQL] Upgrade Parquet to 1.11.1 ### What changes were proposed in this pull request? This PR upgrade Parquet to 1.11.1. Parquet 1.11.1 new features: - [PARQUET-1201](https://issues.apache.org/jira/browse/PARQUET-1201) - Column indexes - [PARQUET-1253](https://issues.apache.org/jira/browse/PARQUET-1253) - Support for new logical type representation - [PARQUET-1388](https://issues.apache.org/jira/browse/PARQUET-1388) - Nanosecond precision time and timestamp - parquet-mr More details: https://github.com/apache/parquet-mr/blob/apache-parquet-1.11.1/CHANGES.md ### Why are the changes needed? Support column indexes to improve query performance. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Existing test. Closes #26804 from wangyum/SPARK-26346. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Yuming Wang <yumwang@ebay.com>	2021-01-29 08:07:49 +08:00
Max Gekk	d242166b8f	[SPARK-34262][SQL] Refresh cached data of v1 table in `ALTER TABLE .. SET LOCATION` ### What changes were proposed in this pull request? Invoke `CatalogImpl.refreshTable()` in v1 implementation of the `ALTER TABLE .. SET LOCATION` command to refresh cached table data. ### Why are the changes needed? The example below portraits the issue: - Create a source table: ```sql spark-sql> CREATE TABLE src_tbl (c0 int, part int) USING hive PARTITIONED BY (part); spark-sql> INSERT INTO src_tbl PARTITION (part=0) SELECT 0; spark-sql> SHOW TABLE EXTENDED LIKE 'src_tbl' PARTITION (part=0); default src_tbl false Partition Values: [part=0] Location: file:/Users/maximgekk/proj/refresh-cache-set-location/spark-warehouse/src_tbl/part=0 ... ``` - Set new location for the empty partition (part=0): ```sql spark-sql> CREATE TABLE dst_tbl (c0 int, part int) USING hive PARTITIONED BY (part); spark-sql> ALTER TABLE dst_tbl ADD PARTITION (part=0); spark-sql> INSERT INTO dst_tbl PARTITION (part=1) SELECT 1; spark-sql> CACHE TABLE dst_tbl; spark-sql> SELECT * FROM dst_tbl; 1 1 spark-sql> ALTER TABLE dst_tbl PARTITION (part=0) SET LOCATION '/Users/maximgekk/proj/refresh-cache-set-location/spark-warehouse/src_tbl/part=0'; spark-sql> SELECT * FROM dst_tbl; 1 1 ``` The last query does not return new loaded data. ### Does this PR introduce _any_ user-facing change? Yes. After the changes, the example above works correctly: ```sql spark-sql> ALTER TABLE dst_tbl PARTITION (part=0) SET LOCATION '/Users/maximgekk/proj/refresh-cache-set-location/spark-warehouse/src_tbl/part=0'; spark-sql> SELECT * FROM dst_tbl; 0 0 1 1 ``` ### How was this patch tested? Added new test to `org.apache.spark.sql.hive.CachedTableSuite`: ``` $ build/sbt -Phive -Phive-thriftserver "test:testOnly *CachedTableSuite" ``` Closes #31361 from MaxGekk/refresh-cache-set-location. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2021-01-28 15:05:22 +09:00
Chao Sun	6ec3cf6219	[SPARK-34271][SQL] Use majorMinorPatchVersion for Hive version parsing ### What changes were proposed in this pull request? Use `majorMinorPatchVersion` to check major & minor version in `IsolatedClientLoader.hiveVersion`. ### Why are the changes needed? Currently `IsolatedClientLoader.hiveVersion` needs to enumerate all Hive patch versions. Therefore, whenever we upgrade Hive version we'd need to remember to update the method as well. It would be better if we just check major & minor version. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? This is a refactoring and relies on existing tests. Closes #31371 from sunchao/replace-hive-version. Authored-by: Chao Sun <sunchao@apple.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2021-01-28 14:00:10 +09:00
Max Gekk	1318be7ee9	[SPARK-34267][SQL] Remove `refreshTable()` from `SessionState` ### What changes were proposed in this pull request? Remove `SessionState.refreshTable()` and modify the tests where the method is used. ### Why are the changes needed? There are already 2 methods with the same name in: - `SessionCatalog` - `CatalogImpl` One more method in `SessionState` does not give any benefits. By removing it, we can improve code maintenance. ### Does this PR introduce _any_ user-facing change? Should not because `SessionState` is an internal class. ### How was this patch tested? By running the modified test suites: ``` $ build/sbt -Phive -Phive-thriftserver "test:testOnly MetastoreDataSourcesSuite" $ build/sbt -Phive -Phive-thriftserver "test:testOnly HiveOrcQuerySuite" $ build/sbt -Phive -Phive-thriftserver "test:testOnly *HiveParquetMetastoreSuite" ``` Closes #31366 from MaxGekk/remove-refreshTable-from-SessionState. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2021-01-27 09:43:59 -08:00
Chao Sun	abf7e81712	[SPARK-33212][FOLLOW-UP][BUILD] Bring back duplicate dependency check and add more strict Hadoop version check ### What changes were proposed in this pull request? 1. Add back Maven enforcer for duplicate dependencies check 2. More strict check on Hadoop versions which support shaded client in `IsolatedClientLoader`. To do proper version check, this adds a util function `majorMinorPatchVersion` to extract major/minor/patch version from a string. 3. Cleanup unnecessary code ### Why are the changes needed? The Maven enforcer was removed as part of #30556. This proposes to add it back. Also, Hadoop shaded client doesn't work in certain cases (see [these comments](https://github.com/apache/spark/pull/30701#discussion_r558522227) for details). This strictly checks that the current Hadoop version (i.e., 3.2.2 at the moment) has good support of shaded client or otherwise fallback to old unshaded ones. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing tests. Closes #31203 from sunchao/SPARK-33212-followup. Lead-authored-by: Chao Sun <sunchao@apple.com> Co-authored-by: Chao Sun <sunchao@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2021-01-26 15:34:55 -08:00
Max Gekk	ac8307d75c	[SPARK-34215][SQL] Keep tables cached after truncation ### What changes were proposed in this pull request? Invoke `CatalogImpl.refreshTable()` instead of combination of `SessionCatalog.refreshTable()` + `uncacheQuery()`. This allows to clear cached table data while keeping the table cached. ### Why are the changes needed? 1. To improve user experience with Spark SQL 2. To be consistent to other commands, see https://github.com/apache/spark/pull/31206 ### Does this PR introduce _any_ user-facing change? Yes. Before: ```scala scala> sql("CREATE TABLE tbl (c0 int)") res1: org.apache.spark.sql.DataFrame = [] scala> sql("INSERT INTO tbl SELECT 0") res2: org.apache.spark.sql.DataFrame = [] scala> sql("CACHE TABLE tbl") res3: org.apache.spark.sql.DataFrame = [] scala> sql("SELECT * FROM tbl").show(false) +---+ \|c0 \| +---+ \|0 \| +---+ scala> spark.catalog.isCached("tbl") res5: Boolean = true scala> sql("TRUNCATE TABLE tbl") res6: org.apache.spark.sql.DataFrame = [] scala> spark.catalog.isCached("tbl") res7: Boolean = false ``` After: ```scala scala> sql("TRUNCATE TABLE tbl") res6: org.apache.spark.sql.DataFrame = [] scala> spark.catalog.isCached("tbl") res7: Boolean = true ``` ### How was this patch tested? Added new test to `CachedTableSuite`: ``` $ build/sbt -Phive -Phive-thriftserver "test:testOnly CachedTableSuite" $ build/sbt -Phive -Phive-thriftserver "test:testOnly CatalogedDDLSuite" ``` Closes #31308 from MaxGekk/truncate-table-cached. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-01-26 15:36:44 +00:00
yangjie01	8999e8805d	[SPARK-34224][CORE][SQL][SS][DSTREAM][YARN][TEST][EXAMPLES] Ensure all resource opened by `Source.fromXXX` are closed ### What changes were proposed in this pull request? Using a function like `.mkString` or `.getLines` directly on a `scala.io.Source` opened by `fromFile`, `fromURL`, `fromURI ` will leak the underlying file handle, this pr use the `Utils.tryWithResource` method wrap the `BufferedSource` to ensure these `BufferedSource` closed. ### Why are the changes needed? Avoid file handle leak. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass the Jenkins or GitHub Action Closes #31323 from LuciferYang/source-not-closed. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2021-01-26 19:06:37 +09:00
Yuanjian Li	0a1a029622	[SPARK-34235][SS] Make spark.sql.hive as a private package ### What changes were proposed in this pull request? Follow the comment https://github.com/apache/spark/pull/31271#discussion_r562598983: - Remove the API tag `Unstable` for `HiveSessionStateBuilder` - Add document for spark.sql.hive package to emphasize it's a private package ### Why are the changes needed? Follow the rule for a private package. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Doc change only. Closes #31321 from xuanyuanking/SPARK-34185-follow. Authored-by: Yuanjian Li <yuanjian.li@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2021-01-26 17:13:11 +09:00
Angerszhuuuu	7bd4165c11	[SPARK-32852][SQL][FOLLOW_UP] Add notice about keep hive version consistence when config hive jars location ### What changes were proposed in this pull request? Add notice about keep hive version consistence when config hive jars location With PR #29881, if we don't keep hive version consistence. we will got below error. ``` Builtin jars can only be used when hive execution version == hive metastore version. Execution: 2.3.8 != Metastore: 1.2.1. Specify a valid path to the correct hive jars using spark.sql.hive.metastore.jars or change spark.sql.hive.metastore.version to 2.3.8. ``` ![image](https://user-images.githubusercontent.com/46485123/105795169-512d8380-5fc7-11eb-97c3-0259a0d2aa58.png) ### Why are the changes needed? Make config doc detail ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Not need Closes #31317 from AngersZhuuuu/SPARK-32852-followup. Authored-by: Angerszhuuuu <angers.zhu@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2021-01-26 13:40:20 +09:00
Kent Yao	d1177b5230	[SPARK-34192][SQL] Move char padding to write side and remove length check on read side too ### What changes were proposed in this pull request? On the read-side, the char length check and padding bring issues to CBO and predicate pushdown and other issues to the catalyst. This PR reverts `6da5cdf1db` that added read side length check) so that we only do length check for the write side, and data sources/vendors are responsible to enforce the char/varchar constraints for data import operations like ADD PARTITION. It doesn't make sense for Spark to report errors on the read-side if the data is already dirty. This PR also moves the char padding to the write-side, so that it 1) avoids read side issues like CBO and filter pushdown. 2) the data source can preserve char type semantic better even if it's read by systems other than Spark. ### Why are the changes needed? fix perf regression when tables have char/varchar type columns closes #31278 ### Does this PR introduce _any_ user-facing change? yes, spark will not raise error for oversized char/varchar values in read side ### How was this patch tested? modified ut the dropped read side benchmark ``` ================================================================================================ Char Varchar Read Side Perf w/o Tailing Spaces ================================================================================================ Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16 Intel(R) Core(TM) i9-9980HK CPU 2.40GHz Read with length 20: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ read string with length 20 1564 1573 9 63.9 15.6 1.0X read char with length 20 1532 1551 18 65.3 15.3 1.0X read varchar with length 20 1520 1531 13 65.8 15.2 1.0X Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16 Intel(R) Core(TM) i9-9980HK CPU 2.40GHz Read with length 40: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ read string with length 40 1573 1613 41 63.6 15.7 1.0X read char with length 40 1575 1577 2 63.5 15.7 1.0X read varchar with length 40 1568 1576 11 63.8 15.7 1.0X Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16 Intel(R) Core(TM) i9-9980HK CPU 2.40GHz Read with length 60: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ read string with length 60 1526 1540 23 65.5 15.3 1.0X read char with length 60 1514 1539 23 66.0 15.1 1.0X read varchar with length 60 1486 1497 10 67.3 14.9 1.0X Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16 Intel(R) Core(TM) i9-9980HK CPU 2.40GHz Read with length 80: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ read string with length 80 1531 1542 19 65.3 15.3 1.0X read char with length 80 1514 1529 15 66.0 15.1 1.0X read varchar with length 80 1524 1565 42 65.6 15.2 1.0X Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16 Intel(R) Core(TM) i9-9980HK CPU 2.40GHz Read with length 100: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ read string with length 100 1597 1623 25 62.6 16.0 1.0X read char with length 100 1499 1512 16 66.7 15.0 1.1X read varchar with length 100 1517 1524 8 65.9 15.2 1.1X ================================================================================================ Char Varchar Read Side Perf w/ Tailing Spaces ================================================================================================ Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16 Intel(R) Core(TM) i9-9980HK CPU 2.40GHz Read with length 20: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ read string with length 20 1524 1526 1 65.6 15.2 1.0X read char with length 20 1532 1537 9 65.3 15.3 1.0X read varchar with length 20 1520 1532 15 65.8 15.2 1.0X Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16 Intel(R) Core(TM) i9-9980HK CPU 2.40GHz Read with length 40: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ read string with length 40 1556 1580 32 64.3 15.6 1.0X read char with length 40 1600 1611 17 62.5 16.0 1.0X read varchar with length 40 1648 1716 88 60.7 16.5 0.9X Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16 Intel(R) Core(TM) i9-9980HK CPU 2.40GHz Read with length 60: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ read string with length 60 1504 1524 20 66.5 15.0 1.0X read char with length 60 1509 1512 3 66.2 15.1 1.0X read varchar with length 60 1519 1535 21 65.8 15.2 1.0X Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16 Intel(R) Core(TM) i9-9980HK CPU 2.40GHz Read with length 80: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ read string with length 80 1640 1652 17 61.0 16.4 1.0X read char with length 80 1625 1666 35 61.5 16.3 1.0X read varchar with length 80 1590 1605 13 62.9 15.9 1.0X Java HotSpot(TM) 64-Bit Server VM 1.8.0_251-b08 on Mac OS X 10.16 Intel(R) Core(TM) i9-9980HK CPU 2.40GHz Read with length 100: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ read string with length 100 1622 1628 5 61.6 16.2 1.0X read char with length 100 1614 1646 30 62.0 16.1 1.0X read varchar with length 100 1594 1606 11 62.7 15.9 1.0X ``` Closes #31281 from yaooqinn/SPARK-34192. Authored-by: Kent Yao <yao@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-01-26 02:08:35 +08:00
Yuanjian Li	59cbacaddf	[SPARK-34185][DOCS] Review and fix issues in API docs ### What changes were proposed in this pull request? Compare the 3.1.1 API doc with the latest release version 3.0.1. Fix the following issues: - Add missing `Since` annotation for new APIs - Remove the leaking class/object in API doc ### Why are the changes needed? Fix the issues in the Spark 3.1.1 release API docs. ### Does this PR introduce _any_ user-facing change? Yes, API doc changes. ### How was this patch tested? Manually test. Closes #31271 from xuanyuanking/SPARK-34185. Lead-authored-by: Yuanjian Li <yuanjian.li@databricks.com> Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2021-01-25 11:38:20 +09:00
Max Gekk	f8bf72ed5d	[SPARK-34213][SQL] Refresh cached data of v1 table in `LOAD DATA` ### What changes were proposed in this pull request? Invoke `CatalogImpl.refreshTable()` instead of `SessionCatalog.refreshTable` in v1 implementation of the `LOAD DATA` command. `SessionCatalog.refreshTable` just refreshes metadata comparing to `CatalogImpl.refreshTable()` which refreshes cached table data as well. ### Why are the changes needed? The example below portraits the issue: - Create a source table: ```sql spark-sql> CREATE TABLE src_tbl (c0 int, part int) USING hive PARTITIONED BY (part); spark-sql> INSERT INTO src_tbl PARTITION (part=0) SELECT 0; spark-sql> SHOW TABLE EXTENDED LIKE 'src_tbl' PARTITION (part=0); default src_tbl false Partition Values: [part=0] Location: file:/Users/maximgekk/proj/load-data-refresh-cache/spark-warehouse/src_tbl/part=0 ... ``` - Load data from the source table to a cached destination table: ```sql spark-sql> CREATE TABLE dst_tbl (c0 int, part int) USING hive PARTITIONED BY (part); spark-sql> INSERT INTO dst_tbl PARTITION (part=1) SELECT 1; spark-sql> CACHE TABLE dst_tbl; spark-sql> SELECT * FROM dst_tbl; 1 1 spark-sql> LOAD DATA LOCAL INPATH '/Users/maximgekk/proj/load-data-refresh-cache/spark-warehouse/src_tbl/part=0' INTO TABLE dst_tbl PARTITION (part=0); spark-sql> SELECT * FROM dst_tbl; 1 1 ``` The last query does not return new loaded data. ### Does this PR introduce _any_ user-facing change? Yes. After the changes, the example above works correctly: ```sql spark-sql> LOAD DATA LOCAL INPATH '/Users/maximgekk/proj/load-data-refresh-cache/spark-warehouse/src_tbl/part=0' INTO TABLE dst_tbl PARTITION (part=0); spark-sql> SELECT * FROM dst_tbl; 0 0 1 1 ``` ### How was this patch tested? Added new test to `org.apache.spark.sql.hive.CachedTableSuite`: ``` $ build/sbt -Phive -Phive-thriftserver "test:testOnly *CachedTableSuite" ``` Closes #31304 from MaxGekk/load-data-refresh-cache. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2021-01-23 15:49:10 -08:00
yangjie01	e48a8ad1a2	[SPARK-34202][SQL][TEST] Add ability to fetch spark release package from internal environment in HiveExternalCatalogVersionsSuite ### What changes were proposed in this pull request? `HiveExternalCatalogVersionsSuite` can't run in orgs internal environment where access to outside internet is not allowed because `HiveExternalCatalogVersionsSuite` will download spark release package from internet. Similar to SPARK-32998, this pr add 1 environment variables `SPARK_RELEASE_MIRROR` to let user can specify an accessible download address of spark release package and run `HiveExternalCatalogVersionsSuite` in orgs internal environment. ### Why are the changes needed? Let `HiveExternalCatalogVersionsSuite` can run in orgs internal environment without relying on external spark release download address. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - Pass the Jenkins or GitHub Action - Manual test with and without env variables set in internal environment can't access internet. execute ``` mvn clean install -Dhadoop-3.2 -Phive-2.3 -Phadoop-cloud -Pmesos -Pyarn -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -PhPhive -pl sql/hive -am -DskipTests mvn clean install -Dhadoop-3.2 -Phive-2.3 -Phadoop-cloud -Pmesos -Pyarn -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -PhPhive -pl sql/hive -DwildcardSuites=org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite -Dtest=none ``` Without env ``` HiveExternalCatalogVersionsSuite: 19:50:35.123 WARN org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite: Failed to download Spark 3.0.1 from https://archive.apache.org/dist/spark/spark-3.0.1/spark-3.0.1-bin-hadoop3.2.tgz: Network is unreachable (connect failed) 19:50:35.126 WARN org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite: Failed to download Spark 3.0.1 from https://dist.apache.org/repos/dist/release/spark/spark-3.0.1/spark-3.0.1-bin-hadoop3.2.tgz: Network is unreachable (connect failed) org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite * ABORTED * Exception encountered when invoking run on a nested suite - Unable to download Spark 3.0.1 (HiveExternalCatalogVersionsSuite.scala:125) Run completed in 2 seconds, 669 milliseconds. Total number of tests run: 0 Suites: completed 1, aborted 1 Tests: succeeded 0, failed 0, canceled 0, ignored 0, pending 0 ``` With env ``` export SPARK_RELEASE_MIRROR=${spark-release.internal.com}/dist/release/ ``` ``` HiveExternalCatalogVersionsSuite - backward compatibility Run completed in 1 minute, 32 seconds. Total number of tests run: 1 Suites: completed 2, aborted 0 Tests: succeeded 1, failed 0, canceled 0, ignored 0, pending 0 All tests passed. ``` Closes #31294 from LuciferYang/SPARK-34202. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2021-01-23 08:02:52 -08:00
Max Gekk	e79c1cde1b	[SPARK-34138][SQL] Keep dependants cached while refreshing v1 tables ### What changes were proposed in this pull request? This PR changes cache refreshing of v1 tables in v1 commands. In particular, v1 table dependents are not removed from the cache after this PR. Comparing to current implementation, we just clear cached data of all dependents and keep them in the cache. So, the next actions will fill in the cached data of the original v1 table and its dependents. In more details: 1. Modified the `CatalogImpl.refreshTable()` method to use `recacheByPlan()` instead of `lookupCachedData()`, `uncacheQuery()` and `cacheQuery()`. Users can call this method via public API like `spark.catalog.refreshTable()`. 2. Rewritten the part in `CatalogImpl.refreshTable()` which was responsible for table meta-data refreshing because this code stopped to work properly after removing of the second `sparkSession.table(tableIdent)`. 3. Added new private method `invalidateCachedTable()` to `SessionCatalog`. Comparing to the existing `SessionCatalog.refreshTable`, it invalidates the relation cache only. If we called `SessionCatalog.refreshTable` from `CatalogImpl.refreshTable()`, we would refresh temporary and global temporary views twice (that could lead to refreshing file index twice). ### Why are the changes needed? 1. This should improve user experience with table/view caching. For example, let's imagine that an user has cached v1 table and cached view based on the table. And the user passed the table to external library which drops/renames/adds partitions in the v1 table. Unfortunately, the user gets the view uncached after that even he/she hasn't uncached the view explicitly. 2. To improve code maintenance. 3. To reduce the amount of calls to Hive external catalog. 4. Also this should speed up table recaching. 5. To have the same behavior as for v2 tables supported by https://github.com/apache/spark/pull/31172 ### Does this PR introduce _any_ user-facing change? From the view of the correctness of query results, there are no behavior changes but the changes might influence on consuming memory and query execution time. For example: Before: ```scala scala> sql("CREATE TABLE tbl (c int)") scala> sql("CACHE TABLE tbl") scala> sql("CREATE VIEW v AS SELECT * FROM tbl") scala> sql("CACHE TABLE v") scala> spark.catalog.isCached("v") res6: Boolean = true scala> spark.catalog.refreshTable("tbl") scala> spark.catalog.isCached("v") res8: Boolean = false ``` After: ```scala scala> spark.catalog.refreshTable("tbl") scala> spark.catalog.isCached("v") res8: Boolean = true ``` ### How was this patch tested? 1. Added new unit tests that create a view, a temporary view and a global temporary view on top of v1/v2 tables, and refresh the base table via `ALTER TABLE .. ADD/DROP/RENAME PARTITION`. 2. By running the unified test suites: ``` $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly AlterTableAddPartitionSuite" $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly AlterTableDropPartitionSuite" # build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableRenamePartitionSuite" ``` Closes #31206 from MaxGekk/refreshTable-recache-by-plan. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-01-21 13:03:24 +00:00
Ismaël Mejía	e9e81f798f	[SPARK-27733][CORE] Upgrade Avro to version 1.10.1 ### What changes were proposed in this pull request? Update Avro dependency to version 1.10.1 ### Why are the changes needed? To catch up multiple improvements of Avro as well as fix security issues on transitive dependencies. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Since there were no API changes required we just run the tests Closes #31232 from iemejia/SPARK-27733-avro-upgrade. Authored-by: Ismaël Mejía <iemejia@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2021-01-20 15:42:27 -08:00
yangjie01	d68612a008	[SPARK-34176][BUILD] Restore the independent mvn test ability of sql/hive module in Scala 2.13 ### What changes were proposed in this pull request? There is one Java UT error when testing sql/hive module independently in Scala 2.13 after SPARK-33212, the error message as follow: ``` [ERROR] Tests run: 2, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 20.353 s <<< FAILURE! - in org.apache.spark.sql.hive.JavaDataFrameSuite [ERROR] org.apache.spark.sql.hive.JavaDataFrameSuite.testUDAF Time elapsed: 18.548 s <<< ERROR! java.lang.NoClassDefFoundError: scala/collection/parallel/TaskSupport at org.apache.spark.sql.hive.JavaDataFrameSuite.checkAnswer(JavaDataFrameSuite.java:41) at org.apache.spark.sql.hive.JavaDataFrameSuite.testUDAF(JavaDataFrameSuite.java:92) Caused by: java.lang.ClassNotFoundException: scala.collection.parallel.TaskSupport at org.apache.spark.sql.hive.JavaDataFrameSuite.checkAnswer(JavaDataFrameSuite.java:41) at org.apache.spark.sql.hive.JavaDataFrameSuite.testUDAF(JavaDataFrameSuite.java:92) ``` This pr add a Scala-2.13 profile with dependency of `scala-parallel-collections_` to `sql/hive` module to fix the Java UT in Scala 2.13. ### Why are the changes needed? Recover the independent mvn test ability of sql/hive module in Scala 2.13. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? - Pass the Jenkins or GitHub Action - Manual test ``` dev/change-scala-version.sh 2.13 mvn clean install -Dhadoop-3.2 -Phive-2.3 -Phadoop-cloud -Pmesos -Pyarn -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive -Pscala-2.13 -pl sql/hive -am -DskipTests mvn test -Dhadoop-3.2 -Phive-2.3 -Phadoop-cloud -Pmesos -Pyarn -Pkinesis-asl -Phive-thriftserver -Pspark-ganglia-lgpl -Pkubernetes -Phive -Pscala-2.13 -pl sql/hive ``` Before ``` [ERROR] Tests run: 2, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 18.725 s <<< FAILURE! - in org.apache.spark.sql.hive.JavaDataFrameSuite [ERROR] org.apache.spark.sql.hive.JavaDataFrameSuite.testUDAF Time elapsed: 16.853 s <<< ERROR! java.lang.NoClassDefFoundError: scala/collection/parallel/TaskSupport at org.apache.spark.sql.hive.JavaDataFrameSuite.checkAnswer(JavaDataFrameSuite.java:41) at org.apache.spark.sql.hive.JavaDataFrameSuite.testUDAF(JavaDataFrameSuite.java:92) Caused by: java.lang.ClassNotFoundException: scala.collection.parallel.TaskSupport at org.apache.spark.sql.hive.JavaDataFrameSuite.checkAnswer(JavaDataFrameSuite.java:41) at org.apache.spark.sql.hive.JavaDataFrameSuite.testUDAF(JavaDataFrameSuite.java:92) [INFO] Running org.apache.spark.sql.hive.JavaMetastoreDataSourcesSuite 16:15:36.186 WARN org.apache.spark.sql.hive.test.TestHiveExternalCatalog: Couldn't find corresponding Hive SerDe for data source provider org.apache.spark.sql.json. Persisting data source table `default`.`javasavedtable` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive. 16:15:36.288 WARN org.apache.hadoop.hive.ql.session.SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory. 16:15:36.396 WARN org.apache.hadoop.hive.conf.HiveConf: HiveConf of name hive.internal.ss.authz.settings.applied.marker does not exist 16:15:36.397 WARN org.apache.hadoop.hive.conf.HiveConf: HiveConf of name hive.stats.jdbc.timeout does not exist 16:15:36.397 WARN org.apache.hadoop.hive.conf.HiveConf: HiveConf of name hive.stats.retries.wait does not exist [INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 3.481 s - in org.apache.spark.sql.hive.JavaMetastoreDataSourcesSuite [INFO] [INFO] Results: [INFO] [ERROR] Errors: [ERROR] JavaDataFrameSuite.testUDAF:92->checkAnswer:41 » NoClassDefFound scala/collect... [INFO] [ERROR] Tests run: 3, Failures: 0, Errors: 1, Skipped: 0 ``` After ``` [INFO] Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 19.287 s - in org.apache.spark.sql.hive.JavaDataFrameSuite [INFO] Running org.apache.spark.sql.hive.JavaMetastoreDataSourcesSuite 16:12:16.697 WARN org.apache.spark.sql.hive.test.TestHiveExternalCatalog: Couldn't find corresponding Hive SerDe for data source provider org.apache.spark.sql.json. Persisting data source table `default`.`javasavedtable` into Hive metastore in Spark SQL specific format, which is NOT compatible with Hive. 16:12:17.540 WARN org.apache.hadoop.hive.ql.session.SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory. 16:12:17.653 WARN org.apache.hadoop.hive.conf.HiveConf: HiveConf of name hive.internal.ss.authz.settings.applied.marker does not exist 16:12:17.653 WARN org.apache.hadoop.hive.conf.HiveConf: HiveConf of name hive.stats.jdbc.timeout does not exist 16:12:17.654 WARN org.apache.hadoop.hive.conf.HiveConf: HiveConf of name hive.stats.retries.wait does not exist [INFO] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 3.58 s - in org.apache.spark.sql.hive.JavaMetastoreDataSourcesSuite [INFO] [INFO] Results: [INFO] [INFO] Tests run: 3, Failures: 0, Errors: 0, Skipped: 0 ``` Closes #31259 from LuciferYang/SPARK-34176. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2021-01-20 15:33:31 -08:00
Max Gekk	00b444d5ed	[SPARK-34056][SQL][TESTS] Unify v1 and v2 ALTER TABLE .. RECOVER PARTITIONS tests ### What changes were proposed in this pull request? 1. Port DS V2 tests from `AlterTablePartitionV2SQLSuite ` to the test suite `v2.AlterTableRecoverPartitionsSuite`. 2. Port DS v1 tests from `DDLSuite` to `v1.AlterTableRecoverPartitionsSuiteBase`. ### Why are the changes needed? To improve test coverage. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running new test suites: ``` $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly AlterTableRecoverPartitionsParserSuite" $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly AlterTableRecoverPartitionsSuite" $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *CatalogedDDLSuite" ``` Closes #31105 from MaxGekk/unify-recover-partitions-tests. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-01-20 01:49:31 +00:00
Yuming Wang	030639f456	[SPARK-34119][SQL] Keep necessary stats after partition pruning ### What changes were proposed in this pull request? This pr keep necessary stats after partition pruning. ### Why are the changes needed? Improve query performance. It will push down aggregate since SPARK-34081 because it can be planed as BroadcastHashJoin. But it lacks column statistics after [`PruneFileSourcePartitions`](`d0c83f372b/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/PruneFileSourcePartitions.scala (L102-L103)`). Therefore, it will eventually be planned as SortMergeJoin. Please see the log: ``` join.right.stats: org.apache.spark.sql.catalyst.optimizer.PushDownPredicates: Statistics(sizeInBytes=348.8 KiB, rowCount=1.79E+4) join.right.stats: org.apache.spark.sql.execution.datasources.PruneFileSourcePartitions: Statistics(sizeInBytes=1414.2 EiB) ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Unit test and benchmark test SQL \| Before this PR(Seconds) \| After this PR(Seconds) -- \| -- \| -- q14a \| 594 \| 384 q14b \| 600 \| 402 This change will not affect the results of `PlanStabilitySuite`, because it does not have partition column. Closes #31205 from wangyum/SPARK-34119. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-01-19 06:09:16 +00:00
Max Gekk	bea10a6274	[SPARK-34153][SQL] Remove unused `getRawTable()` from `HiveExternalCatalog.alterPartitions()` ### What changes were proposed in this pull request? Remove unused call of `getRawTable()` from `HiveExternalCatalog.alterPartitions()`. ### Why are the changes needed? It reduces the number of calls to Hive External catalog. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running the modified test suite: ``` $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *AlterTableRenamePartitionSuite" ``` Closes #31234 from MaxGekk/remove-getRawTable-from-alterPartitions. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2021-01-19 11:42:33 +09:00
Max Gekk	dee596e3ef	[SPARK-34027][SQL] Refresh cache in `ALTER TABLE .. RECOVER PARTITIONS` ### What changes were proposed in this pull request? Invoke `refreshTable()` from `CatalogImpl` which refreshes the cache in v1 `ALTER TABLE .. RECOVER PARTITIONS`. ### Why are the changes needed? This fixes the issues portrayed by the example: ```sql spark-sql> create table tbl (col int, part int) using parquet partitioned by (part); spark-sql> insert into tbl partition (part=0) select 0; spark-sql> cache table tbl; spark-sql> select * from tbl; 0 0 spark-sql> show table extended like 'tbl' partition(part=0); default tbl false Partition Values: [part=0] Location: file:/Users/maximgekk/proj/recover-partitions-refresh-cache/spark-warehouse/tbl/part=0 ... ``` Create new partition by copying the existing one: ``` $ cp -r /Users/maximgekk/proj/recover-partitions-refresh-cache/spark-warehouse/tbl/part=0 /Users/maximgekk/proj/recover-partitions-refresh-cache/spark-warehouse/tbl/part=1 ``` ```sql spark-sql> alter table tbl recover partitions; spark-sql> select * from tbl; 0 0 ``` The last query must return `0 1` since it has been recovered by `ALTER TABLE .. RECOVER PARTITIONS`. ### Does this PR introduce _any_ user-facing change? Yes. After the changes for the example above: ```sql ... spark-sql> alter table tbl recover partitions; spark-sql> select * from tbl; 0 0 0 1 ``` ### How was this patch tested? By running the affected test suite: ``` $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *CachedTableSuite" ``` Closes #31066 from MaxGekk/recover-partitions-refresh-cache. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2021-01-18 13:52:39 +00:00
yangjie01	163afa6fcf	[SPARK-34151][SQL] Replaces `java.io.File.toURL` with `java.io.File.toURI.toURL` ### What changes were proposed in this pull request? `java.io.FIle.toURL` method does not automatically escape characters that are illegal in URLs. Java doc recommended that new code convert an abstract pathname into a URL by first converting it into a URI, via the `toURI` method, and then converting the URI into a URL via the `URI.toURL` method. So this pr cleaned up the relevant cases in Spark code. ### Why are the changes needed? Cleaning up `Deprecated` Java API usage. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass the Jenkins or GitHub Action Closes #31230 from LuciferYang/SPARK-34151. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2021-01-18 21:39:00 +09:00
Yuming Wang	c87b0085c9	[SPARK-33696][BUILD][SQL] Upgrade built-in Hive to 2.3.8 ### What changes were proposed in this pull request? Hive 2.3.8 changes: HIVE-19662: Upgrade Avro to 1.8.2 HIVE-24324: Remove deprecated API usage from Avro HIVE-23980: Shade Guava from hive-exec in Hive 2.3 HIVE-24436: Fix Avro NULL_DEFAULT_VALUE compatibility issue HIVE-24512: Exclude calcite in packaging. HIVE-22708: Fix for HttpTransport to replace String.equals HIVE-24551: Hive should include transitive dependencies from calcite after shading it HIVE-24553: Exclude calcite from test-jar dependency of hive-exec ### Why are the changes needed? Upgrade Avro and Parquet to latest version. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing test add test try to upgrade Parquet to 1.11.1 and Avro to 1.10.1: https://github.com/apache/spark/pull/30517 Closes #30657 from wangyum/SPARK-33696. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2021-01-17 21:54:35 -08:00
Chao Sun	b6f46ca297	[SPARK-33212][BUILD] Upgrade to Hadoop 3.2.2 and move to shaded clients for Hadoop 3.x profile ### What changes were proposed in this pull request? This: 1. switches Spark to use shaded Hadoop clients, namely hadoop-client-api and hadoop-client-runtime, for Hadoop 3.x. 2. upgrade built-in version for Hadoop 3.x to Hadoop 3.2.2 Note that for Hadoop 2.7, we'll still use the same modules such as hadoop-client. In order to still keep default Hadoop profile to be hadoop-3.2, this defines the following Maven properties: ``` hadoop-client-api.artifact hadoop-client-runtime.artifact hadoop-client-minicluster.artifact ``` which default to: ``` hadoop-client-api hadoop-client-runtime hadoop-client-minicluster ``` but all switch to `hadoop-client` when the Hadoop profile is hadoop-2.7. A side affect from this is we'll import the same dependency multiple times. For this I have to disable Maven enforcer `banDuplicatePomDependencyVersions`. Besides above, there are the following changes: - explicitly add a few dependencies which are imported via transitive dependencies from Hadoop jars, but are removed from the shaded client jars. - removed the use of `ProxyUriUtils.getPath` from `ApplicationMaster` which is a server-side/private API. - modified `IsolatedClientLoader` to exclude `hadoop-auth` jars when Hadoop version is 3.x. This change should only matter when we're not sharing Hadoop classes with Spark (which is _mostly_ used in tests). ### Why are the changes needed? Hadoop 3.2.2 is released with new features and bug fixes, so it's good for the Spark community to adopt it. However, latest Hadoop versions starting from Hadoop 3.2.1 have upgraded to use Guava 27+. In order to resolve Guava conflicts, this takes the approach by switching to shaded client jars provided by Hadoop. This also has the benefits of avoid pulling other 3rd party dependencies from Hadoop side so as to avoid more potential future conflicts. ### Does this PR introduce _any_ user-facing change? When people use Spark with `hadoop-provided` option, they should make sure class path contains `hadoop-client-api` and `hadoop-client-runtime` jars. In addition, they may need to make sure these jars appear before other Hadoop jars in the order. Otherwise, classes may be loaded from the other non-shaded Hadoop jars and cause potential conflicts. ### How was this patch tested? Relying on existing tests. Closes #30701 from sunchao/test-hadoop-3.2.2. Authored-by: Chao Sun <sunchao@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2021-01-15 14:06:50 -08:00
yangjie01	9e33d49b5b	[SPARK-33346][CORE][SQL][MLLIB][DSTREAM][K8S] Change the never changed 'var' to 'val' ### What changes were proposed in this pull request? Some local variables are declared as `var`, but they are never reassigned and should be declared as `val`, so this pr turn these from `var` to `val` except for `mockito` related cases. ### Why are the changes needed? Use `val` instead of `var` when possible. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass the Jenkins or GitHub Action Closes #31142 from LuciferYang/SPARK-33346. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2021-01-15 08:47:02 -06:00

1 2 3 4 5 ...

2537 commits