ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
CC Highman	d338af3101	[SPARK-31962][SQL] Provide modifiedAfter and modifiedBefore options when filtering from a batch-based file data source ### What changes were proposed in this pull request? Two new options, _modifiiedBefore_ and _modifiedAfter_, is provided expecting a value in 'YYYY-MM-DDTHH:mm:ss' format. _PartioningAwareFileIndex_ considers these options during the process of checking for files, just before considering applied _PathFilters_ such as `pathGlobFilter.` In order to filter file results, a new PathFilter class was derived for this purpose. General house-keeping around classes extending PathFilter was performed for neatness. It became apparent support was needed to handle multiple potential path filters. Logic was introduced for this purpose and the associated tests written. ### Why are the changes needed? When loading files from a data source, there can often times be thousands of file within a respective file path. In many cases I've seen, we want to start loading from a folder path and ideally be able to begin loading files having modification dates past a certain point. This would mean out of thousands of potential files, only the ones with modification dates greater than the specified timestamp would be considered. This saves a ton of time automatically and reduces significant complexity managing this in code. ### Does this PR introduce _any_ user-facing change? This PR introduces an option that can be used with batch-based Spark file data sources. A documentation update was made to reflect an example and usage of the new data source option. Example Usages _Load all CSV files modified after date:_ `spark.read.format("csv").option("modifiedAfter","2020-06-15T05:00:00").load()` _Load all CSV files modified before date:_ `spark.read.format("csv").option("modifiedBefore","2020-06-15T05:00:00").load()` _Load all CSV files modified between two dates:_ `spark.read.format("csv").option("modifiedAfter","2019-01-15T05:00:00").option("modifiedBefore","2020-06-15T05:00:00").load() ` ### How was this patch tested? A handful of unit tests were added to support the positive, negative, and edge case code paths. It's also live in a handful of our Databricks dev environments. (quoted from cchighman) Closes #30411 from HeartSaVioR/SPARK-31962. Lead-authored-by: CC Highman <christopher.highman@microsoft.com> Co-authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com> Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>	2020-11-23 08:30:41 +09:00
angerszhu	d7f4b2ad50	[SPARK-28704][SQL][TEST] Add back Skiped HiveExternalCatalogVersionsSuite in HiveSparkSubmitSuite at JDK9+ ### What changes were proposed in this pull request? We skip test HiveExternalCatalogVersionsSuite when testing with JAVA_9 or later because our previous version does not support JAVA_9 or later. We now add it back since we have a version supports JAVA_9 or later. ### Why are the changes needed? To recover test coverage. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Check CI logs. Closes #30451 from AngersZhuuuu/SPARK-28704. Authored-by: angerszhu <angers.zhu@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-11-22 10:29:15 -08:00
Gustavo Martin Morcuende	517b810dfa	[SPARK-33463][SQL] Keep Job Id during incremental collect in Spark Thrift Server ### What changes were proposed in this pull request? When enabling spark.sql.thriftServer.incrementalCollect Job Ids get lost and tracing queries in Spark Thrift Server ends up being too complicated. ### Why are the changes needed? Because it will make easier tracing Spark Thrift Server queries. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? The current tests are enough. No need of more tests. Closes #30390 from gumartinm/master. Authored-by: Gustavo Martin Morcuende <gu.martinm@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-11-21 08:39:16 -08:00
Dongjoon Hyun	cf7490112a	Revert "[SPARK-28704][SQL][TEST] Add back Skiped HiveExternalCatalogVersionsSuite in HiveSparkSubmitSuite at JDK9+" This reverts commit `47326ac1c6`.	2020-11-20 19:01:58 -08:00
Max Gekk	530c0a8e28	[SPARK-33505][SQL][TESTS] Fix adding new partitions by INSERT INTO `InMemoryPartitionTable` ### What changes were proposed in this pull request? 1. Add a hook method to `addPartitionKey()` of `InMemoryTable` which is called per every row. 2. Override `addPartitionKey()` in `InMemoryPartitionTable`, and add partition key every time when new row is inserted to the table. ### Why are the changes needed? To be able to write unified tests for datasources V1 and V2. Currently, INSERT INTO a V1 table creates partitions but the same doesn't work for the custom catalog `InMemoryPartitionTableCatalog` used in DSv2 tests. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running the affected test suite `DataSourceV2SQLSuite`. Closes #30449 from MaxGekk/insert-into-InMemoryPartitionTable. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-11-20 18:41:25 -08:00
Jungtaek Lim (HeartSaVioR)	67c6ed9068	[SPARK-33223][SS][FOLLOWUP] Clarify the meaning of "number of rows dropped by watermark" in SS UI page ### What changes were proposed in this pull request? This PR fixes the representation to clarify the meaning of "number of rows dropped by watermark" in SS UI page. ### Why are the changes needed? `Aggregated Number Of State Rows Dropped By Watermark` says that the dropped rows are from the state, whereas they're not. We say "evicted from the state" for the case, which is "normal" to emit outputs and reduce memory usage of the state. The metric actually represents the number of "input" rows dropped by watermark, and the meaning of "input" is relative to the "stateful operator". That's a bit confusing as we normally think "input" as "input from source" whereas it's not. ### Does this PR introduce _any_ user-facing change? Yes, UI element & tooltip change. ### How was this patch tested? Only text change in UI, so we know how thing will be changed intuitively. Closes #30439 from HeartSaVioR/SPARK-33223-FOLLOWUP. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com> Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>	2020-11-21 10:27:00 +09:00
anchovYu	de0f50abf4	[SPARK-32670][SQL] Group exception messages in Catalyst Analyzer in one file ### What changes were proposed in this pull request? Group all messages of `AnalysisExcpetions` created and thrown directly in org.apache.spark.sql.catalyst.analysis.Analyzer in one file. * Create a new object: `org.apache.spark.sql.CatalystErrors` with many exception-creating functions. * When the `Analyzer` wants to create and throw a new `AnalysisException`, call functions of `CatalystErrors` ### Why are the changes needed? This is the sample PR that groups exception messages together in several files. It will largely help with standardization of error messages and its maintenance. ### Does this PR introduce _any_ user-facing change? No. Error messages remain unchanged. ### How was this patch tested? No new tests - pass all original tests to make sure it doesn't break any existing behavior. ### Naming of exception functions All function names ended with `Error`. * For specific errors like `groupingIDMismatch` and `groupingColInvalid`, directly use them as name, just like `groupingIDMismatchError` and `groupingColInvalidError`. * For generic errors like `dataTypeMismatch`, * if confident with the context, prefix and condition can be added, like `pivotValDataTypeMismatchError` * if not sure about the context, add a `For` suffix of the specific component that this exception is related to, like `dataTypeMismatchForDeserializerError` Closes #29497 from anchovYu/32670. Lead-authored-by: anchovYu <aureole@sjtu.edu.cn> Co-authored-by: anchovYu <xyyu15@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-11-21 08:33:39 +09:00
Chao Sun	2479778934	[SPARK-33492][SQL] DSv2: Append/Overwrite/ReplaceTable should invalidate cache ### What changes were proposed in this pull request? This adds changes in the following places: - logic to also refresh caches referencing the target table in v2 `AppendDataExec`, `OverwriteByExpressionExec`, `OverwritePartitionsDynamicExec`, as well as their v1 fallbacks `AppendDataExecV1` and `OverwriteByExpressionExecV1`. - logic to invalidate caches referencing the target table in v2 `ReplaceTableAsSelectExec` and its atomic version `AtomicReplaceTableAsSelectExec`. These are only supported in v2 at the moment though. In addition to the above, in order to test the v1 write fallback behavior, I extended `InMemoryTableWithV1Fallback` to also support batch reads. ### Why are the changes needed? Currently in DataSource v2 we don't refresh or invalidate caches referencing the target table when the table content is changed by operations such as append, overwrite, or replace table. This is different from DataSource v1, and could potentially cause data correctness issue if the staled caches are queried later. ### Does this PR introduce _any_ user-facing change? Yes. Now When a data source v2 is cached (either directly or indirectly), all the relevant caches will be refreshed or invalidated if the table is replaced. ### How was this patch tested? Added unit tests for the new code path. Closes #30429 from sunchao/SPARK-33492. Authored-by: Chao Sun <sunchao@apple.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-11-20 14:59:56 -08:00
angerszhu	47326ac1c6	[SPARK-28704][SQL][TEST] Add back Skiped HiveExternalCatalogVersionsSuite in HiveSparkSubmitSuite at JDK9+ ### What changes were proposed in this pull request? We skip test HiveExternalCatalogVersionsSuite when testing with JAVA_9 or later because our previous version does not support JAVA_9 or later. We now add it back since we have a version supports JAVA_9 or later. ### Why are the changes needed? To recover test coverage. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Check CI logs. Closes #30428 from AngersZhuuuu/SPARK-28704. Authored-by: angerszhu <angers.zhu@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-11-20 08:40:14 -08:00
ulysses	3384bda453	[SPARK-33468][SQL] ParseUrl in ANSI mode should fail if input string is not a valid url ### What changes were proposed in this pull request? With `ParseUrl`, instead of return null we throw exception if input string is not a vaild url. ### Why are the changes needed? For ANSI mode. ### Does this PR introduce _any_ user-facing change? Yes, user will get exception if `set spark.sql.ansi.enabled=true`. ### How was this patch tested? Add test. Closes #30399 from ulysses-you/SPARK-33468. Lead-authored-by: ulysses <youxiduo@weidian.com> Co-authored-by: ulysses-you <youxiduo@weidian.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-11-20 13:23:08 +00:00
Max Gekk	870d409533	[SPARK-32512][SQL][TESTS][FOLLOWUP] Remove duplicate tests for ALTER TABLE .. PARTITIONS from DataSourceV2SQLSuite ### What changes were proposed in this pull request? Remove tests from `DataSourceV2SQLSuite` that were copied to `AlterTablePartitionV2SQLSuite` by https://github.com/apache/spark/pull/29339. ### Why are the changes needed? - To reduce tests execution time - To improve test maintenance ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running the modified tests: ``` $ build/sbt "test:testOnly DataSourceV2SQLSuite" $ build/sbt "test:testOnly AlterTablePartitionV2SQLSuite" ``` Closes #30444 from MaxGekk/dedup-tests-AlterTablePartitionV2SQLSuite. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-11-20 12:53:45 +00:00
Gabor Somogyi	883a213a8f	[MINOR] Structured Streaming statistics page indent fix ### What changes were proposed in this pull request? Structured Streaming statistics page code contains an indentation issue. This PR fixes it. ### Why are the changes needed? Indent fix. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing unit tests. Closes #30434 from gaborgsomogyi/STAT-INDENT-FIX. Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-11-19 13:36:45 -08:00
Chao Sun	6da8ade5f4	[SPARK-33045][SQL][FOLLOWUP] Fix build failure with Scala 2.13 ### What changes were proposed in this pull request? Explicitly convert `scala.collection.mutable.Buffer` to `Seq`. In Scala 2.13 `Seq` is an alias of `scala.collection.immutable.Seq` instead of `scala.collection.Seq`. ### Why are the changes needed? Without the change build with Scala 2.13 fails with the following: ``` [error] /home/runner/work/spark/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala:1417:41: type mismatch; [error] found : scala.collection.mutable.Buffer[org.apache.spark.unsafe.types.UTF8String] [error] required: Seq[org.apache.spark.unsafe.types.UTF8String] [error] case null => LikeAll(e, patterns) [error] ^ [error] /home/runner/work/spark/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala:1418:41: type mismatch; [error] found : scala.collection.mutable.Buffer[org.apache.spark.unsafe.types.UTF8String] [error] required: Seq[org.apache.spark.unsafe.types.UTF8String] [error] case _ => NotLikeAll(e, patterns) [error] ^ ``` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? N/A Closes #30431 from sunchao/SPARK-33045-followup. Authored-by: Chao Sun <sunchao@apple.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-11-19 12:42:33 -08:00
gengjiaan	3695e997d5	[SPARK-33045][SQL] Support build-in function like_all and fix StackOverflowError issue ### What changes were proposed in this pull request? Spark already support `LIKE ALL` syntax, but it will throw `StackOverflowError` if there are many elements(more than 14378 elements). We should implement built-in function for LIKE ALL to fix this issue. Why the stack overflow can happen in the current approach ? The current approach uses reduceLeft to connect each `Like(e, p)`, this will lead the the call depth of the thread is too large, causing `StackOverflowError` problems. Why the fix in this PR can avoid the error? This PR support built-in function for `LIKE ALL` and avoid this issue. ### Why are the changes needed? 1.Fix the `StackOverflowError` issue. 2.Support built-in function `like_all`. ### Does this PR introduce _any_ user-facing change? 'No'. ### How was this patch tested? Jenkins test. Closes #29999 from beliefer/SPARK-33045-like_all. Lead-authored-by: gengjiaan <gengjiaan@360.cn> Co-authored-by: beliefer <beliefer@163.com> Co-authored-by: Jiaan Geng <beliefer@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-11-19 16:56:21 +00:00
ulysses	21b13506cd	[SPARK-33442][SQL] Change Combine Limit to Eliminate limit using max row ### What changes were proposed in this pull request? Change `CombineLimits` name to `EliminateLimits` and add check if `Limit` child max row <= limit. ### Why are the changes needed? In Add-hoc scene, we always add limit for the query if user have no special limit value, but not all limit is nesessary. A general negative example is ``` select count(*) from t limit 100000; ``` It will be great if we can eliminate limit at Spark side. Also, we make a benchmark for this case ``` runBenchmark("Sort and Limit") { val N = 100000 val benchmark = new Benchmark("benchmark sort and limit", N) benchmark.addCase("TakeOrderedAndProject", 3) { _ => spark.range(N).toDF("c").repartition(200).sort("c").take(200000) } benchmark.addCase("Sort And Limit", 3) { _ => withSQLConf("spark.sql.execution.topKSortFallbackThreshold" -> "-1") { spark.range(N).toDF("c").repartition(200).sort("c").take(200000) } } benchmark.addCase("Sort", 3) { _ => spark.range(N).toDF("c").repartition(200).sort("c").collect() } benchmark.run() } ``` and the result is ``` Java HotSpot(TM) 64-Bit Server VM 1.8.0_191-b12 on Mac OS X 10.15.6 Intel(R) Core(TM) i5-5257U CPU 2.70GHz benchmark sort and limit: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------ TakeOrderedAndProject 1833 2259 382 0.1 18327.1 1.0X Sort And Limit 1417 1658 285 0.1 14167.5 1.3X Sort 1324 1484 225 0.1 13238.3 1.4X ``` It shows that it makes sense to replace `TakeOrderedAndProjectExec` with `Sort + Project`. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Add test. Closes #30368 from ulysses-you/SPARK-33442. Authored-by: ulysses <youxiduo@weidian.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-11-19 13:31:10 +00:00
allisonwang-db	a03c540cf7	[SPARK-33472][SQL] Adjust RemoveRedundantSorts rule order ### What changes were proposed in this pull request? This PR switched the order for the rule `RemoveRedundantSorts` and `EnsureRequirements` so that `EnsureRequirements` will be invoked before `RemoveRedundantSorts` to avoid IllegalArgumentException when instantiating PartitioningCollection. ### Why are the changes needed? `RemoveRedundantSorts` rule uses SparkPlan's `outputPartitioning` to check whether a sort node is redundant. Currently, it is added before `EnsureRequirements`. Since `PartitioningCollection` requires left and right partitioning to have the same number of partitions, which is not necessarily true before applying `EnsureRequirements`, the rule can fail with the following exception: ``` IllegalArgumentException: requirement failed: PartitioningCollection requires all of its partitionings have the same numPartitions. ``` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit test Closes #30373 from allisonwang-db/sort-follow-up. Authored-by: allisonwang-db <66282705+allisonwang-db@users.noreply.github.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-11-19 13:29:01 +00:00
allisonwang-db	ef2638c3e3	[SPARK-33183][SQL][FOLLOW-UP] Update rule RemoveRedundantSorts config version ### What changes were proposed in this pull request? This PR is a follow up for #30093 to updates the config `spark.sql.execution.removeRedundantSorts` version to 2.4.8. ### Why are the changes needed? To update the rule version it has been backported to 2.4. #30194 ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? N/A Closes #30420 from allisonwang-db/spark-33183-follow-up. Authored-by: allisonwang-db <66282705+allisonwang-db@users.noreply.github.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-11-19 00:12:22 -08:00
Dongjoon Hyun	d5e7bd0cc4	[SPARK-33483][INFRA][TESTS] Fix rat exclusion patterns and add a LICENSE ### What changes were proposed in this pull request? This PR fixes the RAT exclusion rule which was originated from SPARK-1144 (Apache Spark 1.0) ### Why are the changes needed? This prevents the situation like https://github.com/apache/spark/pull/30415. Currently, it missed `catalog` directory due to `.log` rule. ``` $ dev/check-license Could not find Apache license headers in the following files: !????? /Users/dongjoon/APACHE/spark-merge/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/MetadataColumn.java !????? /Users/dongjoon/APACHE/spark-merge/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/SupportsMetadataColumns.java ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CI with the new rule. Closes #30418 from dongjoon-hyun/SPARK-RAT. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-11-18 23:59:11 -08:00
Prakhar Jain	0b0fb70b09	[SPARK-33400][SQL] Normalize sameOrderExpressions in SortOrder to avoid unnecessary sort operations ### What changes were proposed in this pull request? This pull request tries to normalize the SortOrder properly to prevent unnecessary sort operators. Currently the sameOrderExpressions are not normalized as part of AliasAwareOutputOrdering. Example: consider this join of three tables: """ \|SELECT t2id, t3.id as t3id \|FROM ( \| SELECT t1.id as t1id, t2.id as t2id \| FROM t1, t2 \| WHERE t1.id = t2.id \|) t12, t3 \|WHERE t1id = t3.id """. The plan for this looks like: (8) Project [t2id#1059L, id#1004L AS t3id#1060L] +- (8) SortMergeJoin [t2id#1059L], [id#1004L], Inner :- (5) Sort [t2id#1059L ASC NULLS FIRST ], false, 0 <----------------------------- : +- (5) Project [id#1000L AS t2id#1059L] : +- (5) SortMergeJoin [id#996L], [id#1000L], Inner : :- (2) Sort [id#996L ASC NULLS FIRST ], false, 0 : : +- Exchange hashpartitioning(id#996L, 5), true, [id=#1426] : : +- (1) Range (0, 10, step=1, splits=2) : +- (4) Sort [id#1000L ASC NULLS FIRST ], false, 0 : +- Exchange hashpartitioning(id#1000L, 5), true, [id=#1432] : +- (3) Range (0, 20, step=1, splits=2) +- (7) Sort [id#1004L ASC NULLS FIRST ], false, 0 +- Exchange hashpartitioning(id#1004L, 5), true, [id=#1443] +- *(6) Range (0, 30, step=1, splits=2) In this plan, the marked sort node could have been avoided as the data is already sorted on "t2.id" by the lower SortMergeJoin. ### Why are the changes needed? To remove unneeded Sort operators. ### Does this PR introduce any user-facing change? No ### How was this patch tested? New UT added. Closes #30302 from prakharjain09/SPARK-33400-sortorder. Authored-by: Prakhar Jain <prakharjain09@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-11-19 06:25:37 +00:00
Yuming Wang	014e1fbb3a	[SPARK-27421][SQL] Fix filter for int column and value class java.lang.String when pruning partition column ### What changes were proposed in this pull request? This pr fix filter for int column and value class java.lang.String when pruning partition column. How to reproduce this issue: ```scala spark.sql("CREATE table test (name STRING) partitioned by (id int) STORED AS PARQUET") spark.sql("CREATE VIEW test_view as select cast(id as string) as id, name from test") spark.sql("SELECT * FROM test_view WHERE id = '0'").explain ``` ``` 20/11/15 06:19:01 INFO audit: ugi=root ip=unknown-ip-addr cmd=get_partitions_by_filter : db=default tbl=test 20/11/15 06:19:01 INFO MetaStoreDirectSql: Unable to push down SQL filter: Cannot push down filter for int column and value class java.lang.String 20/11/15 06:19:01 ERROR SparkSQLDriver: Failed in [SELECT * FROM test_view WHERE id = '0'] java.lang.RuntimeException: Caught Hive MetaException attempting to get partition metadata by filter from Hive. You can set the Spark configuration setting spark.sql.hive.manageFilesourcePartitions to false to work around this problem, however this will result in degraded performance. Please report a bug: https://issues.apache.org/jira/browse/SPARK at org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:828) at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$getPartitionsByFilter$1(HiveClientImpl.scala:745) at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:294) at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:227) at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:226) at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:276) at org.apache.spark.sql.hive.client.HiveClientImpl.getPartitionsByFilter(HiveClientImpl.scala:743) ``` ### Why are the changes needed? Fix bug. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Unit test. Closes #30380 from wangyum/SPARK-27421. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Yuming Wang <yumwang@ebay.com>	2020-11-19 14:01:42 +08:00
yangjie01	e3058ba17c	[SPARK-33441][BUILD] Add unused-imports compilation check and remove all unused-imports ### What changes were proposed in this pull request? This pr add a new Scala compile arg to `pom.xml` to defense against new unused imports: - `-Ywarn-unused-import` for Scala 2.12 - `-Wconf:cat=unused-imports:e` for Scala 2.13 The other fIles change are remove all unused imports in Spark code ### Why are the changes needed? Cleanup code and add guarantee to defense against new unused imports ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass the Jenkins or GitHub Action Closes #30351 from LuciferYang/remove-imports-core-module. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-11-19 14:20:39 +09:00
Ryan Blue	66a76378cf	[SPARK-31255][SQL][FOLLOWUP] Add missing license headers ### What changes were proposed in this pull request? Add missing license headers for new files added in #28027. ### Why are the changes needed? To fix licenses. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? This is a purely non-functional change. Closes #30415 from rdblue/license-headers. Authored-by: Ryan Blue <blue@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-11-18 19:18:28 -08:00
Liang-Chi Hsieh	e518008ca9	[SPARK-33473][SQL] Extend interpreted subexpression elimination to other interpreted projections ### What changes were proposed in this pull request? Similar to `InterpretedUnsafeProjection`, this patch proposes to extend interpreted subexpression elimination to `InterpretedMutableProjection` and `InterpretedSafeProjection`. ### Why are the changes needed? Enabling subexpression elimination can improve the performance of interpreted projections, as shown in `InterpretedUnsafeProjection`. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit test. Closes #30406 from viirya/SPARK-33473. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-11-18 18:58:06 -08:00
Liang-Chi Hsieh	97d2cee4af	[SPARK-33427][SQL][FOLLOWUP] Prevent test flakyness in SubExprEvaluationRuntimeSuite ### What changes were proposed in this pull request? This followup is to prevent possible test flakyness of `SubExprEvaluationRuntimeSuite`. ### Why are the changes needed? Because HashMap doesn't guarantee the order, in `proxyExpressions` the proxy expression id is not deterministic. So in `SubExprEvaluationRuntimeSuite` we should not test against it. ### Does this PR introduce _any_ user-facing change? No, dev only. ### How was this patch tested? Unit test. Closes #30414 from viirya/SPARK-33427-followup. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>	2020-11-18 18:35:11 -08:00
Gengliang Wang	9a4c79073b	[SPARK-33354][SQL] New explicit cast syntax rules in ANSI mode ### What changes were proposed in this pull request? In section 6.13 of the ANSI SQL standard, there are syntax rules for valid combinations of the source and target data types. ![image](https://user-images.githubusercontent.com/1097932/98212874-17356f80-1ef9-11eb-8f2b-385f32db404a.png) Comparing the ANSI CAST syntax rules with the current default behavior of Spark: ![image](https://user-images.githubusercontent.com/1097932/98789831-b7870a80-23b7-11eb-9b5f-469a42e0ee4a.png) To make Spark's ANSI mode more ANSI SQL Compatible，I propose to disallow the following casting in ANSI mode: ``` TimeStamp <=> Boolean Date <=> Boolean Numeric <=> Timestamp Numeric <=> Date Numeric <=> Binary String <=> Array String <=> Map String <=> Struct ``` The following castings are considered invalid in ANSI SQL standard, but they are quite straight forward. Let's Allow them for now ``` Numeric <=> Boolean String <=> Binary ``` ### Why are the changes needed? Better ANSI SQL compliance ### Does this PR introduce _any_ user-facing change? Yes, the following castings will not be allowed in ANSI mode: ``` TimeStamp <=> Boolean Date <=> Boolean Numeric <=> Timestamp Numeric <=> Date Numeric <=> Binary String <=> Array String <=> Map String <=> Struct ``` ### How was this patch tested? Unit test The ANSI Compliance doc preview: ![image](https://user-images.githubusercontent.com/1097932/98946017-2cd20880-24a8-11eb-8161-65749bfdd03a.png) Closes #30260 from gengliangwang/ansiCanCast. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2020-11-19 09:23:36 +09:00
Ryan Blue	1df69f7e32	[SPARK-31255][SQL] Add SupportsMetadataColumns to DSv2 ### What changes were proposed in this pull request? This adds support for metadata columns to DataSourceV2. If a source implements `SupportsMetadataColumns` it must also implement `SupportsPushDownRequiredColumns` to support projecting those columns. The analyzer is updated to resolve metadata columns from `LogicalPlan.metadataOutput`, and this adds a rule that will add metadata columns to the output of `DataSourceV2Relation` if one is used. ### Why are the changes needed? This is the solution discussed for exposing additional data in the Kafka source. It is also needed for a generic `MERGE INTO` plan. ### Does this PR introduce any user-facing change? Yes. Users can project additional columns from sources that implement the new API. This also updates `DescribeTableExec` to show metadata columns. ### How was this patch tested? Will include new unit tests. Closes #28027 from rdblue/add-dsv2-metadata-columns. Authored-by: Ryan Blue <blue@apache.org> Signed-off-by: Burak Yavuz <brkyvz@gmail.com>	2020-11-18 14:07:51 -08:00
Chao Sun	27cd945c15	[SPARK-32381][CORE][SQL][FOLLOWUP] More cleanup on HadoopFSUtils ### What changes were proposed in this pull request? This PR is a follow-up of #29471 and does the following improvements for `HadoopFSUtils`: 1. Removes the extra `filterFun` from the listing API and combines it with the `filter`. 2. Removes `SerializableBlockLocation` and `SerializableFileStatus` given that `BlockLocation` and `FileStatus` are already serializable. 3. Hides the `isRootLevel` flag from the top-level API. ### Why are the changes needed? Main purpose is to simplify the logic within `HadoopFSUtils` as well as cleanup the API. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing unit tests (e.g., `FileIndexSuite`) Closes #29959 from sunchao/hadoop-fs-utils-followup. Authored-by: Chao Sun <sunchao@apple.com> Signed-off-by: Holden Karau <hkarau@apple.com>	2020-11-18 12:39:00 -08:00
Gengliang Wang	a180e02842	[SPARK-32852][SQL][DOC][FOLLOWUP] Revise the documentation of spark.sql.hive.metastore.jars ### What changes were proposed in this pull request? This is a follow-up for https://github.com/apache/spark/pull/29881. It revises the documentation of the configuration `spark.sql.hive.metastore.jars`. ### Why are the changes needed? Fix grammatical error in the doc. Also, make it more clear that the configuration is effective only when `spark.sql.hive.metastore.jars` is set as `path` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Just doc changes. Closes #30407 from gengliangwang/reviseJarPathDoc. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>	2020-11-18 22:09:40 +08:00
Bryan Cutler	8e2a0bdce7	[SPARK-24554][PYTHON][SQL] Add MapType support for PySpark with Arrow ### What changes were proposed in this pull request? This change adds MapType support for PySpark with Arrow, if using pyarrow >= 2.0.0. ### Why are the changes needed? MapType was previous unsupported with Arrow. ### Does this PR introduce _any_ user-facing change? User can now enable MapType for `createDataFrame()`, `toPandas()` with Arrow optimization, and with Pandas UDFs. ### How was this patch tested? Added new PySpark tests for createDataFrame(), toPandas() and Scalar Pandas UDFs. Closes #30393 from BryanCutler/arrow-add-MapType-SPARK-24554. Authored-by: Bryan Cutler <cutlerb@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-11-18 21:18:19 +09:00
Liang-Chi Hsieh	7f3d99a8a5	[MINOR][SQL][DOCS] Update schema_of_csv and schema_of_json doc ### What changes were proposed in this pull request? This minor PR updates the docs of `schema_of_csv` and `schema_of_json`. They allow foldable string column instead of a string literal now. ### Why are the changes needed? The function doc of `schema_of_csv` and `schema_of_json` are not updated accordingly with previous PRs. ### Does this PR introduce _any_ user-facing change? Yes, update user-facing doc. ### How was this patch tested? Unit test. Closes #30396 from viirya/minor-json-csv. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-11-18 11:32:27 +09:00
Liang-Chi Hsieh	928348408e	[SPARK-33427][SQL] Add subexpression elimination for interpreted expression evaluation ### What changes were proposed in this pull request? This patch proposes to add subexpression elimination for interpreted expression evaluation. Interpreted expression evaluation is used when codegen was not able to work, for example complex schema. ### Why are the changes needed? Currently we only do subexpression elimination for codegen. For some reasons, we may need to run interpreted expression evaluation. For example, codegen fails to compile and fallbacks to interpreted mode, or complex input/output schema of expressions. It is commonly seen for complex schema from expressions that is possibly caused by the query optimizer too, e.g. SPARK-32945. We should also support subexpression elimination for interpreted evaluation. That could reduce performance difference when Spark fallbacks from codegen to interpreted expression evaluation, and improve Spark usability. #### Benchmark Update `SubExprEliminationBenchmark`: Before: ``` OpenJDK 64-Bit Server VM 1.8.0_265-b01 on Mac OS X 10.15.6 Intel(R) Core(TM) i7-9750H CPU 2.60GHz from_json as subExpr: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------- subexpressionElimination on, codegen off 24707 25688 903 0.0 247068775.9 1.0X ``` After: ``` OpenJDK 64-Bit Server VM 1.8.0_265-b01 on Mac OS X 10.15.6 Intel(R) Core(TM) i7-9750H CPU 2.60GHz from_json as subExpr: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------- subexpressionElimination on, codegen off 2360 2435 87 0.0 23604320.7 11.2X ``` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit test. Benchmark manually. Closes #30341 from viirya/SPARK-33427. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-11-17 14:29:37 +00:00
Yuming Wang	09bb9bedcd	[SPARK-33416][SQL] Avoid Hive metastore stack overflow when InSet predicate have many values ### What changes were proposed in this pull request? We [rewrite](`5197c5d2e7/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveShim.scala (L722-L724)`) `In`/`InSet` predicate to `or` expressions when pruning Hive partitions. That will cause Hive metastore stack over flow if there are a lot of values. This pr rewrite `InSet` predicate to `GreaterThanOrEqual` min value and `LessThanOrEqual ` max value when pruning Hive partitions to avoid Hive metastore stack overflow. From our experience, `spark.sql.hive.metastorePartitionPruningInSetThreshold` should be less than 10000. ### Why are the changes needed? Avoid Hive metastore stack overflow when `InSet` predicate have many values. Especially DPP, it may generate many values. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual test. Closes #30325 from wangyum/SPARK-33416. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-11-17 13:47:01 +00:00
HyukjinKwon	e2c7bfce40	[SPARK-33407][PYTHON] Simplify the exception message from Python UDFs (disabled by default) ### What changes were proposed in this pull request? This PR proposes to simplify the exception messages from Python UDFS. Currently, the exception message from Python UDFs is as below: ```python from pyspark.sql.functions import udf; spark.range(10).select(udf(lambda x: x/0)("id")).collect() ``` ```python Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/.../python/pyspark/sql/dataframe.py", line 427, in show print(self._jdf.showString(n, 20, vertical)) File "/.../python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__ File "/.../python/pyspark/sql/utils.py", line 127, in deco raise_from(converted) File "<string>", line 3, in raise_from pyspark.sql.utils.PythonException: An exception was thrown from Python worker in the executor: Traceback (most recent call last): File "/.../python/lib/pyspark.zip/pyspark/worker.py", line 605, in main process() File "/.../python/lib/pyspark.zip/pyspark/worker.py", line 597, in process serializer.dump_stream(out_iter, outfile) File "/.../python/lib/pyspark.zip/pyspark/serializers.py", line 223, in dump_stream self.serializer.dump_stream(self._batched(iterator), stream) File "/.../python/lib/pyspark.zip/pyspark/serializers.py", line 141, in dump_stream for obj in iterator: File "/.../python/lib/pyspark.zip/pyspark/serializers.py", line 212, in _batched for item in iterator: File "/.../python/lib/pyspark.zip/pyspark/worker.py", line 450, in mapper result = tuple(f([a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs) File "/.../python/lib/pyspark.zip/pyspark/worker.py", line 450, in <genexpr> result = tuple(f([a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs) File "/.../python/lib/pyspark.zip/pyspark/worker.py", line 90, in <lambda> return lambda a: f(a) File "/.../python/lib/pyspark.zip/pyspark/util.py", line 107, in wrapper return f(args, *kwargs) File "<stdin>", line 1, in <lambda> ZeroDivisionError: division by zero ``` Actually, almost all cases, users only care about `ZeroDivisionError: division by zero`. We don't really have to show the internal stuff in 99% cases. This PR adds a configuration `spark.sql.execution.pyspark.udf.simplifiedException.enabled` (disabled by default) that hides the internal tracebacks related to Python worker, (de)serialization, etc. ```python Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/.../python/pyspark/sql/dataframe.py", line 427, in show print(self._jdf.showString(n, 20, vertical)) File "/.../python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__ File "/.../python/pyspark/sql/utils.py", line 127, in deco raise_from(converted) File "<string>", line 3, in raise_from pyspark.sql.utils.PythonException: An exception was thrown from Python worker in the executor: Traceback (most recent call last): File "<stdin>", line 1, in <lambda> ZeroDivisionError: division by zero ``` The trackback will be shown from the point when any non-PySpark file is seen in the traceback. ### Why are the changes needed? Without this configuration. such internal tracebacks are exposed to users directly especially for shall or notebook users in PySpark. 99% cases people don't care about the internal Python worker, (de)serialization and related tracebacks. It just makes the exception more difficult to read. For example, one statement of `x/0` above shows a very long traceback and most of them are unnecessary. This configuration enables the ability to show simplified tracebacks which users will likely be most interested in. ### Does this PR introduce _any_ user-facing change? By default, no. It adds one configuration that simplifies the exception message. See the example above. ### How was this patch tested? Manually tested: ```bash $ pyspark --conf spark.sql.execution.pyspark.udf.simplifiedException.enabled=true ``` ```python from pyspark.sql.functions import udf; spark.sparkContext.setLogLevel("FATAL"); spark.range(10).select(udf(lambda x: x/0)("id")).collect() ``` and unittests were also added. Closes #30309 from HyukjinKwon/SPARK-33407. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-11-17 14:15:31 +09:00
Cheng Su	5af5aa146e	[SPARK-33209][SS] Refactor unit test of stream-stream join in UnsupportedOperationsSuite ### What changes were proposed in this pull request? This PR is a followup from https://github.com/apache/spark/pull/30076 to refactor unit test of stream-stream join in `UnsupportedOperationsSuite`, where we had a lot of duplicated code for stream-stream join unit test, for each join type. ### Why are the changes needed? Help reduce duplicated code and make it easier for developers to read and add code in the future. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing unit test in `UnsupportedOperationsSuite.scala` (pure refactoring). Closes #30347 from c21/stream-test. Authored-by: Cheng Su <chengsu@fb.com> Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>	2020-11-17 11:18:42 +09:00
Prakhar Jain	f5e3302840	[SPARK-33399][SQL] Normalize output partitioning and sortorder with respect to aliases to avoid unneeded exchange/sort nodes ### What changes were proposed in this pull request? This pull request tries to remove unneeded exchanges/sorts by normalizing the output partitioning and sortorder information correctly with respect to aliases. Example: consider this join of three tables: \|SELECT t2id, t3.id as t3id \|FROM ( \| SELECT t1.id as t1id, t2.id as t2id \| FROM t1, t2 \| WHERE t1.id = t2.id \|) t12, t3 \|WHERE t1id = t3.id The plan for this looks like: (9) Project [t2id#1034L, id#1004L AS t3id#1035L] +- (9) SortMergeJoin [t1id#1033L], [id#1004L], Inner :- (6) Sort [t1id#1033L ASC NULLS FIRST], false, 0 : +- Exchange hashpartitioning(t1id#1033L, 5), true, [id=#1343] <------------------------------ : +- (5) Project [id#996L AS t1id#1033L, id#1000L AS t2id#1034L] : +- (5) SortMergeJoin [id#996L], [id#1000L], Inner : :- (2) Sort [id#996L ASC NULLS FIRST], false, 0 : : +- Exchange hashpartitioning(id#996L, 5), true, [id=#1329] : : +- (1) Range (0, 10, step=1, splits=2) : +- (4) Sort [id#1000L ASC NULLS FIRST], false, 0 : +- Exchange hashpartitioning(id#1000L, 5), true, [id=#1335] : +- (3) Range (0, 20, step=1, splits=2) +- (8) Sort [id#1004L ASC NULLS FIRST], false, 0 +- Exchange hashpartitioning(id#1004L, 5), true, [id=#1349] +- *(7) Range (0, 30, step=1, splits=2) In this plan, the marked exchange could have been avoided as the data is already partitioned on "t1.id". This happens because AliasAwareOutputPartitioning class handles aliases only related to HashPartitioning. This change normalizes all output partitioning based on aliasing happening in Project. ### Why are the changes needed? To remove unneeded exchanges. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? New UT added. On TPCDS 1000 scale, this change improves the performance of query 95 from 330 seconds to 170 seconds by removing the extra Exchange. Closes #30300 from prakharjain09/SPARK-33399-outputpartitioning. Authored-by: Prakhar Jain <prakharjain09@gmail.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2020-11-17 10:35:43 +09:00
xuewei.linxuewei	b5eca18af0	[SPARK-33460][SQL] Accessing map values should fail if key is not found ### What changes were proposed in this pull request? Instead of returning NULL, throws runtime NoSuchElementException towards invalid key accessing in map-like functions, such as element_at, GetMapValue, when ANSI mode is on. ### Why are the changes needed? For ANSI mode. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Added UT and Existing UT. Closes #30386 from leanken/leanken-SPARK-33460. Authored-by: xuewei.linxuewei <xuewei.linxuewei@alibaba-inc.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-11-16 16:14:31 +00:00
Max Gekk	6883f29465	[SPARK-33453][SQL][TESTS] Unify v1 and v2 SHOW PARTITIONS tests ### What changes were proposed in this pull request? 1. Move `SHOW PARTITIONS` parsing tests to `ShowPartitionsParserSuite` 2. Place Hive tests for `SHOW PARTITIONS` from `HiveCommandSuite` to the base test suite `v1.ShowPartitionsSuiteBase`. This will allow to run the tests w/ and w/o Hive. The changes follow the approach of https://github.com/apache/spark/pull/30287. ### Why are the changes needed? - The unification will allow to run common `SHOW PARTITIONS` tests for both DSv1 and Hive DSv1, DSv2 - We can detect missing features and differences between DSv1 and DSv2 implementations. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running: - new test suites `build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *ShowPartitionsSuite"` - and old one `build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly org.apache.spark.sql.hive.execution.HiveCommandSuite"` Closes #30377 from MaxGekk/unify-dsv1_v2-show-partitions-tests. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-11-16 16:11:42 +00:00
luluorta	dfa6fb46f4	[SPARK-33389][SQL] Make internal classes of SparkSession always using active SQLConf ### What changes were proposed in this pull request? This PR makes internal classes of SparkSession always using active SQLConf. We should remove all `conf: SQLConf`s from ctor-parameters of this classes (`Analyzer`, `SparkPlanner`, `SessionCatalog`, `CatalogManager` `SparkSqlParser` and etc.) and use `SQLConf.get` instead. ### Why are the changes needed? Code refine. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing test Closes #30299 from luluorta/SPARK-33389. Authored-by: luluorta <luluorta@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-11-16 15:27:18 +00:00
xuewei.linxuewei	aa508fcc03	[SPARK-33140][SQL][FOLLOW-UP] Revert code that not use passed-in SparkSession to get SQLConf ### What changes were proposed in this pull request? Revert code that does not use passed-in SparkSession to get SQLConf in [SPARK-33140]. The change scope of [SPARK-33140] change passed-in SQLConf instance and place using SparkSession to get SQLConf to be unified to use SQLConf.get. And the code reverted in the patch, the passed-in SparkSession was not about to get SQLConf, but using its catalog, it's better to be consistent. ### Why are the changes needed? Potential regression bug. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Existing UT. Closes #30364 from leanken/leanken-SPARK-33140. Authored-by: xuewei.linxuewei <xuewei.linxuewei@alibaba-inc.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-11-16 11:57:50 +00:00
Max Gekk	71a29b2eca	[MINOR][SQL][DOCS] Fix a reference to `spark.sql.sources.useV1SourceList` ### What changes were proposed in this pull request? Replace `spark.sql.sources.write.useV1SourceList` by `spark.sql.sources.useV1SourceList` in the comment for `CatalogManager.v2SessionCatalog()`. ### Why are the changes needed? To have correct comments. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running `./dev/scalastyle`. Closes #30385 from MaxGekk/fix-comment-useV1SourceList. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-11-16 17:57:20 +09:00
Liang-Chi Hsieh	10b011f837	[SPARK-33456][SQL][TEST][FOLLOWUP] Fix SUBEXPRESSION_ELIMINATION_ENABLED config name ### What changes were proposed in this pull request? To fix wrong config name in `subexp-elimination.sql`. ### Why are the changes needed? `CONFIG_DIM` should use config name's key. ### Does this PR introduce _any_ user-facing change? No, dev only. ### How was this patch tested? Unit test. Closes #30384 from viirya/SPARK-33456-followup. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-11-16 17:53:31 +09:00
Yuming Wang	cdcbdaeb0d	[SPARK-33458][SQL] Hive partition pruning support Contains, StartsWith and EndsWith predicate ### What changes were proposed in this pull request? This pr add support Hive partition pruning on `Contains`, `StartsWith` and `EndsWith` predicate. ### Why are the changes needed? Improve query performance. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Unit test. Closes #30383 from wangyum/SPARK-33458. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-11-16 07:18:13 +00:00
Max Gekk	4e5d2e0695	[SPARK-33394][SQL][TESTS] Throw `NoSuchNamespaceException` for not existing namespace in `InMemoryTableCatalog.listTables()` ### What changes were proposed in this pull request? Throw `NoSuchNamespaceException` in `listTables()` of the custom test catalog `InMemoryTableCatalog` if the passed namespace doesn't exist. ### Why are the changes needed? 1. To align behavior of V2 `InMemoryTableCatalog` to V1 session catalog. 2. To distinguish two situations: 1. A namespace does exist but does not contain any tables. In that case, `listTables()` returns empty result. 2. A namespace does not exist. `listTables()` throws `NoSuchNamespaceException` in this case. ### Does this PR introduce _any_ user-facing change? Yes. For example, `SHOW TABLES` returns empty result before the changes. ### How was this patch tested? By running V1/V2 ShowTablesSuites. Closes #30358 from MaxGekk/show-tables-in-not-existing-namespace. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-11-16 07:08:21 +00:00
Liang-Chi Hsieh	d4cf1483fd	[SPARK-33456][SQL][TEST] Add end-to-end test for subexpression elimination ### What changes were proposed in this pull request? This patch proposes to add end-to-end test for subexpression elimination. ### Why are the changes needed? We have subexpression elimination feature for expression evaluation but we don't have end-to-end tests for the feature. We should have one to make sure we don't break it. ### Does this PR introduce _any_ user-facing change? No, dev only. ### How was this patch tested? Unit tests. Closes #30381 from viirya/SPARK-33456. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-11-16 15:47:35 +09:00
artiship	1ae6d64b5f	[SPARK-33358][SQL] Return code when command process failed Exit Spark SQL CLI processing loop if one of the commands (sub sql statement) process failed This is a regression at Apache Spark 3.0.0. ``` $ cat 1.sql select * from nonexistent_table; select 2; ``` Apache Spark 2.4.7 ``` spark-2.4.7-bin-hadoop2.7:$ bin/spark-sql -f 1.sql 20/11/15 16:14:38 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable Error in query: Table or view not found: nonexistent_table; line 1 pos 14 ``` Apache Spark 3.0.1 ``` $ bin/spark-sql -f 1.sql Error in query: Table or view not found: nonexistent_table; line 1 pos 14; 'Project [] +- 'UnresolvedRelation [nonexistent_table] 2 Time taken: 2.786 seconds, Fetched 1 row(s) ``` Apache Hive 1.2.2* ``` apache-hive-1.2.2-bin:$ bin/hive -f 1.sql Logging initialized using configuration in jar:file:/Users/dongjoon/APACHE/hive-release/apache-hive-1.2.2-bin/lib/hive-common-1.2.2.jar!/hive-log4j.properties FAILED: SemanticException [Error 10001]: Line 1:14 Table not found 'nonexistent_table' ``` Yes. This is a fix of regression. Pass the UT. Closes #30263 from artiship/SPARK-33358. Authored-by: artiship <meilziner@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-11-15 16:57:12 -08:00
Liang-Chi Hsieh	eea846b895	[SPARK-33455][SQL][TEST] Add SubExprEliminationBenchmark for benchmarking subexpression elimination ### What changes were proposed in this pull request? This patch adds a benchmark `SubExprEliminationBenchmark` for benchmarking subexpression elimination feature. ### Why are the changes needed? We need a benchmark for subexpression elimination feature for change such as #30341. ### Does this PR introduce _any_ user-facing change? No, dev only. ### How was this patch tested? Unit test. Closes #30379 from viirya/SPARK-33455. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-11-14 19:02:36 -08:00
luluorta	156704ba0d	[SPARK-33432][SQL] SQL parser should use active SQLConf ### What changes were proposed in this pull request? This PR makes SQL parser using active SQLConf instead of the one in ctor-parameters. ### Why are the changes needed? In ANSI mode, schema string parsing should fail if the schema uses ANSI reserved keyword as attribute name: ```scala spark.conf.set("spark.sql.ansi.enabled", "true") spark.sql("""select from_json('{"time":"26/10/2015"}', 'time Timestamp', map('timestampFormat', 'dd/MM/yyyy'));""").show ``` output: > Cannot parse the data type: > no viable alternative at input 'time'(line 1, pos 0) > > == SQL == > time Timestamp > ^^^ But this query may accidentally succeed in certain cases cause the DataType parser sticks to the configs of the first created session in the current thread: ```scala DataType.fromDDL("time Timestamp") val newSpark = spark.newSession() newSpark.conf.set("spark.sql.ansi.enabled", "true") newSpark.sql("""select from_json('{"time":"26/10/2015"}', 'time Timestamp', map('timestampFormat', 'dd/MM/yyyy'));""").show ``` output: > +--------------------------------+ > \|from_json({"time":"26/10/2015"})\| > +--------------------------------+ > \| {2015-10-26 00:00...\| > +--------------------------------+ ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Newly and updated UTs Closes #30357 from luluorta/SPARK-33432. Authored-by: luluorta <luluorta@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-11-14 13:37:12 -08:00
artiship	34a9a77ab5	[SPARK-33396][SQL] Spark SQL CLI prints appliction id when process file ### What changes were proposed in this pull request? Modify SparkSQLCLIDriver.scala to move ahead calling the cli.printMasterAndAppId method before process file. ### Why are the changes needed? Even though in SPARK-25043 it has already brought in the printing application id feature. But the process file situation seems have not been included. This small change is to make spark-sql will also print out application id when process file. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? env ``` spark version: 3.0.1 os: centos 7 ``` /tmp/tmp.sql ```sql select 1; ``` submit command: ```sh export HADOOP_USER_NAME=my-hadoop-user bin/spark-sql \ --master yarn \ --deploy-mode client \ --queue my.queue.name \ --conf spark.driver.host=$(hostname -i) \ --conf spark.app.name=spark-test \ --name "spark-test" \ -f /tmp/tmp.sql ``` execution log: ```sh 20/11/09 23:18:39 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 20/11/09 23:18:40 WARN HiveConf: HiveConf of name hive.spark.client.rpc.server.address.use.ip does not exist 20/11/09 23:18:40 WARN HiveConf: HiveConf of name hive.spark.client.submit.timeout.interval does not exist 20/11/09 23:18:40 WARN HiveConf: HiveConf of name hive.enforce.bucketing does not exist 20/11/09 23:18:40 WARN HiveConf: HiveConf of name hive.server2.enable.impersonation does not exist 20/11/09 23:18:40 WARN HiveConf: HiveConf of name hive.run.timeout.seconds does not exist 20/11/09 23:18:40 WARN HiveConf: HiveConf of name hive.support.sql11.reserved.keywords does not exist 20/11/09 23:18:40 WARN DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded. 20/11/09 23:18:41 WARN SparkConf: Note that spark.local.dir will be overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in mesos/standalone/kubernetes and LOCAL_DIRS in YARN). 20/11/09 23:18:42 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME. 20/11/09 23:18:52 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Attempted to request executors before the AM has registered! Spark master: yarn, Application Id: application_1567136266901_27355775 1 1 Time taken: 4.974 seconds, Fetched 1 row(s) ``` Closes #30301 from artiship/SPARK-33396. Authored-by: artiship <meilziner@gmail.com> Signed-off-by: Yuming Wang <yumwang@ebay.com>	2020-11-14 20:54:17 +08:00
Liang-Chi Hsieh	0046222a75	[SPARK-33337][SQL][FOLLOWUP] Prevent possible flakyness in SubexpressionEliminationSuite ### What changes were proposed in this pull request? This is a simple followup to prevent test flakyness in SubexpressionEliminationSuite. If `getAllEquivalentExprs` returns more than 1 sequences, due to HashMap, we should use `contains` instead of assuming the order of results. ### Why are the changes needed? Prevent test flakyness in SubexpressionEliminationSuite. ### Does this PR introduce _any_ user-facing change? No, dev only. ### How was this patch tested? Unit test. Closes #30371 from viirya/SPARK-33337-followup. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-11-13 15:10:02 -08:00
xuewei.linxuewei	234711a328	Revert "[SPARK-33139][SQL] protect setActionSession and clearActiveSession" ### What changes were proposed in this pull request? In [SPARK-33139] we defined `setActionSession` and `clearActiveSession` as deprecated API, it turns out it is widely used, and after discussion, even if without this PR, it should work with unify view feature, it might only be a risk if user really abuse using these two API. So revert the PR is needed. [SPARK-33139] has two commit, include a follow up. Revert them both. ### Why are the changes needed? Revert. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Existing UT. Closes #30367 from leanken/leanken-revert-SPARK-33139. Authored-by: xuewei.linxuewei <xuewei.linxuewei@alibaba-inc.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-11-13 13:35:45 +00:00

1 2 3 4 5 ...

10225 commits