### What changes were proposed in this pull request?
Revert the change in #29959 and don't remove `SerializableFileStatus` and `SerializableBlockLocation`.
### Why are the changes needed?
In Hadoop 2.7 `FileStatus` and `BlockLocation` are not serializable, so we still need the two wrapper classes.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
N/A
Closes #30447 from sunchao/SPARK-32381-followup.
Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
1. Add a hook method `addPartitionKey()` to `InMemoryTable`, which is called for every row (see the sketch below).
2. Override `addPartitionKey()` in `InMemoryPartitionTable` to add the partition key every time a new row is inserted into the table.
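A minimal sketch of the hook described above (class and member names are simplified stand-ins, not the actual test classes):
```scala
// Base test table: the hook is a no-op and is invoked once per inserted row.
abstract class BaseInMemoryTable {
  protected def addPartitionKey(key: Seq[Any]): Unit = {}
}

// Partitioned variant: record the partition key of every row that is written.
class PartitionedInMemoryTable extends BaseInMemoryTable {
  private val knownPartitions = scala.collection.mutable.Set.empty[Seq[Any]]
  override protected def addPartitionKey(key: Seq[Any]): Unit = knownPartitions += key
}
```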
### Why are the changes needed?
To be able to write unified tests for datasources V1 and V2. Currently, INSERT INTO a V1 table creates partitions but the same doesn't work for the custom catalog `InMemoryPartitionTableCatalog` used in DSv2 tests.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By running the affected test suite `DataSourceV2SQLSuite`.
Closes #30449 from MaxGekk/insert-into-InMemoryPartitionTable.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This PR fixes the representation to clarify the meaning of "number of rows dropped by watermark" in SS UI page.
### Why are the changes needed?
`Aggregated Number Of State Rows Dropped By Watermark` says that the dropped rows are from the state, whereas they're not. We say rows are "evicted from the state" in that case, which is the "normal" way to emit outputs and reduce the memory usage of the state.
The metric actually represents the number of "input" rows dropped by watermark, where "input" is relative to the "stateful operator". That's a bit confusing, as we normally think of "input" as "input from the source", whereas here it's not.
### Does this PR introduce _any_ user-facing change?
Yes, UI element & tooltip change.
### How was this patch tested?
Only a text change in the UI, so the effect of the change is easy to verify visually.
Closes #30439 from HeartSaVioR/SPARK-33223-FOLLOWUP.
Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
### What changes were proposed in this pull request?
Group all messages of `AnalysisException`s created and thrown directly in org.apache.spark.sql.catalyst.analysis.Analyzer in one file.
* Create a new object, `org.apache.spark.sql.CatalystErrors`, with many exception-creating functions (sketched below).
* When the `Analyzer` wants to create and throw a new `AnalysisException`, call functions of `CatalystErrors`.
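A hedged sketch of the pattern (message text and signatures assumed; the object lives in the `org.apache.spark.sql` package, which lets it call the `protected[sql]` `AnalysisException` constructor):
```scala
import org.apache.spark.sql.AnalysisException

object CatalystErrors {
  // Each function builds one specific error; the Analyzer only calls these.
  def groupingIDMismatchError(found: String, expected: String): Throwable =
    new AnalysisException(
      s"Columns of grouping_id ($found) do not match grouping columns ($expected)")
}

// In the Analyzer:
//   throw CatalystErrors.groupingIDMismatchError(...)
```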
### Why are the changes needed?
This is a sample PR that groups exception messages together in several files. It will largely help with the standardization of error messages and their maintenance.
### Does this PR introduce _any_ user-facing change?
No. Error messages remain unchanged.
### How was this patch tested?
No new tests; all original tests pass to make sure this doesn't break any existing behavior.
### Naming of exception functions
All function names end with `Error`.
* For specific errors like `groupingIDMismatch` and `groupingColInvalid`, use them directly as the name, e.g. `groupingIDMismatchError` and `groupingColInvalidError`.
* For generic errors like `dataTypeMismatch`:
  * if confident with the context, a prefix and condition can be added, e.g. `pivotValDataTypeMismatchError`;
  * if not sure about the context, add a `For` suffix naming the specific component the exception relates to, e.g. `dataTypeMismatchForDeserializerError`.
Closes #29497 from anchovYu/32670.
Lead-authored-by: anchovYu <aureole@sjtu.edu.cn>
Co-authored-by: anchovYu <xyyu15@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This adds changes in the following places:
- logic to also refresh caches referencing the target table in v2 `AppendDataExec`, `OverwriteByExpressionExec`, `OverwritePartitionsDynamicExec`, as well as their v1 fallbacks `AppendDataExecV1` and `OverwriteByExpressionExecV1`.
- logic to invalidate caches referencing the target table in v2 `ReplaceTableAsSelectExec` and its atomic version `AtomicReplaceTableAsSelectExec`. These are only supported in v2 at the moment though.
In addition to the above, in order to test the v1 write fallback behavior, I extended `InMemoryTableWithV1Fallback` to also support batch reads.
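A hedged illustration of the staleness this PR prevents (catalog and table names assumed):
```scala
// Cache a v2 table, then change its content through a v2 write:
spark.table("testcat.ns.t").cache()
spark.sql("INSERT INTO testcat.ns.t VALUES (1, 'a')")
// Before this PR the cached data could be stale; now the relevant caches are
// refreshed (append/overwrite) or invalidated (REPLACE TABLE AS SELECT).
spark.table("testcat.ns.t").show()
```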
### Why are the changes needed?
Currently in DataSource v2 we don't refresh or invalidate caches referencing the target table when the table content is changed by operations such as append, overwrite, or replace table. This is different from DataSource v1, and could potentially cause data correctness issues if the stale caches are queried later.
### Does this PR introduce _any_ user-facing change?
Yes. Now, when a DataSource v2 table is cached (either directly or indirectly), all the relevant caches will be refreshed or invalidated if the table is replaced.
### How was this patch tested?
Added unit tests for the new code path.
Closes #30429 from sunchao/SPARK-33492.
Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
Change
```
./build/sbt -Pdocker-integration-tests "testOnly *xxxIntegrationSuite"
```
to
```
./build/sbt -Pdocker-integration-tests "testOnly org.apache.spark.sql.jdbc.xxxIntegrationSuite"
```
### Why are the changes needed?
We only want to start the v1 `xxxIntegrationSuite`, not the newly added `v2.xxxIntegrationSuite`.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Manually checked
Closes #30448 from huaxingao/dockertest.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
Implement a new strategy `mode`: replace missing values using the most frequent value in each column (see the usage sketch below).
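A hedged usage sketch (column names assumed):
```scala
import org.apache.spark.ml.feature.Imputer

// Replace missing values in each column with that column's most frequent value.
val imputer = new Imputer()
  .setStrategy("mode")
  .setInputCols(Array("a", "b"))
  .setOutputCols(Array("a_imputed", "b_imputed"))
// val model = imputer.fit(df)  // df: a DataFrame with numeric columns a and b
```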
### Why are the changes needed?
It is highly scalable, and it has long been available as a strategy in [sklearn.impute.SimpleImputer](https://scikit-learn.org/stable/modules/generated/sklearn.impute.SimpleImputer.html#sklearn.impute.SimpleImputer).
### Does this PR introduce _any_ user-facing change?
Yes, a new strategy is added
### How was this patch tested?
Updated test suites.
Closes #30397 from zhengruifeng/imputer_max_freq.
Lead-authored-by: Ruifeng Zheng <ruifengz@foxmail.com>
Co-authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
We skipped the test `HiveExternalCatalogVersionsSuite` when testing with Java 9 or later because our previous versions do not support Java 9 or later. We now add it back since we have a version that supports Java 9 or later.
### Why are the changes needed?
To recover test coverage.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Check CI logs.
Closes #30428 from AngersZhuuuu/SPARK-28704.
Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
With `ParseUrl`, instead of returning null, we throw an exception if the input string is not a valid URL (see the example below).
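A hedged `spark-shell` example (the exact exception type and message are assumptions):
```scala
spark.conf.set("spark.sql.ansi.enabled", "true")
// Previously returned NULL; under ANSI mode this now throws an exception:
spark.sql("SELECT parse_url('not a url', 'HOST')").collect()
```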
### Why are the changes needed?
For ANSI mode.
### Does this PR introduce _any_ user-facing change?
Yes, users will get an exception if `spark.sql.ansi.enabled` is set to `true`.
### How was this patch tested?
Add test.
Closes #30399 from ulysses-you/SPARK-33468.
Lead-authored-by: ulysses <youxiduo@weidian.com>
Co-authored-by: ulysses-you <youxiduo@weidian.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Remove tests from `DataSourceV2SQLSuite` that were copied to `AlterTablePartitionV2SQLSuite` by https://github.com/apache/spark/pull/29339.
### Why are the changes needed?
- To reduce tests execution time
- To improve test maintenance
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By running the modified tests:
```
$ build/sbt "test:testOnly *DataSourceV2SQLSuite"
$ build/sbt "test:testOnly *AlterTablePartitionV2SQLSuite"
```
Closes #30444 from MaxGekk/dedup-tests-AlterTablePartitionV2SQLSuite.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Move "unused-imports" check config to `SparkBuild.scala` and make it SBT specific.
### Why are the changes needed?
Make the unused-imports check SBT-specific.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Pass the Jenkins or GitHub Actions builds.
Closes #30441 from LuciferYang/SPARK-33441-FOLLOWUP.
Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Driver-side changes for coordinating push-based shuffle by selecting external shuffle services for merging partitions.
This PR includes changes related to `ShuffleMapStage` preparation, namely the selection of merger locations and initializing them as part of `ShuffleDependency`.
Currently this code is not used, as some of the changes would come subsequently as part of https://issues.apache.org/jira/browse/SPARK-32917 (shuffle blocks push as part of `ShuffleMapTask`), https://issues.apache.org/jira/browse/SPARK-32918 (support for the finalize API) and https://issues.apache.org/jira/browse/SPARK-32920 (finalization of the push/merge phase). This is why the tests here are also partial; once the above-mentioned changes are raised as PRs, we will have enough tests for the DAGScheduler piece of code as well.
### Why are the changes needed?
Added a new API in `SchedulerBackend` to get merger locations for push-based shuffle (sketched below). This is currently implemented for YARN; other cluster managers can have separate implementations, which is why a new API is introduced.
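A hedged sketch of the new hook (signature assumed; the default returns no mergers so other backends are unaffected):
```scala
import org.apache.spark.storage.BlockManagerId

private[spark] trait SchedulerBackend {
  // Cluster managers (currently YARN) override this to pick external shuffle
  // services that will merge shuffle partitions for push-based shuffle.
  def getShufflePushMergerLocations(
      numPartitions: Int,
      resourceProfileId: Int): Seq[BlockManagerId] = Nil
}
```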
### Does this PR introduce _any_ user-facing change?
Yes, a user-facing config to enable push-based shuffle is introduced.
### How was this patch tested?
Added unit tests partially; some of the changes in DAGScheduler depend on future changes, and DAGScheduler tests will be added along with those changes.
Closes #30164 from venkata91/upstream-SPARK-32919.
Lead-authored-by: Venkata krishnan Sowrirajan <vsowrirajan@linkedin.com>
Co-authored-by: Min Shen <mshen@linkedin.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
### What changes were proposed in this pull request?
I believe it's self-descriptive.
### Why are the changes needed?
To document supported features.
### Does this PR introduce _any_ user-facing change?
Yes, the docs are updated. It's master only.
### How was this patch tested?
Manually built the docs via `cd python/docs` and `make clean html`:
![Screen Shot 2020-11-20 at 10 59 07 AM](https://user-images.githubusercontent.com/6477701/99748225-7ad9b280-2b1f-11eb-86fd-165012b1bb7c.png)
Closes #30436 from HyukjinKwon/minor-doc-fix.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Structured Streaming statistics page code contains an indentation issue. This PR fixes it.
### Why are the changes needed?
Indent fix.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing unit tests.
Closes #30434 from gaborgsomogyi/STAT-INDENT-FIX.
Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
Explicitly convert `scala.collection.mutable.Buffer` to `Seq`. In Scala 2.13 `Seq` is an alias of `scala.collection.immutable.Seq` instead of `scala.collection.Seq`.
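For illustration, a minimal example of the conversion this PR applies (hypothetical values; `toSeq` also compiles on 2.12, where it is cheap):
```scala
import scala.collection.mutable

val patterns: mutable.Buffer[String] = mutable.Buffer("%a%", "%b%")
// In Scala 2.13, Seq aliases scala.collection.immutable.Seq, so an explicit
// conversion is needed wherever an API expects Seq:
val asSeq: Seq[String] = patterns.toSeq
```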
### Why are the changes needed?
Without the change build with Scala 2.13 fails with the following:
```
[error] /home/runner/work/spark/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala:1417:41: type mismatch;
[error] found : scala.collection.mutable.Buffer[org.apache.spark.unsafe.types.UTF8String]
[error] required: Seq[org.apache.spark.unsafe.types.UTF8String]
[error] case null => LikeAll(e, patterns)
[error] ^
[error] /home/runner/work/spark/spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala:1418:41: type mismatch;
[error] found : scala.collection.mutable.Buffer[org.apache.spark.unsafe.types.UTF8String]
[error] required: Seq[org.apache.spark.unsafe.types.UTF8String]
[error] case _ => NotLikeAll(e, patterns)
[error] ^
```
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
N/A
Closes #30431 from sunchao/SPARK-33045-followup.
Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
Spark already supports the `LIKE ALL` syntax, but it throws `StackOverflowError` if there are many elements (more than 14378 elements). We should implement a built-in function for `LIKE ALL` to fix this issue.
Why can the stack overflow happen in the current approach?
The current approach uses `reduceLeft` to connect each `Like(e, p)`, which makes the call depth of the thread too large, causing `StackOverflowError` problems.
Why does the fix in this PR avoid the error?
This PR supports a built-in function for `LIKE ALL` and avoids this issue (see the example below).
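A hedged `spark-shell` illustration (20000 patterns exceeds the ~14378-element limit reported above):
```scala
// With the old reduceLeft(Like(...)) chain this overflowed the stack;
// the flat built-in LikeAll expression evaluates it in one pass.
val patterns = (1 to 20000).map(i => s"'%$i%'").mkString(", ")
spark.sql(s"SELECT 'abc' LIKE ALL ($patterns)").collect()
```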
### Why are the changes needed?
1. Fix the `StackOverflowError` issue.
2. Support the built-in function `like_all`.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Jenkins test.
Closes #29999 from beliefer/SPARK-33045-like_all.
Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: beliefer <beliefer@163.com>
Co-authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Rename `CombineLimits` to `EliminateLimits` and add a check that removes a `Limit` whose child's maximum row count is no more than the limit value (sketched below).
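A hedged, simplified sketch of the rule's core check (helper names assumed):
```scala
import org.apache.spark.sql.catalyst.expressions.Expression
import org.apache.spark.sql.catalyst.plans.logical.{Limit, LogicalPlan}
import org.apache.spark.sql.catalyst.rules.Rule

object EliminateLimits extends Rule[LogicalPlan] {
  // A Limit is redundant if its child cannot produce more rows than the limit.
  private def canEliminate(limitExpr: Expression, child: LogicalPlan): Boolean =
    limitExpr.foldable && child.maxRows.exists(_ <= limitExpr.eval().asInstanceOf[Int])

  def apply(plan: LogicalPlan): LogicalPlan = plan transform {
    case Limit(le, child) if canEliminate(le, child) => child
  }
}
```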
### Why are the changes needed?
In ad-hoc scenarios, we always add a limit to the query if the user gives no explicit limit value, but not every limit is necessary.
A typical negative example is
```
select count(*) from t limit 100000;
```
It would be great if we could eliminate such limits on the Spark side.
Also, we made a benchmark for this case
```
runBenchmark("Sort and Limit") {
  val N = 100000
  val benchmark = new Benchmark("benchmark sort and limit", N)

  benchmark.addCase("TakeOrderedAndProject", 3) { _ =>
    spark.range(N).toDF("c").repartition(200).sort("c").take(200000)
  }

  benchmark.addCase("Sort And Limit", 3) { _ =>
    withSQLConf("spark.sql.execution.topKSortFallbackThreshold" -> "-1") {
      spark.range(N).toDF("c").repartition(200).sort("c").take(200000)
    }
  }

  benchmark.addCase("Sort", 3) { _ =>
    spark.range(N).toDF("c").repartition(200).sort("c").collect()
  }

  benchmark.run()
}
```
and the result is
```
Java HotSpot(TM) 64-Bit Server VM 1.8.0_191-b12 on Mac OS X 10.15.6
Intel(R) Core(TM) i5-5257U CPU 2.70GHz
benchmark sort and limit:     Best Time(ms)   Avg Time(ms)   Stdev(ms)   Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------
TakeOrderedAndProject                  1833           2259         382         0.1       18327.1       1.0X
Sort And Limit                         1417           1658         285         0.1       14167.5       1.3X
Sort                                   1324           1484         225         0.1       13238.3       1.4X
```
It shows that it makes sense to replace `TakeOrderedAndProjectExec` with `Sort + Project`.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Add test.
Closes #30368 from ulysses-you/SPARK-33442.
Authored-by: ulysses <youxiduo@weidian.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR switches the order of the rules `RemoveRedundantSorts` and `EnsureRequirements` so that `EnsureRequirements` is invoked before `RemoveRedundantSorts`, to avoid an `IllegalArgumentException` when instantiating `PartitioningCollection`.
### Why are the changes needed?
`RemoveRedundantSorts` rule uses SparkPlan's `outputPartitioning` to check whether a sort node is redundant. Currently, it is added before `EnsureRequirements`. Since `PartitioningCollection` requires left and right partitioning to have the same number of partitions, which is not necessarily true before applying `EnsureRequirements`, the rule can fail with the following exception:
```
IllegalArgumentException: requirement failed: PartitioningCollection requires all of its partitionings have the same numPartitions.
```
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Unit test
Closes #30373 from allisonwang-db/sort-follow-up.
Authored-by: allisonwang-db <66282705+allisonwang-db@users.noreply.github.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR is a follow-up for #30093 to update the config `spark.sql.execution.removeRedundantSorts` version to 2.4.8.
### Why are the changes needed?
To update the rule version, since it has been backported to 2.4 in #30194.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
N/A
Closes #30420 from allisonwang-db/spark-33183-follow-up.
Authored-by: allisonwang-db <66282705+allisonwang-db@users.noreply.github.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This PR fixes the RAT exclusion rule, which originated from SPARK-1144 (Apache Spark 1.0).
### Why are the changes needed?
This prevents situations like https://github.com/apache/spark/pull/30415.
Currently, it misses the `catalog` directory due to the `.log` rule.
```
$ dev/check-license
Could not find Apache license headers in the following files:
!????? /Users/dongjoon/APACHE/spark-merge/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/MetadataColumn.java
!????? /Users/dongjoon/APACHE/spark-merge/sql/catalyst/src/main/java/org/apache/spark/sql/connector/catalog/SupportsMetadataColumns.java
```
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass the CI with the new rule.
Closes #30418 from dongjoon-hyun/SPARK-RAT.
Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This pull request normalizes `SortOrder` properly to prevent unnecessary sort operators. Currently the `sameOrderExpressions` are not normalized as part of `AliasAwareOutputOrdering`.
Example: consider this join of three tables:
"""
|SELECT t2id, t3.id as t3id
|FROM (
| SELECT t1.id as t1id, t2.id as t2id
| FROM t1, t2
| WHERE t1.id = t2.id
|) t12, t3
|WHERE t1id = t3.id
""".
The plan for this looks like:
```
*(8) Project [t2id#1059L, id#1004L AS t3id#1060L]
+- *(8) SortMergeJoin [t2id#1059L], [id#1004L], Inner
   :- *(5) Sort [t2id#1059L ASC NULLS FIRST ], false, 0 <-----------------------------
   :  +- *(5) Project [id#1000L AS t2id#1059L]
   :     +- *(5) SortMergeJoin [id#996L], [id#1000L], Inner
   :        :- *(2) Sort [id#996L ASC NULLS FIRST ], false, 0
   :        :  +- Exchange hashpartitioning(id#996L, 5), true, [id=#1426]
   :        :     +- *(1) Range (0, 10, step=1, splits=2)
   :        +- *(4) Sort [id#1000L ASC NULLS FIRST ], false, 0
   :           +- Exchange hashpartitioning(id#1000L, 5), true, [id=#1432]
   :              +- *(3) Range (0, 20, step=1, splits=2)
   +- *(7) Sort [id#1004L ASC NULLS FIRST ], false, 0
      +- Exchange hashpartitioning(id#1004L, 5), true, [id=#1443]
         +- *(6) Range (0, 30, step=1, splits=2)
```
In this plan, the marked sort node could have been avoided as the data is already sorted on "t2.id" by the lower SortMergeJoin.
### Why are the changes needed?
To remove unneeded Sort operators.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
New UT added.
Closes #30302 from prakharjain09/SPARK-33400-sortorder.
Authored-by: Prakhar Jain <prakharjain09@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR fixes filter pushdown for an int partition column compared against a value of class java.lang.String when pruning partition columns.
How to reproduce this issue:
```scala
spark.sql("CREATE table test (name STRING) partitioned by (id int) STORED AS PARQUET")
spark.sql("CREATE VIEW test_view as select cast(id as string) as id, name from test")
spark.sql("SELECT * FROM test_view WHERE id = '0'").explain
```
```
20/11/15 06:19:01 INFO audit: ugi=root ip=unknown-ip-addr cmd=get_partitions_by_filter : db=default tbl=test
20/11/15 06:19:01 INFO MetaStoreDirectSql: Unable to push down SQL filter: Cannot push down filter for int column and value class java.lang.String
20/11/15 06:19:01 ERROR SparkSQLDriver: Failed in [SELECT * FROM test_view WHERE id = '0']
java.lang.RuntimeException: Caught Hive MetaException attempting to get partition metadata by filter from Hive. You can set the Spark configuration setting spark.sql.hive.manageFilesourcePartitions to false to work around this problem, however this will result in degraded performance. Please report a bug: https://issues.apache.org/jira/browse/SPARK
at org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:828)
at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$getPartitionsByFilter$1(HiveClientImpl.scala:745)
at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:294)
at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:227)
at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:226)
at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:276)
at org.apache.spark.sql.hive.client.HiveClientImpl.getPartitionsByFilter(HiveClientImpl.scala:743)
```
### Why are the changes needed?
Fix bug.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Unit test.
Closes #30380 from wangyum/SPARK-27421.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Yuming Wang <yumwang@ebay.com>
### What changes were proposed in this pull request?
This PR adds a new Scala compile arg to `pom.xml` to defend against new unused imports:
- `-Ywarn-unused-import` for Scala 2.12
- `-Wconf:cat=unused-imports:e` for Scala 2.13
The other file changes remove all unused imports in the Spark code.
### Why are the changes needed?
Clean up code and add a guarantee to defend against new unused imports.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Pass the Jenkins or GitHub Actions builds.
Closes #30351 from LuciferYang/remove-imports-core-module.
Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Add missing license headers for new files added in #28027.
### Why are the changes needed?
To fix licenses.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
This is a purely non-functional change.
Closes #30415 from rdblue/license-headers.
Authored-by: Ryan Blue <blue@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
Similar to `InterpretedUnsafeProjection`, this patch proposes to extend interpreted subexpression elimination to `InterpretedMutableProjection` and `InterpretedSafeProjection`.
### Why are the changes needed?
Enabling subexpression elimination can improve the performance of interpreted projections, as shown in `InterpretedUnsafeProjection`.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Unit test.
Closes #30406 from viirya/SPARK-33473.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This followup is to prevent possible test flakiness in `SubExprEvaluationRuntimeSuite`.
### Why are the changes needed?
Because `HashMap` doesn't guarantee ordering, the proxy expression IDs in `proxyExpressions` are not deterministic. So `SubExprEvaluationRuntimeSuite` should not test against them.
### Does this PR introduce _any_ user-facing change?
No, dev only.
### How was this patch tested?
Unit test.
Closes #30414 from viirya/SPARK-33427-followup.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
### What changes were proposed in this pull request?
Make the API key of DocSearch configurable and avoid hardcoding it in the HTML template.
### Why are the changes needed?
After https://github.com/apache/spark/pull/30292, our Spark documentation site supports searching.
However, the default API key always points to the latest release doc. We have to set different API keys for different releases. Otherwise, the search results are always based on the latest documentation (https://spark.apache.org/docs/latest/) even when visiting the documentation of previous releases.
As per discussion in https://github.com/apache/spark/pull/30292#issuecomment-725613417, we should make the API key configurable and set different values for different releases.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Manual test
Closes #30409 from gengliangwang/apiKey.
Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
Adds `from_avro` and `to_avro` functions to SparkR.
### Why are the changes needed?
Feature parity.
### Does this PR introduce _any_ user-facing change?
New functions exposed in SparkR API.
### How was this patch tested?
New unit tests.
Closes #30216 from zero323/SPARK-33304.
Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
In section 6.13 of the ANSI SQL standard, there are syntax rules for valid combinations of the source and target data types.
![image](https://user-images.githubusercontent.com/1097932/98212874-17356f80-1ef9-11eb-8f2b-385f32db404a.png)
Comparing the ANSI CAST syntax rules with the current default behavior of Spark:
![image](https://user-images.githubusercontent.com/1097932/98789831-b7870a80-23b7-11eb-9b5f-469a42e0ee4a.png)
To make Spark's ANSI mode more ANSI SQL compatible, I propose to disallow the following casting in ANSI mode:
```
TimeStamp <=> Boolean
Date <=> Boolean
Numeric <=> Timestamp
Numeric <=> Date
Numeric <=> Binary
String <=> Array
String <=> Map
String <=> Struct
```
The following castings are considered invalid in the ANSI SQL standard, but they are quite straightforward, so let's allow them for now:
```
Numeric <=> Boolean
String <=> Binary
```
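A hedged `spark-shell` example of one newly disallowed cast (the exact error type and message are assumptions):
```scala
spark.conf.set("spark.sql.ansi.enabled", "true")
// Numeric => Timestamp is disallowed under ANSI mode, so this now fails to analyze:
spark.sql("SELECT CAST(1 AS TIMESTAMP)")
```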
### Why are the changes needed?
Better ANSI SQL compliance
### Does this PR introduce _any_ user-facing change?
Yes, the following castings will not be allowed in ANSI mode:
```
TimeStamp <=> Boolean
Date <=> Boolean
Numeric <=> Timestamp
Numeric <=> Date
Numeric <=> Binary
String <=> Array
String <=> Map
String <=> Struct
```
### How was this patch tested?
Unit test
The ANSI Compliance doc preview:
![image](https://user-images.githubusercontent.com/1097932/98946017-2cd20880-24a8-11eb-8161-65749bfdd03a.png)
Closes #30260 from gengliangwang/ansiCanCast.
Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
This PR proposes:
- Add `~/.sbt` directory into the build cache, see also https://github.com/sbt/sbt/issues/3681
- Move `hadoop-2` below to put up together with `java-11` and `scala-213`, see https://github.com/apache/spark/pull/30391#discussion_r524881430
- Remove unnecessary `.m2` cache if you run SBT tests only.
- Remove `rm ~/.m2/repository/org/apache/spark`. If you don't run `sbt publishLocal` or `mvn install`, we don't need to care about it.
- Use Java 8 in the Scala 2.13 build. We can switch the Java version to 11 (used for releases) later.
- Add caches into the linters. The linter scripts use `sbt` in, for example, `./dev/lint-scala`, and `mvn` in, for example, `./dev/lint-java`. Also, the Jekyll build requires `sbt package`, see: https://github.com/apache/spark/blob/master/docs/_plugins/copy_api_dirs.rb#L160-L161. We need full caches here for SBT, Maven and the build tools.
- Use the same syntax for the Java version: 1.8 -> 8.
### Why are the changes needed?
- Remove unnecessary stuff
- Cache what we can in the build
### Does this PR introduce _any_ user-facing change?
No, dev-only.
### How was this patch tested?
It will be tested in GitHub Actions build at the current PR
Closes #30391 from HyukjinKwon/SPARK-33464.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This adds support for metadata columns to DataSourceV2. If a source implements `SupportsMetadataColumns` it must also implement `SupportsPushDownRequiredColumns` to support projecting those columns.
The analyzer is updated to resolve metadata columns from `LogicalPlan.metadataOutput`, and this adds a rule that will add metadata columns to the output of `DataSourceV2Relation` if one is used.
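A hedged illustration of projecting a metadata column (`_partition` is a name used by test sources; actual names depend on the implementation):
```scala
// Assuming `testcat.ns.t` is backed by a source implementing SupportsMetadataColumns:
val df = spark.table("testcat.ns.t")
// `_partition` is resolved from LogicalPlan.metadataOutput, not the normal schema.
df.select(df("id"), df("_partition")).show()
```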
### Why are the changes needed?
This is the solution discussed for exposing additional data in the Kafka source. It is also needed for a generic `MERGE INTO` plan.
### Does this PR introduce any user-facing change?
Yes. Users can project additional columns from sources that implement the new API. This also updates `DescribeTableExec` to show metadata columns.
### How was this patch tested?
Will include new unit tests.
Closes #28027 from rdblue/add-dsv2-metadata-columns.
Authored-by: Ryan Blue <blue@apache.org>
Signed-off-by: Burak Yavuz <brkyvz@gmail.com>
### What changes were proposed in this pull request?
This PR is a follow-up of #29471 and does the following improvements for `HadoopFSUtils`:
1. Removes the extra `filterFun` from the listing API and combines it with the `filter`.
2. Removes `SerializableBlockLocation` and `SerializableFileStatus` given that `BlockLocation` and `FileStatus` are already serializable.
3. Hides the `isRootLevel` flag from the top-level API.
### Why are the changes needed?
The main purpose is to simplify the logic within `HadoopFSUtils` as well as to clean up the API.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing unit tests (e.g., `FileIndexSuite`)
Closes #29959 from sunchao/hadoop-fs-utils-followup.
Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Holden Karau <hkarau@apple.com>
### What changes were proposed in this pull request?
Supports Python client deps from the launcher fs.
This is a feature that was added for Java deps. This PR adds support for Python as well.
### Does this PR introduce _any_ user-facing change?
Yes.
### How was this patch tested?
Manually running different scenarios and examining the driver & executor logs. Also, an integration test was added.
I verified that the Python resources are added to the Spark file server and are named properly so they don't fail the executors. Note that, as before, the following will not work:
a primary resource `A.py` that uses a closure defined in a submitted pyfile `B.py`; context.py only adds files with certain extensions (e.g. zip, egg, jar) to the python path.
Closes #25870 from skonto/python-deps.
Lead-authored-by: Stavros Kontopoulos <skontopo@redhat.com>
Co-authored-by: Stavros Kontopoulos <st.kontopoulos@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This PR aims to generalize executor metrics to support user-given file system schemes instead of the fixed `file,hdfs` scheme.
### Why are the changes needed?
For the users using only cloud storages like `S3A`, we need to be able to expose `S3A` metrics. Also, we can skip unused `hdfs` metrics.
### Does this PR introduce _any_ user-facing change?
Yes, but it is compatible for existing users who use only the `hdfs` and `file` filesystem schemes.
### How was this patch tested?
Manually do the following.
```
$ build/sbt -Phadoop-cloud package
$ sbin/start-master.sh; sbin/start-slave.sh spark://$(hostname):7077
$ bin/spark-shell --master spark://$(hostname):7077 -c spark.executor.metrics.fileSystemSchemes=file,s3a -c spark.metrics.conf.executor.sink.jmx.class=org.apache.spark.metrics.sink.JmxSink
scala> spark.read.textFile("s3a://dongjoon/README.md").collect()
```
Separately, launch `jconsole` and check `*.executor.filesystem.s3a.*`. Also, confirm that there is no `*.executor.filesystem.hdfs.*`
```
$ jconsole
```
![Screen Shot 2020-11-17 at 9 26 03 PM](https://user-images.githubusercontent.com/9700541/99487609-94121180-291b-11eb-9ed2-964546146981.png)
Closes #30405 from dongjoon-hyun/SPARK-33476.
Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
Use `maxBlockSizeInMB` instead of `blockSize` (number of rows) to control the stacking of vectors (see the usage sketch below).
### Why are the changes needed?
The performance gain is mainly related to the nnz (number of non-zeros) of the blocks.
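A hedged usage sketch (the setter name is inferred from the param; the default behavior is an assumption):
```scala
import org.apache.spark.ml.classification.LogisticRegression

// Stack input vectors into blocks of roughly 1 MB instead of a fixed row count.
val lor = new LogisticRegression()
  .setMaxBlockSizeInMB(1.0)
```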
### Does this PR introduce _any_ user-facing change?
Yes, the param `blockSize` is replaced by `blockSizeInMB` in master.
### How was this patch tested?
Updated test suites.
Closes #30355 from zhengruifeng/adaptively_blockify_aft_lir_lor.
Lead-authored-by: zhengruifeng <ruifengz@foxmail.com>
Co-authored-by: Ruifeng Zheng <ruifengz@foxmail.com>
Signed-off-by: Weichen Xu <weichen.xu@databricks.com>
### What changes were proposed in this pull request?
This is a follow-up for https://github.com/apache/spark/pull/29881.
It revises the documentation of the configuration `spark.sql.hive.metastore.jars`.
### Why are the changes needed?
Fix a grammatical error in the doc.
Also, make it clearer that the configuration is effective only when `spark.sql.hive.metastore.jars` is set to `path`.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Just doc changes.
Closes #30407 from gengliangwang/reviseJarPathDoc.
Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
### What changes were proposed in this pull request?
This PR intends to upgrade ANTLR runtime from 4.7.1 to 4.8-1.
### Why are the changes needed?
Release notes of v4.8 and v4.7.2 (the v4.7.2 release has a few minor bug fixes for Java targets):
- v4.8: https://github.com/antlr/antlr4/releases/tag/4.8
- v4.7.2: https://github.com/antlr/antlr4/releases/tag/4.7.2
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
GA tests.
Closes #30404 from maropu/UpgradeAntlr.
Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This change adds MapType support for PySpark with Arrow, if using pyarrow >= 2.0.0.
### Why are the changes needed?
MapType was previously unsupported with Arrow.
### Does this PR introduce _any_ user-facing change?
User can now enable MapType for `createDataFrame()`, `toPandas()` with Arrow optimization, and with Pandas UDFs.
### How was this patch tested?
Added new PySpark tests for createDataFrame(), toPandas() and Scalar Pandas UDFs.
Closes #30393 from BryanCutler/arrow-add-MapType-SPARK-24554.
Authored-by: Bryan Cutler <cutlerb@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Follow-up from #27831; original author: chrysan.
On each request, it checks `chunksBeingTransferred`:
```
public long chunksBeingTransferred() {
  long sum = 0L;
  for (StreamState streamState : streams.values()) {
    sum += streamState.chunksBeingTransferred.get();
  }
  return sum;
}
```
such as
```
long chunksBeingTransferred = streamManager.chunksBeingTransferred();
if (chunksBeingTransferred >= maxChunksBeingTransferred) {
  logger.warn("The number of chunks being transferred {} is above {}, close the connection.",
    chunksBeingTransferred, maxChunksBeingTransferred);
  channel.close();
  return;
}
```
This traverses `streams` repeatedly, and fetching a data chunk accesses `streams` too, which causes two problems:
1. Repeated traversal of `streams`: the longer the map, the longer it takes.
2. Lock contention on the `ConcurrentHashMap` `streams`.
In this PR, when `maxChunksBeingTransferred` uses its default value, we avoid computing `chunksBeingTransferred` since we don't care about it (see the sketch below). If users want to set this configuration and run into performance problems, they can also backport PR #27831.
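A hedged Scala sketch of the short-circuit (the real change is in the Java transport code; the types here are stand-ins):
```scala
trait StreamManager { def chunksBeingTransferred(): Long }

// Returns true when the connection should be closed. With the default (unlimited)
// setting we never pay the O(#streams) traversal inside chunksBeingTransferred().
def overChunkLimit(streamManager: StreamManager, maxChunksBeingTransferred: Long): Boolean =
  maxChunksBeingTransferred < Long.MaxValue &&
    streamManager.chunksBeingTransferred() >= maxChunksBeingTransferred
```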
### Why are the changes needed?
Speed up getting `chunksBeingTransferred` and avoid lock contention on the `streams` object.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existed UT
Closes #30139 from AngersZhuuuu/SPARK-31069.
Lead-authored-by: angerszhu <angers.zhu@gmail.com>
Co-authored-by: chrysan <chrysanxia@gmail.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
### What changes were proposed in this pull request?
This minor PR updates the docs of `schema_of_csv` and `schema_of_json`. They now allow a foldable string column instead of only a string literal (see the example below).
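For illustration, assuming a `spark-shell` session, both forms are accepted:
```scala
import org.apache.spark.sql.functions._

// A plain literal still works:
spark.range(1).select(schema_of_json(lit("""{"a": 1}"""))).show(false)
// A foldable (constant) string column is also accepted now:
spark.range(1).select(schema_of_json(concat(lit("""{"a": """), lit("1}")))).show(false)
```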
### Why are the changes needed?
The function docs of `schema_of_csv` and `schema_of_json` were not updated in line with previous PRs.
### Does this PR introduce _any_ user-facing change?
Yes, update user-facing doc.
### How was this patch tested?
Unit test.
Closes #30396 from viirya/minor-json-csv.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR aims to upgrade Kubernetes-client from 4.11.1 to 4.12.0
### Why are the changes needed?
This upgrades the dependency for Apache Spark 3.1.0.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Pass the CIs.
Closes #30401 from ramesh-muthusamy/SPARK-33471-k8s-clientupgrade.
Authored-by: Rameshkrishnan Muthusamy <rameshkrishnan_muthusamy@apple.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
Added an integration test which configures a log4j.properties file and checks that it is the one picked up by the driver.
### Why are the changes needed?
Improved test coverage.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By running integration tests.
Closes #30388 from ScrapCodes/SPARK-32222/k8s-it-spark-conf-propagate.
Authored-by: Prashant Sharma <prashsh1@in.ibm.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This patch proposes to add subexpression elimination for interpreted expression evaluation. Interpreted expression evaluation is used when codegen is unable to work, for example with complex schemas.
### Why are the changes needed?
Currently we only do subexpression elimination for codegen. For some reasons, we may need to run interpreted expression evaluation: for example, codegen fails to compile and falls back to interpreted mode, or expressions have complex input/output schemas. Complex schemas from expressions are commonly seen, possibly caused by the query optimizer too, e.g. SPARK-32945.
We should also support subexpression elimination for interpreted evaluation. That could reduce the performance difference when Spark falls back from codegen to interpreted expression evaluation, and improve Spark usability.
#### Benchmark
Update `SubExprEliminationBenchmark`:
Before:
```
OpenJDK 64-Bit Server VM 1.8.0_265-b01 on Mac OS X 10.15.6
Intel(R) Core(TM) i7-9750H CPU 2.60GHz
from_json as subExpr: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
-------------------------------------------------------------------------------------------------------------------------
subexpressionElimination on, codegen off 24707 25688 903 0.0 247068775.9 1.0X
```
After:
```
OpenJDK 64-Bit Server VM 1.8.0_265-b01 on Mac OS X 10.15.6
Intel(R) Core(TM) i7-9750H CPU 2.60GHz
from_json as subExpr: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
-------------------------------------------------------------------------------------------------------------------------
subexpressionElimination on, codegen off 2360 2435 87 0.0 23604320.7 11.2X
```
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Unit test. Benchmark manually.
Closes #30341 from viirya/SPARK-33427.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
We [rewrite](5197c5d2e7/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveShim.scala (L722-L724)) `In`/`InSet` predicates to `or` expressions when pruning Hive partitions. That will cause a Hive metastore stack overflow if there are a lot of values.
This PR rewrites an `InSet` predicate to `GreaterThanOrEqual` of the min value and `LessThanOrEqual` of the max value when pruning Hive partitions, to avoid the Hive metastore stack overflow (sketched below).
From our experience, `spark.sql.hive.metastorePartitionPruningInSetThreshold` should be less than 10000.
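A hedged, self-contained sketch of the rewrite idea (the real code works on catalyst predicates, not strings):
```scala
// If the value list exceeds the threshold, prune with a range instead of an IN list.
def toMetastoreFilter(attr: String, values: Seq[Int], threshold: Int): String =
  if (values.size > threshold) s"$attr >= ${values.min} AND $attr <= ${values.max}"
  else values.mkString(s"$attr IN (", ", ", ")")
```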
### Why are the changes needed?
Avoid a Hive metastore stack overflow when an `InSet` predicate has many values.
This matters especially for DPP, which may generate many values.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Manual test.
Closes #30325 from wangyum/SPARK-33416.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR proposes to simplify the exception messages from Python UDFS.
Currently, the exception message from Python UDFs is as below:
```python
from pyspark.sql.functions import udf; spark.range(10).select(udf(lambda x: x/0)("id")).collect()
```
```python
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/.../python/pyspark/sql/dataframe.py", line 427, in show
print(self._jdf.showString(n, 20, vertical))
File "/.../python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
File "/.../python/pyspark/sql/utils.py", line 127, in deco
raise_from(converted)
File "<string>", line 3, in raise_from
pyspark.sql.utils.PythonException:
An exception was thrown from Python worker in the executor:
Traceback (most recent call last):
File "/.../python/lib/pyspark.zip/pyspark/worker.py", line 605, in main
process()
File "/.../python/lib/pyspark.zip/pyspark/worker.py", line 597, in process
serializer.dump_stream(out_iter, outfile)
File "/.../python/lib/pyspark.zip/pyspark/serializers.py", line 223, in dump_stream
self.serializer.dump_stream(self._batched(iterator), stream)
File "/.../python/lib/pyspark.zip/pyspark/serializers.py", line 141, in dump_stream
for obj in iterator:
File "/.../python/lib/pyspark.zip/pyspark/serializers.py", line 212, in _batched
for item in iterator:
File "/.../python/lib/pyspark.zip/pyspark/worker.py", line 450, in mapper
result = tuple(f(*[a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs)
File "/.../python/lib/pyspark.zip/pyspark/worker.py", line 450, in <genexpr>
result = tuple(f(*[a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs)
File "/.../python/lib/pyspark.zip/pyspark/worker.py", line 90, in <lambda>
return lambda *a: f(*a)
File "/.../python/lib/pyspark.zip/pyspark/util.py", line 107, in wrapper
return f(*args, **kwargs)
File "<stdin>", line 1, in <lambda>
ZeroDivisionError: division by zero
```
Actually, in almost all cases, users only care about `ZeroDivisionError: division by zero`. We don't really have to show the internal stuff in 99% of cases.
This PR adds a configuration `spark.sql.execution.pyspark.udf.simplifiedException.enabled` (disabled by default) that hides the internal tracebacks related to Python worker, (de)serialization, etc.
```python
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/.../python/pyspark/sql/dataframe.py", line 427, in show
print(self._jdf.showString(n, 20, vertical))
File "/.../python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
File "/.../python/pyspark/sql/utils.py", line 127, in deco
raise_from(converted)
File "<string>", line 3, in raise_from
pyspark.sql.utils.PythonException:
An exception was thrown from Python worker in the executor:
Traceback (most recent call last):
File "<stdin>", line 1, in <lambda>
ZeroDivisionError: division by zero
```
The traceback will be shown from the point where any non-PySpark file is first seen in the traceback.
### Why are the changes needed?
Without this configuration, such internal tracebacks are exposed to users directly, especially for shell or notebook users in PySpark. In 99% of cases people don't care about the internal Python worker, (de)serialization and related tracebacks. It just makes the exception more difficult to read. For example, the one statement `x/0` above shows a very long traceback, most of which is unnecessary.
This configuration enables the ability to show simplified tracebacks which users will likely be most interested in.
### Does this PR introduce _any_ user-facing change?
By default, no. It adds one configuration that simplifies the exception message. See the example above.
### How was this patch tested?
Manually tested:
```bash
$ pyspark --conf spark.sql.execution.pyspark.udf.simplifiedException.enabled=true
```
```python
from pyspark.sql.functions import udf; spark.sparkContext.setLogLevel("FATAL"); spark.range(10).select(udf(lambda x: x/0)("id")).collect()
```
and unittests were also added.
Closes #30309 from HyukjinKwon/SPARK-33407.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR is a followup from https://github.com/apache/spark/pull/30076 to refactor the unit tests of stream-stream join in `UnsupportedOperationsSuite`, where we had a lot of duplicated code across the stream-stream join unit tests for each join type.
### Why are the changes needed?
Help reduce duplicated code and make it easier for developers to read and add code in the future.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing unit test in `UnsupportedOperationsSuite.scala` (pure refactoring).
Closes #30347 from c21/stream-test.
Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
### What changes were proposed in this pull request?
This pull request tries to remove unneeded exchanges/sorts by normalizing the output partitioning and sort order information correctly with respect to aliases.
Example: consider this join of three tables:
```
SELECT t2id, t3.id as t3id
FROM (
  SELECT t1.id as t1id, t2.id as t2id
  FROM t1, t2
  WHERE t1.id = t2.id
) t12, t3
WHERE t1id = t3.id
```
The plan for this looks like:
```
*(9) Project [t2id#1034L, id#1004L AS t3id#1035L]
+- *(9) SortMergeJoin [t1id#1033L], [id#1004L], Inner
   :- *(6) Sort [t1id#1033L ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(t1id#1033L, 5), true, [id=#1343] <------------------------------
   :     +- *(5) Project [id#996L AS t1id#1033L, id#1000L AS t2id#1034L]
   :        +- *(5) SortMergeJoin [id#996L], [id#1000L], Inner
   :           :- *(2) Sort [id#996L ASC NULLS FIRST], false, 0
   :           :  +- Exchange hashpartitioning(id#996L, 5), true, [id=#1329]
   :           :     +- *(1) Range (0, 10, step=1, splits=2)
   :           +- *(4) Sort [id#1000L ASC NULLS FIRST], false, 0
   :              +- Exchange hashpartitioning(id#1000L, 5), true, [id=#1335]
   :                 +- *(3) Range (0, 20, step=1, splits=2)
   +- *(8) Sort [id#1004L ASC NULLS FIRST], false, 0
      +- Exchange hashpartitioning(id#1004L, 5), true, [id=#1349]
         +- *(7) Range (0, 30, step=1, splits=2)
```
In this plan, the marked exchange could have been avoided as the data is already partitioned on "t1.id". This happens because the `AliasAwareOutputPartitioning` class handles aliases only related to `HashPartitioning`. This change normalizes all output partitioning based on the aliasing happening in `Project`.
### Why are the changes needed?
To remove unneeded exchanges.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
New UT added.
On TPCDS 1000 scale, this change improves the performance of query 95 from 330 seconds to 170 seconds by removing the extra Exchange.
Closes #30300 from prakharjain09/SPARK-33399-outputpartitioning.
Authored-by: Prakhar Jain <prakharjain09@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
I am pushing this PR as I could not re-open the stale one, https://github.com/apache/spark/pull/20665 .
As with YARN or Kubernetes, Mesos users should be able to specify priority queues to define a workload management policy for queued drivers in the Mesos Cluster Dispatcher.
This would ensure scheduling order while enqueuing Spark applications for a Mesos cluster.
### Why are the changes needed?
Currently, submitted drivers are kept in order of their submission: the first driver added to the queue will be the first one to be executed (FIFO), regardless of their priority.
See https://issues.apache.org/jira/projects/SPARK/issues/SPARK-23499 for more details.
### Does this PR introduce _any_ user-facing change?
The MesosClusterDispatcher UI now shows Spark jobs along with the queue to which they are submitted.
### How was this patch tested?
Unit tests.
Also, this feature has been in production for 3 years now, as we have used a modified Spark 2.4.0 since then.
Closes #30352 from pgillet/mesos-scheduler-priority-queue.
Lead-authored-by: Pascal Gillet <pascal.gillet@stack-labs.com>
Co-authored-by: pgillet <pascalgillet@ymail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
Instead of returning NULL, throw a runtime `NoSuchElementException` on invalid key access in map-like functions such as `element_at` and `GetMapValue`, when ANSI mode is on (see the example below).
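A hedged `spark-shell` example of the new behavior:
```scala
spark.conf.set("spark.sql.ansi.enabled", "true")
// Key 5 is absent, so under ANSI mode this now throws NoSuchElementException
// instead of returning NULL:
spark.sql("SELECT element_at(map(1, 'a'), 5)").collect()
```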
### Why are the changes needed?
For ANSI mode.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Added new unit tests and verified existing ones.
Closes #30386 from leanken/leanken-SPARK-33460.
Authored-by: xuewei.linxuewei <xuewei.linxuewei@alibaba-inc.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>