ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Prakhar Jain	0b0fb70b09	[SPARK-33400][SQL] Normalize sameOrderExpressions in SortOrder to avoid unnecessary sort operations ### What changes were proposed in this pull request? This pull request tries to normalize the SortOrder properly to prevent unnecessary sort operators. Currently the sameOrderExpressions are not normalized as part of AliasAwareOutputOrdering. Example: consider this join of three tables: """ \|SELECT t2id, t3.id as t3id \|FROM ( \| SELECT t1.id as t1id, t2.id as t2id \| FROM t1, t2 \| WHERE t1.id = t2.id \|) t12, t3 \|WHERE t1id = t3.id """. The plan for this looks like: (8) Project [t2id#1059L, id#1004L AS t3id#1060L] +- (8) SortMergeJoin [t2id#1059L], [id#1004L], Inner :- (5) Sort [t2id#1059L ASC NULLS FIRST ], false, 0 <----------------------------- : +- (5) Project [id#1000L AS t2id#1059L] : +- (5) SortMergeJoin [id#996L], [id#1000L], Inner : :- (2) Sort [id#996L ASC NULLS FIRST ], false, 0 : : +- Exchange hashpartitioning(id#996L, 5), true, [id=#1426] : : +- (1) Range (0, 10, step=1, splits=2) : +- (4) Sort [id#1000L ASC NULLS FIRST ], false, 0 : +- Exchange hashpartitioning(id#1000L, 5), true, [id=#1432] : +- (3) Range (0, 20, step=1, splits=2) +- (7) Sort [id#1004L ASC NULLS FIRST ], false, 0 +- Exchange hashpartitioning(id#1004L, 5), true, [id=#1443] +- *(6) Range (0, 30, step=1, splits=2) In this plan, the marked sort node could have been avoided as the data is already sorted on "t2.id" by the lower SortMergeJoin. ### Why are the changes needed? To remove unneeded Sort operators. ### Does this PR introduce any user-facing change? No ### How was this patch tested? New UT added. Closes #30302 from prakharjain09/SPARK-33400-sortorder. Authored-by: Prakhar Jain <prakharjain09@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-11-19 06:25:37 +00:00
Yuming Wang	014e1fbb3a	[SPARK-27421][SQL] Fix filter for int column and value class java.lang.String when pruning partition column ### What changes were proposed in this pull request? This pr fix filter for int column and value class java.lang.String when pruning partition column. How to reproduce this issue: ```scala spark.sql("CREATE table test (name STRING) partitioned by (id int) STORED AS PARQUET") spark.sql("CREATE VIEW test_view as select cast(id as string) as id, name from test") spark.sql("SELECT * FROM test_view WHERE id = '0'").explain ``` ``` 20/11/15 06:19:01 INFO audit: ugi=root ip=unknown-ip-addr cmd=get_partitions_by_filter : db=default tbl=test 20/11/15 06:19:01 INFO MetaStoreDirectSql: Unable to push down SQL filter: Cannot push down filter for int column and value class java.lang.String 20/11/15 06:19:01 ERROR SparkSQLDriver: Failed in [SELECT * FROM test_view WHERE id = '0'] java.lang.RuntimeException: Caught Hive MetaException attempting to get partition metadata by filter from Hive. You can set the Spark configuration setting spark.sql.hive.manageFilesourcePartitions to false to work around this problem, however this will result in degraded performance. Please report a bug: https://issues.apache.org/jira/browse/SPARK at org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:828) at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$getPartitionsByFilter$1(HiveClientImpl.scala:745) at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:294) at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:227) at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:226) at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:276) at org.apache.spark.sql.hive.client.HiveClientImpl.getPartitionsByFilter(HiveClientImpl.scala:743) ``` ### Why are the changes needed? Fix bug. 
### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Unit test. Closes #30380 from wangyum/SPARK-27421. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Yuming Wang <yumwang@ebay.com>	2020-11-19 14:01:42 +08:00
yangjie01	e3058ba17c	[SPARK-33441][BUILD] Add unused-imports compilation check and remove all unused-imports ### What changes were proposed in this pull request? This PR adds a new Scala compile argument to `pom.xml` to guard against new unused imports: - `-Ywarn-unused-import` for Scala 2.12 - `-Wconf:cat=unused-imports:e` for Scala 2.13 The other file changes remove all unused imports in Spark code. ### Why are the changes needed? Clean up the code and guard against new unused imports. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Pass the Jenkins or GitHub Action Closes #30351 from LuciferYang/remove-imports-core-module. Authored-by: yangjie01 <yangjie01@baidu.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-11-19 14:20:39 +09:00
Ryan Blue	66a76378cf	[SPARK-31255][SQL][FOLLOWUP] Add missing license headers ### What changes were proposed in this pull request? Add missing license headers for new files added in #28027. ### Why are the changes needed? To fix licenses. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? This is a purely non-functional change. Closes #30415 from rdblue/license-headers. Authored-by: Ryan Blue <blue@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-11-18 19:18:28 -08:00
Liang-Chi Hsieh	e518008ca9	[SPARK-33473][SQL] Extend interpreted subexpression elimination to other interpreted projections ### What changes were proposed in this pull request? Similar to `InterpretedUnsafeProjection`, this patch proposes to extend interpreted subexpression elimination to `InterpretedMutableProjection` and `InterpretedSafeProjection`. ### Why are the changes needed? Enabling subexpression elimination can improve the performance of interpreted projections, as shown in `InterpretedUnsafeProjection`. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit test. Closes #30406 from viirya/SPARK-33473. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-11-18 18:58:06 -08:00
Liang-Chi Hsieh	97d2cee4af	[SPARK-33427][SQL][FOLLOWUP] Prevent test flakyness in SubExprEvaluationRuntimeSuite ### What changes were proposed in this pull request? This followup is to prevent possible test flakyness of `SubExprEvaluationRuntimeSuite`. ### Why are the changes needed? Because HashMap doesn't guarantee the order, in `proxyExpressions` the proxy expression id is not deterministic. So in `SubExprEvaluationRuntimeSuite` we should not test against it. ### Does this PR introduce _any_ user-facing change? No, dev only. ### How was this patch tested? Unit test. Closes #30414 from viirya/SPARK-33427-followup. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>	2020-11-18 18:35:11 -08:00
Gengliang Wang	4267ca98fa	[SPARK-33479][DOC] Make the API Key of DocSearch configurable ### What changes were proposed in this pull request? Make the API key of DocSearch configurable and avoid hardcoding in the HTML template ### Why are the changes needed? After https://github.com/apache/spark/pull/30292, our Spark documentation site supports searching. However, the default API key always points to the latest release doc. We have to set different API keys for different releases. Otherwise, the search results are always based on the latest documentation(https://spark.apache.org/docs/latest/) even when visiting the documentation of previous releases. As per discussion in https://github.com/apache/spark/pull/30292#issuecomment-725613417, we should make the API key configurable and set different values for different releases. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Manual test Closes #30409 from gengliangwang/apiKey. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2020-11-19 11:20:18 +09:00
zero323	56a8510e19	[SPARK-33304][R][SQL] Add from_avro and to_avro functions to SparkR ### What changes were proposed in this pull request? Adds `from_avro` and `to_avro` functions to SparkR. ### Why are the changes needed? Feature parity. ### Does this PR introduce _any_ user-facing change? New functions exposed in SparkR API. ### How was this patch tested? New unit tests. Closes #30216 from zero323/SPARK-33304. Authored-by: zero323 <mszymkiewicz@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-11-19 09:52:29 +09:00
Gengliang Wang	9a4c79073b	[SPARK-33354][SQL] New explicit cast syntax rules in ANSI mode ### What changes were proposed in this pull request? In section 6.13 of the ANSI SQL standard, there are syntax rules for valid combinations of the source and target data types. ![image](https://user-images.githubusercontent.com/1097932/98212874-17356f80-1ef9-11eb-8f2b-385f32db404a.png) Comparing the ANSI CAST syntax rules with the current default behavior of Spark: ![image](https://user-images.githubusercontent.com/1097932/98789831-b7870a80-23b7-11eb-9b5f-469a42e0ee4a.png) To make Spark's ANSI mode more ANSI SQL Compatible，I propose to disallow the following casting in ANSI mode: ``` TimeStamp <=> Boolean Date <=> Boolean Numeric <=> Timestamp Numeric <=> Date Numeric <=> Binary String <=> Array String <=> Map String <=> Struct ``` The following castings are considered invalid in ANSI SQL standard, but they are quite straight forward. Let's Allow them for now ``` Numeric <=> Boolean String <=> Binary ``` ### Why are the changes needed? Better ANSI SQL compliance ### Does this PR introduce _any_ user-facing change? Yes, the following castings will not be allowed in ANSI mode: ``` TimeStamp <=> Boolean Date <=> Boolean Numeric <=> Timestamp Numeric <=> Date Numeric <=> Binary String <=> Array String <=> Map String <=> Struct ``` ### How was this patch tested? Unit test The ANSI Compliance doc preview: ![image](https://user-images.githubusercontent.com/1097932/98946017-2cd20880-24a8-11eb-8161-65749bfdd03a.png) Closes #30260 from gengliangwang/ansiCanCast. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2020-11-19 09:23:36 +09:00
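The cast pairs listed in SPARK-33354 can be read as a symmetric lookup table. A minimal sketch, assuming lowercase type-family names chosen here for illustration; it encodes only the pairs named in the commit message and says nothing about any other combination:

```python
# Pairs the commit message says are disallowed in ANSI mode (both directions).
# The lowercase family names are illustrative labels, not Spark identifiers.
DISALLOWED_IN_ANSI = {
    frozenset(pair)
    for pair in [
        ("timestamp", "boolean"),
        ("date", "boolean"),
        ("numeric", "timestamp"),
        ("numeric", "date"),
        ("numeric", "binary"),
        ("string", "array"),
        ("string", "map"),
        ("string", "struct"),
    ]
}


def ansi_cast_allowed(source: str, target: str) -> bool:
    """Return False only for the pairs listed above; order does not matter."""
    return frozenset((source, target)) not in DISALLOWED_IN_ANSI
```

Under this sketch `ansi_cast_allowed("numeric", "boolean")` and `ansi_cast_allowed("string", "binary")` stay allowed, matching the two exceptions the commit keeps for now.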
HyukjinKwon	fbfc0bf628	[SPARK-33464][INFRA] Add/remove (un)necessary cache and restructure GitHub Actions yaml ### What changes were proposed in this pull request? This PR proposes: - Add `~/.sbt` directory into the build cache, see also https://github.com/sbt/sbt/issues/3681 - Move `hadoop-2` below to put up together with `java-11` and `scala-213`, see https://github.com/apache/spark/pull/30391#discussion_r524881430 - Remove unnecessary `.m2` cache if you run SBT tests only. - Remove `rm ~/.m2/repository/org/apache/spark`. If you don't `sbt publishLocal` or `mvn install`, we don't need to care about it. - Use Java 8 in Scala 2.13 build. We can switch the Java version to 11 used for release later. - Add caches into linters. The linter scripts uses `sbt` in, for example, `./dev/lint-scala`, and uses `mvn` in, for example, `./dev/lint-java`. Also, it requires to `sbt package` in Jekyll build, see: https://github.com/apache/spark/blob/master/docs/_plugins/copy_api_dirs.rb#L160-L161. We need full caches here for SBT, Maven and build tools. - Use the same syntax of Java version, 1.8 -> 8. ### Why are the changes needed? - Remove unnecessary stuff - Cache what we can in the build ### Does this PR introduce _any_ user-facing change? No, dev-only. ### How was this patch tested? It will be tested in GitHub Actions build at the current PR Closes #30391 from HyukjinKwon/SPARK-33464. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-11-18 15:13:43 -08:00
Ryan Blue	1df69f7e32	[SPARK-31255][SQL] Add SupportsMetadataColumns to DSv2 ### What changes were proposed in this pull request? This adds support for metadata columns to DataSourceV2. If a source implements `SupportsMetadataColumns` it must also implement `SupportsPushDownRequiredColumns` to support projecting those columns. The analyzer is updated to resolve metadata columns from `LogicalPlan.metadataOutput`, and this adds a rule that will add metadata columns to the output of `DataSourceV2Relation` if one is used. ### Why are the changes needed? This is the solution discussed for exposing additional data in the Kafka source. It is also needed for a generic `MERGE INTO` plan. ### Does this PR introduce any user-facing change? Yes. Users can project additional columns from sources that implement the new API. This also updates `DescribeTableExec` to show metadata columns. ### How was this patch tested? Will include new unit tests. Closes #28027 from rdblue/add-dsv2-metadata-columns. Authored-by: Ryan Blue <blue@apache.org> Signed-off-by: Burak Yavuz <brkyvz@gmail.com>	2020-11-18 14:07:51 -08:00
Chao Sun	27cd945c15	[SPARK-32381][CORE][SQL][FOLLOWUP] More cleanup on HadoopFSUtils ### What changes were proposed in this pull request? This PR is a follow-up of #29471 and does the following improvements for `HadoopFSUtils`: 1. Removes the extra `filterFun` from the listing API and combines it with the `filter`. 2. Removes `SerializableBlockLocation` and `SerializableFileStatus` given that `BlockLocation` and `FileStatus` are already serializable. 3. Hides the `isRootLevel` flag from the top-level API. ### Why are the changes needed? Main purpose is to simplify the logic within `HadoopFSUtils` as well as cleanup the API. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing unit tests (e.g., `FileIndexSuite`) Closes #29959 from sunchao/hadoop-fs-utils-followup. Authored-by: Chao Sun <sunchao@apple.com> Signed-off-by: Holden Karau <hkarau@apple.com>	2020-11-18 12:39:00 -08:00
Stavros Kontopoulos	dcac78e12b	[SPARK-27936][K8S] Support python deps Supports python client deps from the launcher fs. This is a feature that was added for java deps. This PR adds support for Python as well. yes Manually running different scenarios and via examining the driver & executors logs. Also there is an integration test added. I verified that the python resources are added to the spark file server and they are named properly so they don't fail the executors. Note here that as previously the following will not work: primary resource `A.py`: uses a closure defined in submitted pyfile `B.py`, context.py only adds to the pythonpath files with certain extension eg. zip, egg, jar. Closes #25870 from skonto/python-deps. Lead-authored-by: Stavros Kontopoulos <skontopo@redhat.com> Co-authored-by: Stavros Kontopoulos <st.kontopoulos@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-11-18 10:43:41 -08:00
Dongjoon Hyun	594c7c613a	[SPARK-33476][CORE] Generalize ExecutorSource to expose user-given file system schemes ### What changes were proposed in this pull request? This PR aims to generalize executor metrics to support user-given file system schemes instead of the fixed `file,hdfs` scheme. ### Why are the changes needed? For the users using only cloud storages like `S3A`, we need to be able to expose `S3A` metrics. Also, we can skip unused `hdfs` metrics. ### Does this PR introduce _any_ user-facing change? Yes, but compatible for the existing users which uses `hdfs` and `file` filesystem scheme only. ### How was this patch tested? Manually do the following. ``` $ build/sbt -Phadoop-cloud package $ sbin/start-master.sh; sbin/start-slave.sh spark://$(hostname):7077 $ bin/spark-shell --master spark://$(hostname):7077 -c spark.executor.metrics.fileSystemSchemes=file,s3a -c spark.metrics.conf.executor.sink.jmx.class=org.apache.spark.metrics.sink.JmxSink scala> spark.read.textFile("s3a://dongjoon/README.md").collect() ``` Separately, launch `jconsole` and check `.executor.filesystem.s3a.`. Also, confirm that there is no `.executor.filesystem.hdfs.` ``` $ jconsole ``` ![Screen Shot 2020-11-17 at 9 26 03 PM](https://user-images.githubusercontent.com/9700541/99487609-94121180-291b-11eb-9ed2-964546146981.png) Closes #30405 from dongjoon-hyun/SPARK-33476. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-11-18 08:04:14 -08:00
zhengruifeng	689c294102	[SPARK-32907][ML][PYTHON] Adaptively blockify instances - AFT,LiR,LoR ### What changes were proposed in this pull request? use `maxBlockSizeInMB` instead of `blockSize` (#rows) to control the stacking of vectors; ### Why are the changes needed? the performance gain is mainly related to the nnz of block. ### Does this PR introduce _any_ user-facing change? yes, param blockSize -> blockSizeInMB in master ### How was this patch tested? updated testsuites Closes #30355 from zhengruifeng/adaptively_blockify_aft_lir_lor. Lead-authored-by: zhengruifeng <ruifengz@foxmail.com> Co-authored-by: Ruifeng Zheng <ruifengz@foxmail.com> Signed-off-by: Weichen Xu <weichen.xu@databricks.com>	2020-11-18 23:02:31 +08:00
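The idea in SPARK-32907 of bounding a block by memory size rather than by row count can be sketched as below. This is an illustration only: the function name, the `bytes_per_value` estimate, and the use of plain lists for dense vectors are assumptions, and the actual change reasons about the non-zero count of each block.

```python
def blockify(vectors, max_block_bytes, bytes_per_value=8):
    """Group vectors into blocks whose estimated size stays within a byte budget.

    The size estimate (length * bytes_per_value) is a hypothetical stand-in
    for whatever size accounting the real implementation performs.
    """
    blocks, current, current_bytes = [], [], 0
    for vec in vectors:
        size = len(vec) * bytes_per_value
        # Close the current block when adding this vector would exceed the budget.
        if current and current_bytes + size > max_block_bytes:
            blocks.append(current)
            current, current_bytes = [], 0
        current.append(vec)
        current_bytes += size
    if current:
        blocks.append(current)
    return blocks
```

A vector larger than the budget still gets a block of its own here, so no input is dropped.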
Gengliang Wang	a180e02842	[SPARK-32852][SQL][DOC][FOLLOWUP] Revise the documentation of spark.sql.hive.metastore.jars ### What changes were proposed in this pull request? This is a follow-up for https://github.com/apache/spark/pull/29881. It revises the documentation of the configuration `spark.sql.hive.metastore.jars`. ### Why are the changes needed? Fix grammatical error in the doc. Also, make it more clear that the configuration is effective only when `spark.sql.hive.metastore.jars` is set as `path` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Just doc changes. Closes #30407 from gengliangwang/reviseJarPathDoc. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>	2020-11-18 22:09:40 +08:00
Takeshi Yamamuro	74bd046d17	[SPARK-33475][BUILD] Bump ANTLR runtime version to 4.8-1 ### What changes were proposed in this pull request? This PR intends to upgrade ANTLR runtime from 4.7.1 to 4.8-1. ### Why are the changes needed? Release note of v4.8 and v4.7.2 (the v4.7.2 release has a few minor bug fixes for java targets): - v4.8: https://github.com/antlr/antlr4/releases/tag/4.8 - v4.7.2: https://github.com/antlr/antlr4/releases/tag/4.7.2 ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? GA tests. Closes #30404 from maropu/UpgradeAntlr. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-11-18 21:20:28 +09:00
Bryan Cutler	8e2a0bdce7	[SPARK-24554][PYTHON][SQL] Add MapType support for PySpark with Arrow ### What changes were proposed in this pull request? This change adds MapType support for PySpark with Arrow, if using pyarrow >= 2.0.0. ### Why are the changes needed? MapType was previously unsupported with Arrow. ### Does this PR introduce _any_ user-facing change? User can now enable MapType for `createDataFrame()`, `toPandas()` with Arrow optimization, and with Pandas UDFs. ### How was this patch tested? Added new PySpark tests for createDataFrame(), toPandas() and Scalar Pandas UDFs. Closes #30393 from BryanCutler/arrow-add-MapType-SPARK-24554. Authored-by: Bryan Cutler <cutlerb@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-11-18 21:18:19 +09:00
angerszhu	dd32f45d20	[SPARK-31069][CORE] Avoid repeat compute `chunksBeingTransferred` cause hight cpu cost in external shuffle service when `maxChunksBeingTransferred` use default value ### What changes were proposed in this pull request? Followup from #27831 , origin author chrysan. Each request it will check `chunksBeingTransferred ` ``` public long chunksBeingTransferred() { long sum = 0L; for (StreamState streamState: streams.values()) { sum += streamState.chunksBeingTransferred.get(); } return sum; } ``` such as ``` long chunksBeingTransferred = streamManager.chunksBeingTransferred(); if (chunksBeingTransferred >= maxChunksBeingTransferred) { logger.warn("The number of chunks being transferred {} is above {}, close the connection.", chunksBeingTransferred, maxChunksBeingTransferred); channel.close(); return; } ``` It will traverse `streams` repeatedly and we know that fetch data chunk will access `stream` too, there cause two problem: 1. repeated traverse `streams`, the longer the length, the longer the time 2. lock race in ConcurrentHashMap `streams` In this PR, when `maxChunksBeingTransferred` use default value, we avoid compute `chunksBeingTransferred ` since we don't care about this. If user want to set this configuration and meet performance problem, you can also backport PR #27831 ### Why are the changes needed? Speed up getting `chunksBeingTransferred` and avoid lock race in object `streams` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existed UT Closes #30139 from AngersZhuuuu/SPARK-31069. Lead-authored-by: angerszhu <angers.zhu@gmail.com> Co-authored-by: chrysan <chrysanxia@gmail.com> Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>	2020-11-17 20:52:58 -06:00
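The shortcut described in SPARK-31069 amounts to skipping the traversal whenever the limit is left at its default. A rough sketch in Python rather than the project's Java; the class, the `NO_LIMIT` sentinel standing in for the default value, and the `traversals` counter are all assumptions made for illustration:

```python
import sys

# Hypothetical sentinel standing in for the default "no limit" setting.
NO_LIMIT = sys.maxsize


class StreamManager:
    """Toy stand-in: maps a stream id to its number of chunks in flight."""

    def __init__(self):
        self.streams = {}
        self.traversals = 0  # counts how often the full map is walked

    def chunks_being_transferred(self):
        self.traversals += 1
        return sum(self.streams.values())


def should_close_connection(manager, max_chunks=NO_LIMIT):
    # With the default limit the comparison can never succeed in a meaningful
    # way, so the traversal of all streams is skipped entirely.
    if max_chunks == NO_LIMIT:
        return False
    return manager.chunks_being_transferred() >= max_chunks
```

With the default limit the per-request cost no longer grows with the number of open streams, which is the effect the commit message is after.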
Liang-Chi Hsieh	7f3d99a8a5	[MINOR][SQL][DOCS] Update schema_of_csv and schema_of_json doc ### What changes were proposed in this pull request? This minor PR updates the docs of `schema_of_csv` and `schema_of_json`. They allow foldable string column instead of a string literal now. ### Why are the changes needed? The function doc of `schema_of_csv` and `schema_of_json` are not updated accordingly with previous PRs. ### Does this PR introduce _any_ user-facing change? Yes, update user-facing doc. ### How was this patch tested? Unit test. Closes #30396 from viirya/minor-json-csv. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-11-18 11:32:27 +09:00
Rameshkrishnan Muthusamy	5e8549973d	[SPARK-33471][K8S][BUILD] Upgrade kubernetes-client to 4.12.0 ### What changes were proposed in this pull request? This PR aims to upgrade Kubernetes-client from 4.11.1 to 4.12.0 ### Why are the changes needed? This upgrades the dependency for Apache Spark 3.1.0. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Pass the CIs. Closes #30401 from ramesh-muthusamy/SPARK-33471-k8s-clientupgrade. Authored-by: Rameshkrishnan Muthusamy <rameshkrishnan_muthusamy@apple.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-11-17 13:41:58 -08:00
Prashant Sharma	2a8e253cdb	[SPARK-32222][K8S][TESTS] Add K8s IT for conf propagation ### What changes were proposed in this pull request? Added an integration test that configures a log4j.properties and checks whether it is the one picked up by the driver. ### Why are the changes needed? Improved test coverage. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running integration tests. Closes #30388 from ScrapCodes/SPARK-32222/k8s-it-spark-conf-propagate. Authored-by: Prashant Sharma <prashsh1@in.ibm.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-11-17 08:47:04 -08:00
Liang-Chi Hsieh	928348408e	[SPARK-33427][SQL] Add subexpression elimination for interpreted expression evaluation ### What changes were proposed in this pull request? This patch proposes to add subexpression elimination for interpreted expression evaluation. Interpreted expression evaluation is used when codegen was not able to work, for example complex schema. ### Why are the changes needed? Currently we only do subexpression elimination for codegen. For some reasons, we may need to run interpreted expression evaluation. For example, codegen fails to compile and fallbacks to interpreted mode, or complex input/output schema of expressions. It is commonly seen for complex schema from expressions that is possibly caused by the query optimizer too, e.g. SPARK-32945. We should also support subexpression elimination for interpreted evaluation. That could reduce performance difference when Spark fallbacks from codegen to interpreted expression evaluation, and improve Spark usability. #### Benchmark Update `SubExprEliminationBenchmark`: Before: ``` OpenJDK 64-Bit Server VM 1.8.0_265-b01 on Mac OS X 10.15.6 Intel(R) Core(TM) i7-9750H CPU 2.60GHz from_json as subExpr: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------- subexpressionElimination on, codegen off 24707 25688 903 0.0 247068775.9 1.0X ``` After: ``` OpenJDK 64-Bit Server VM 1.8.0_265-b01 on Mac OS X 10.15.6 Intel(R) Core(TM) i7-9750H CPU 2.60GHz from_json as subExpr: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative ------------------------------------------------------------------------------------------------------------------------- subexpressionElimination on, codegen off 2360 2435 87 0.0 23604320.7 11.2X ``` ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit test. Benchmark manually. 
Closes #30341 from viirya/SPARK-33427. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-11-17 14:29:37 +00:00
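The effect of subexpression elimination in interpreted evaluation (SPARK-33427) can be illustrated with a tiny tree-walking evaluator. This is a simplified sketch, not Spark's implementation: expressions are plain tuples, `parse` is a hypothetical stand-in for an expensive expression such as `from_json`, and the cache memoizes every node rather than only the common ones.

```python
def evaluate(expr, row, calls, cache=None):
    """Evaluate a tuple-encoded expression against one row.

    `calls` counts how often each operator actually runs. `cache`, when given,
    must be a fresh dict per row; structurally equal subexpressions are then
    computed once and reused.
    """
    if cache is not None and expr in cache:
        return cache[expr]
    op = expr[0]
    calls[op] = calls.get(op, 0) + 1
    if op == "col":
        value = row[expr[1]]
    elif op == "parse":
        value = int(evaluate(expr[1], row, calls, cache))
    elif op == "add":
        left = evaluate(expr[1], row, calls, cache)
        right = evaluate(expr[2], row, calls, cache)
        value = left + right
    else:
        raise ValueError(f"unknown operator: {op}")
    if cache is not None:
        cache[expr] = value
    return value
```

For `("add", p, p)` with `p = ("parse", ("col", "s"))`, the uncached walk runs `parse` twice per row while the cached walk runs it once, which is the kind of saving the benchmark in the commit measures.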
Yuming Wang	09bb9bedcd	[SPARK-33416][SQL] Avoid Hive metastore stack overflow when InSet predicate have many values ### What changes were proposed in this pull request? We [rewrite](`5197c5d2e7/sql/hive/src/main/scala/org/apache/spark/sql/hive/client/HiveShim.scala (L722-L724)`) `In`/`InSet` predicates to `or` expressions when pruning Hive partitions. That causes a Hive metastore stack overflow if there are a lot of values. This PR rewrites the `InSet` predicate to `GreaterThanOrEqual` the min value and `LessThanOrEqual` the max value when pruning Hive partitions, to avoid the Hive metastore stack overflow. From our experience, `spark.sql.hive.metastorePartitionPruningInSetThreshold` should be less than 10000. ### Why are the changes needed? Avoid a Hive metastore stack overflow when an `InSet` predicate has many values, especially with DPP, which may generate many values. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Manual test. Closes #30325 from wangyum/SPARK-33416. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-11-17 13:47:01 +00:00
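The rewrite in SPARK-33416 can be sketched as a choice between an `or` chain and a min/max range. A minimal illustration, assuming a hypothetical function name, a made-up filter string syntax, and a strict `>` comparison against the threshold; none of these are taken from the Spark code:

```python
def inset_to_partition_filter(column, values, threshold):
    """Build a partition filter string for an InSet-style predicate.

    Small sets become an `or` chain. Sets larger than `threshold` become a
    range between the minimum and maximum value, which keeps the filter
    expression shallow.
    """
    vals = sorted(set(values))
    if not vals:
        raise ValueError("values must not be empty")
    if len(vals) > threshold:
        return f"({column} >= {vals[0]} and {column} <= {vals[-1]})"
    return "(" + " or ".join(f"{column} = {v}" for v in vals) + ")"
```

The range form matches a superset of the original values, so this sketch assumes the exact predicate is still applied to the partitions that come back.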
HyukjinKwon	e2c7bfce40	[SPARK-33407][PYTHON] Simplify the exception message from Python UDFs (disabled by default) ### What changes were proposed in this pull request? This PR proposes to simplify the exception messages from Python UDFS. Currently, the exception message from Python UDFs is as below: ```python from pyspark.sql.functions import udf; spark.range(10).select(udf(lambda x: x/0)("id")).collect() ``` ```python Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/.../python/pyspark/sql/dataframe.py", line 427, in show print(self._jdf.showString(n, 20, vertical)) File "/.../python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__ File "/.../python/pyspark/sql/utils.py", line 127, in deco raise_from(converted) File "<string>", line 3, in raise_from pyspark.sql.utils.PythonException: An exception was thrown from Python worker in the executor: Traceback (most recent call last): File "/.../python/lib/pyspark.zip/pyspark/worker.py", line 605, in main process() File "/.../python/lib/pyspark.zip/pyspark/worker.py", line 597, in process serializer.dump_stream(out_iter, outfile) File "/.../python/lib/pyspark.zip/pyspark/serializers.py", line 223, in dump_stream self.serializer.dump_stream(self._batched(iterator), stream) File "/.../python/lib/pyspark.zip/pyspark/serializers.py", line 141, in dump_stream for obj in iterator: File "/.../python/lib/pyspark.zip/pyspark/serializers.py", line 212, in _batched for item in iterator: File "/.../python/lib/pyspark.zip/pyspark/worker.py", line 450, in mapper result = tuple(f([a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs) File "/.../python/lib/pyspark.zip/pyspark/worker.py", line 450, in <genexpr> result = tuple(f([a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs) File "/.../python/lib/pyspark.zip/pyspark/worker.py", line 90, in <lambda> return lambda a: f(a) File "/.../python/lib/pyspark.zip/pyspark/util.py", line 107, in wrapper return f(args, *kwargs) File 
"<stdin>", line 1, in <lambda> ZeroDivisionError: division by zero ``` Actually, almost all cases, users only care about `ZeroDivisionError: division by zero`. We don't really have to show the internal stuff in 99% cases. This PR adds a configuration `spark.sql.execution.pyspark.udf.simplifiedException.enabled` (disabled by default) that hides the internal tracebacks related to Python worker, (de)serialization, etc. ```python Traceback (most recent call last): File "<stdin>", line 1, in <module> File "/.../python/pyspark/sql/dataframe.py", line 427, in show print(self._jdf.showString(n, 20, vertical)) File "/.../python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__ File "/.../python/pyspark/sql/utils.py", line 127, in deco raise_from(converted) File "<string>", line 3, in raise_from pyspark.sql.utils.PythonException: An exception was thrown from Python worker in the executor: Traceback (most recent call last): File "<stdin>", line 1, in <lambda> ZeroDivisionError: division by zero ``` The trackback will be shown from the point when any non-PySpark file is seen in the traceback. ### Why are the changes needed? Without this configuration. such internal tracebacks are exposed to users directly especially for shall or notebook users in PySpark. 99% cases people don't care about the internal Python worker, (de)serialization and related tracebacks. It just makes the exception more difficult to read. For example, one statement of `x/0` above shows a very long traceback and most of them are unnecessary. This configuration enables the ability to show simplified tracebacks which users will likely be most interested in. ### Does this PR introduce _any_ user-facing change? By default, no. It adds one configuration that simplifies the exception message. See the example above. ### How was this patch tested? 
Manually tested: ```bash $ pyspark --conf spark.sql.execution.pyspark.udf.simplifiedException.enabled=true ``` ```python from pyspark.sql.functions import udf; spark.sparkContext.setLogLevel("FATAL"); spark.range(10).select(udf(lambda x: x/0)("id")).collect() ``` and unittests were also added. Closes #30309 from HyukjinKwon/SPARK-33407. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-11-17 14:15:31 +09:00
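The rule stated in SPARK-33407, showing the traceback from the first non-PySpark file onward, can be sketched over a list of frames. The frame tuples, the function name, and the `is_internal` predicate are hypothetical; the fallback of returning everything when no user frame exists is also an assumption.

```python
def simplify_frames(frames, is_internal):
    """Drop leading internal frames, keeping everything from the first user frame.

    `frames` is a list of (filename, line, function) tuples, outermost first.
    """
    for i, (filename, _line, _func) in enumerate(frames):
        if not is_internal(filename):
            return frames[i:]
    # No user frame found: keep the full traceback rather than hide everything.
    return frames
```

Applied to the long traceback in the commit message with a predicate such as `lambda f: "/pyspark/" in f`, only the `<stdin>` frame and the `ZeroDivisionError` would remain.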
Cheng Su	5af5aa146e	[SPARK-33209][SS] Refactor unit test of stream-stream join in UnsupportedOperationsSuite ### What changes were proposed in this pull request? This PR is a followup from https://github.com/apache/spark/pull/30076 to refactor unit test of stream-stream join in `UnsupportedOperationsSuite`, where we had a lot of duplicated code for stream-stream join unit test, for each join type. ### Why are the changes needed? Help reduce duplicated code and make it easier for developers to read and add code in the future. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing unit test in `UnsupportedOperationsSuite.scala` (pure refactoring). Closes #30347 from c21/stream-test. Authored-by: Cheng Su <chengsu@fb.com> Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>	2020-11-17 11:18:42 +09:00
Prakhar Jain	f5e3302840	[SPARK-33399][SQL] Normalize output partitioning and sortorder with respect to aliases to avoid unneeded exchange/sort nodes ### What changes were proposed in this pull request? This pull request tries to remove unneeded exchanges/sorts by normalizing the output partitioning and sortorder information correctly with respect to aliases. Example: consider this join of three tables: \|SELECT t2id, t3.id as t3id \|FROM ( \| SELECT t1.id as t1id, t2.id as t2id \| FROM t1, t2 \| WHERE t1.id = t2.id \|) t12, t3 \|WHERE t1id = t3.id The plan for this looks like: (9) Project [t2id#1034L, id#1004L AS t3id#1035L] +- (9) SortMergeJoin [t1id#1033L], [id#1004L], Inner :- (6) Sort [t1id#1033L ASC NULLS FIRST], false, 0 : +- Exchange hashpartitioning(t1id#1033L, 5), true, [id=#1343] <------------------------------ : +- (5) Project [id#996L AS t1id#1033L, id#1000L AS t2id#1034L] : +- (5) SortMergeJoin [id#996L], [id#1000L], Inner : :- (2) Sort [id#996L ASC NULLS FIRST], false, 0 : : +- Exchange hashpartitioning(id#996L, 5), true, [id=#1329] : : +- (1) Range (0, 10, step=1, splits=2) : +- (4) Sort [id#1000L ASC NULLS FIRST], false, 0 : +- Exchange hashpartitioning(id#1000L, 5), true, [id=#1335] : +- (3) Range (0, 20, step=1, splits=2) +- (8) Sort [id#1004L ASC NULLS FIRST], false, 0 +- Exchange hashpartitioning(id#1004L, 5), true, [id=#1349] +- *(7) Range (0, 30, step=1, splits=2) In this plan, the marked exchange could have been avoided as the data is already partitioned on "t1.id". This happens because AliasAwareOutputPartitioning class handles aliases only related to HashPartitioning. This change normalizes all output partitioning based on aliasing happening in Project. ### Why are the changes needed? To remove unneeded exchanges. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? New UT added. 
On TPCDS 1000 scale, this change improves the performance of query 95 from 330 seconds to 170 seconds by removing the extra Exchange. Closes #30300 from prakharjain09/SPARK-33399-outputpartitioning. Authored-by: Prakhar Jain <prakharjain09@gmail.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2020-11-17 10:35:43 +09:00
Pascal Gillet	9ab0f82a59	[SPARK-23499][MESOS] Support for priority queues in Mesos scheduler ### What changes were proposed in this pull request? I am pushing this PR because I could not re-open the stale one, https://github.com/apache/spark/pull/20665 . As with YARN or Kubernetes, Mesos users should be able to specify priority queues to define a workload management policy for queued drivers in the Mesos Cluster Dispatcher. This would ensure scheduling order while enqueuing Spark applications for a Mesos cluster. ### Why are the changes needed? Currently, submitted drivers are kept in order of their submission: the first driver added to the queue will be the first one to be executed (FIFO), regardless of their priority. See https://issues.apache.org/jira/projects/SPARK/issues/SPARK-23499 for more details. ### Does this PR introduce _any_ user-facing change? The MesosClusterDispatcher UI now shows Spark jobs along with the queue to which they are submitted. ### How was this patch tested? Unit tests. Also, this feature has been in production for 3 years now, as we have used a modified Spark 2.4.0 since then. Closes #30352 from pgillet/mesos-scheduler-priority-queue. Lead-authored-by: Pascal Gillet <pascal.gillet@stack-labs.com> Co-authored-by: pgillet <pascalgillet@ymail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-11-16 16:54:08 -08:00
xuewei.linxuewei	b5eca18af0	[SPARK-33460][SQL] Accessing map values should fail if key is not found ### What changes were proposed in this pull request? Instead of returning NULL, throw a runtime NoSuchElementException on invalid key access in map-like functions, such as element_at and GetMapValue, when ANSI mode is on. ### Why are the changes needed? For ANSI mode. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Added UT and existing UT. Closes #30386 from leanken/leanken-SPARK-33460. Authored-by: xuewei.linxuewei <xuewei.linxuewei@alibaba-inc.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-11-16 16:14:31 +00:00
Max Gekk	6883f29465	[SPARK-33453][SQL][TESTS] Unify v1 and v2 SHOW PARTITIONS tests ### What changes were proposed in this pull request? 1. Move `SHOW PARTITIONS` parsing tests to `ShowPartitionsParserSuite` 2. Move Hive tests for `SHOW PARTITIONS` from `HiveCommandSuite` to the base test suite `v1.ShowPartitionsSuiteBase`. This will allow running the tests with and without Hive. The changes follow the approach of https://github.com/apache/spark/pull/30287. ### Why are the changes needed? - The unification will allow running common `SHOW PARTITIONS` tests for both DSv1 and Hive DSv1, DSv2 - We can detect missing features and differences between DSv1 and DSv2 implementations. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running: - new test suites `build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *ShowPartitionsSuite"` - and old one `build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly org.apache.spark.sql.hive.execution.HiveCommandSuite"` Closes #30377 from MaxGekk/unify-dsv1_v2-show-partitions-tests. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-11-16 16:11:42 +00:00
luluorta	dfa6fb46f4	[SPARK-33389][SQL] Make internal classes of SparkSession always using active SQLConf ### What changes were proposed in this pull request? This PR makes internal classes of SparkSession always use the active SQLConf. We should remove all `conf: SQLConf`s from the ctor-parameters of these classes (`Analyzer`, `SparkPlanner`, `SessionCatalog`, `CatalogManager`, `SparkSqlParser`, etc.) and use `SQLConf.get` instead. ### Why are the changes needed? Code refinement. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing tests. Closes #30299 from luluorta/SPARK-33389. Authored-by: luluorta <luluorta@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-11-16 15:27:18 +00:00
xuewei.linxuewei	aa508fcc03	[SPARK-33140][SQL][FOLLOW-UP] Revert code that does not use passed-in SparkSession to get SQLConf ### What changes were proposed in this pull request? Revert code that does not use the passed-in SparkSession to get SQLConf in [SPARK-33140]. The scope of [SPARK-33140] was to unify the places that used a passed-in SQLConf instance, or used a SparkSession to get SQLConf, so that they use SQLConf.get. In the code reverted by this patch, the passed-in SparkSession was not used to get SQLConf but to access its catalog, so it is better to stay consistent. ### Why are the changes needed? Potential regression bug. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Existing UT. Closes #30364 from leanken/leanken-SPARK-33140. Authored-by: xuewei.linxuewei <xuewei.linxuewei@alibaba-inc.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-11-16 11:57:50 +00:00
Max Gekk	71a29b2eca	[MINOR][SQL][DOCS] Fix a reference to `spark.sql.sources.useV1SourceList` ### What changes were proposed in this pull request? Replace `spark.sql.sources.write.useV1SourceList` by `spark.sql.sources.useV1SourceList` in the comment for `CatalogManager.v2SessionCatalog()`. ### Why are the changes needed? To have correct comments. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By running `./dev/scalastyle`. Closes #30385 from MaxGekk/fix-comment-useV1SourceList. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-11-16 17:57:20 +09:00
Liang-Chi Hsieh	10b011f837	[SPARK-33456][SQL][TEST][FOLLOWUP] Fix SUBEXPRESSION_ELIMINATION_ENABLED config name ### What changes were proposed in this pull request? To fix wrong config name in `subexp-elimination.sql`. ### Why are the changes needed? `CONFIG_DIM` should use config name's key. ### Does this PR introduce _any_ user-facing change? No, dev only. ### How was this patch tested? Unit test. Closes #30384 from viirya/SPARK-33456-followup. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-11-16 17:53:31 +09:00
Prashant Sharma	8615f354a4	[SPARK-30985][K8S] Support propagating SPARK_CONF_DIR files to driver and executor pods ### What changes were proposed in this pull request? This is an improvement: we mount all the user-specific configuration files (except the templates and Spark properties files) from `SPARK_CONF_DIR` at the point of spark-submit to both executor and driver pods. Currently, only `spark.properties` is mounted, and only on the driver. ### Why are the changes needed? `SPARK_CONF_DIR` hosts several configuration files, for example, 1) `spark-defaults.conf` - containing all the spark properties. 2) `log4j.properties` - Logger configuration. 3) `core-site.xml` - Hadoop related configuration. 4) `fairscheduler.xml` - Spark's fair scheduling policy at the job level. 5) `metrics.properties` - Spark metrics. 6) Any user specific - library or framework specific configuration file. At the moment, we cannot propagate these files to the driver and executor configuration directory. There is a design doc with more details, and this patch currently provides a reference implementation. Please take a look at the doc and comment on how we can improve. [google docs link to the doc](https://bit.ly/spark-30985) ### Further scope Support user defined configMaps. ### Does this PR introduce any user-facing change? Yes, previously the user configuration files (e.g. hdfs-site.xml, log4j.properties etc.) were not propagated by default; after this patch they are propagated to driver and executor pods' `SPARK_CONF_DIR`. ### How was this patch tested? Added tests. Also manually tested by deploying it to a minikube cluster and observing that the additional configuration files were present and taking effect. For example, changes to log4j.properties were properly applied to executors. Closes #27735 from ScrapCodes/SPARK-30985/spark-conf-k8s-propagate. Authored-by: Prashant Sharma <prashsh1@in.ibm.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-11-16 00:02:18 -08:00
Yuming Wang	cdcbdaeb0d	[SPARK-33458][SQL] Hive partition pruning support Contains, StartsWith and EndsWith predicate ### What changes were proposed in this pull request? This PR adds support for Hive partition pruning on `Contains`, `StartsWith` and `EndsWith` predicates. ### Why are the changes needed? Improve query performance. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Unit test. Closes #30383 from wangyum/SPARK-33458. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-11-16 07:18:13 +00:00
Max Gekk	4e5d2e0695	[SPARK-33394][SQL][TESTS] Throw `NoSuchNamespaceException` for not existing namespace in `InMemoryTableCatalog.listTables()` ### What changes were proposed in this pull request? Throw `NoSuchNamespaceException` in `listTables()` of the custom test catalog `InMemoryTableCatalog` if the passed namespace doesn't exist. ### Why are the changes needed? 1. To align behavior of V2 `InMemoryTableCatalog` to V1 session catalog. 2. To distinguish two situations: 1. A namespace does exist but does not contain any tables. In that case, `listTables()` returns empty result. 2. A namespace does not exist. `listTables()` throws `NoSuchNamespaceException` in this case. ### Does this PR introduce _any_ user-facing change? Yes. For example, `SHOW TABLES` returns empty result before the changes. ### How was this patch tested? By running V1/V2 ShowTablesSuites. Closes #30358 from MaxGekk/show-tables-in-not-existing-namespace. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-11-16 07:08:21 +00:00
Liang-Chi Hsieh	d4cf1483fd	[SPARK-33456][SQL][TEST] Add end-to-end test for subexpression elimination ### What changes were proposed in this pull request? This patch proposes to add end-to-end test for subexpression elimination. ### Why are the changes needed? We have subexpression elimination feature for expression evaluation but we don't have end-to-end tests for the feature. We should have one to make sure we don't break it. ### Does this PR introduce _any_ user-facing change? No, dev only. ### How was this patch tested? Unit tests. Closes #30381 from viirya/SPARK-33456. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-11-16 15:47:35 +09:00
Dongjoon Hyun	10105b555d	[SPARK-33454][INFRA] Add GitHub Action job for Hadoop 2 ### What changes were proposed in this pull request? This PR aims to protect `Hadoop 2.x` profile compilation in Apache Spark 3.1+. ### Why are the changes needed? Since Apache Spark 3.1+ switches the default profile to Hadoop 3, we had better at least prevent compilation errors with the `Hadoop 2.x` profile at the PR review phase. Although this is an additional workload, it will finish quickly because it's compilation only. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the GitHub Action. - This should be merged after https://github.com/apache/spark/pull/30375 . Closes #30378 from dongjoon-hyun/SPARK-33454. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-11-16 15:06:51 +09:00
Yuming Wang	f660946ef2	[SPARK-33288][YARN][FOLLOW-UP][TEST-HADOOP2.7] Fix type mismatch error ### What changes were proposed in this pull request? This PR fixes a type mismatch error: ``` [error] /home/jenkins/workspace/spark-master-test-sbt-hadoop-2.7-hive-2.3/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocator.scala:320:52: type mismatch; [error] found : Long [error] required: Int [error] Resource.newInstance(resourcesWithDefaults.totalMemMiB, resourcesWithDefaults.cores) [error] ^ [error] one error found ``` ### Why are the changes needed? Fix compile issue. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing test. Closes #30375 from wangyum/SPARK-33288. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Yuming Wang <yumwang@ebay.com>	2020-11-16 11:28:52 +08:00
itholic	236c6c9f7c	[SPARK-33253][PYTHON][DOCS] Migration to NumPy documentation style in Streaming (pyspark.streaming.*) ### What changes were proposed in this pull request? This PR proposes to migrate to [NumPy documentation style](https://numpydoc.readthedocs.io/en/latest/format.html), see also [SPARK-33243](https://issues.apache.org/jira/browse/SPARK-33243). ### Why are the changes needed? For better documentation as text itself, and generated HTMLs ### Does this PR introduce _any_ user-facing change? Yes, they will see a better format of HTMLs, and better text format. See [SPARK-33243](https://issues.apache.org/jira/browse/SPARK-33243). ### How was this patch tested? Manually tested via running ./dev/lint-python. Closes #30346 from itholic/SPARK-32085. Lead-authored-by: itholic <haejoon309@naver.com> Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-11-16 10:44:57 +09:00
aof00	0933f1c6c2	[SPARK-33451][DOCS] Change to 'spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes' in documentation ### What changes were proposed in this pull request? In the 'Optimizing Skew Join' section of the following two pages: 1. [https://spark.apache.org/docs/3.0.0/sql-performance-tuning.html](https://spark.apache.org/docs/3.0.0/sql-performance-tuning.html) 2. [https://spark.apache.org/docs/3.0.1/sql-performance-tuning.html](https://spark.apache.org/docs/3.0.1/sql-performance-tuning.html) The configuration 'spark.sql.adaptive.skewedPartitionThresholdInBytes' should be changed to 'spark.sql.adaptive.skewJoin.skewedPartitionThresholdInBytes'. The former is missing 'skewJoin'. ### Why are the changes needed? To document the correct name of the configuration. ### Does this PR introduce _any_ user-facing change? Yes, this is a user-facing doc change. ### How was this patch tested? Jenkins / CI builds in this PR. Closes #30376 from aof00/doc_change. Authored-by: aof00 <x14562573449@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-11-16 10:32:00 +09:00
zero323	52073ef8ac	[SPARK-33254][PYTHON][DOCS] Migration to NumPy documentation style in Core (pyspark., pyspark.resource., etc.) ### What changes were proposed in this pull request? This PR proposes migration of Core to NumPy documentation style. ### Why are the changes needed? To improve documentation style. ### Does this PR introduce _any_ user-facing change? Yes, this changes both rendered HTML docs and console representation (SPARK-33243). ### How was this patch tested? dev/lint-python and manual inspection. Closes #30320 from zero323/SPARK-33254. Authored-by: zero323 <mszymkiewicz@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-11-16 10:21:50 +09:00
artiship	1ae6d64b5f	[SPARK-33358][SQL] Return code when command process failed Exit the Spark SQL CLI processing loop if one of the commands (sub SQL statement) fails to process. This is a regression in Apache Spark 3.0.0. ``` $ cat 1.sql select * from nonexistent_table; select 2; ``` Apache Spark 2.4.7 ``` spark-2.4.7-bin-hadoop2.7:$ bin/spark-sql -f 1.sql 20/11/15 16:14:38 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable Error in query: Table or view not found: nonexistent_table; line 1 pos 14 ``` Apache Spark 3.0.1 ``` $ bin/spark-sql -f 1.sql Error in query: Table or view not found: nonexistent_table; line 1 pos 14; 'Project [] +- 'UnresolvedRelation [nonexistent_table] 2 Time taken: 2.786 seconds, Fetched 1 row(s) ``` Apache Hive 1.2.2 ``` apache-hive-1.2.2-bin:$ bin/hive -f 1.sql Logging initialized using configuration in jar:file:/Users/dongjoon/APACHE/hive-release/apache-hive-1.2.2-bin/lib/hive-common-1.2.2.jar!/hive-log4j.properties FAILED: SemanticException [Error 10001]: Line 1:14 Table not found 'nonexistent_table' ``` Yes. This is a fix for a regression. Pass the UT. Closes #30263 from artiship/SPARK-33358. Authored-by: artiship <meilziner@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-11-15 16:57:12 -08:00
Liang-Chi Hsieh	eea846b895	[SPARK-33455][SQL][TEST] Add SubExprEliminationBenchmark for benchmarking subexpression elimination ### What changes were proposed in this pull request? This patch adds a benchmark `SubExprEliminationBenchmark` for benchmarking subexpression elimination feature. ### Why are the changes needed? We need a benchmark for subexpression elimination feature for change such as #30341. ### Does this PR introduce _any_ user-facing change? No, dev only. ### How was this patch tested? Unit test. Closes #30379 from viirya/SPARK-33455. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-11-14 19:02:36 -08:00
luluorta	156704ba0d	[SPARK-33432][SQL] SQL parser should use active SQLConf ### What changes were proposed in this pull request? This PR makes the SQL parser use the active SQLConf instead of the one in ctor-parameters. ### Why are the changes needed? In ANSI mode, schema string parsing should fail if the schema uses an ANSI reserved keyword as an attribute name: ```scala spark.conf.set("spark.sql.ansi.enabled", "true") spark.sql("""select from_json('{"time":"26/10/2015"}', 'time Timestamp', map('timestampFormat', 'dd/MM/yyyy'));""").show ``` output: > Cannot parse the data type: > no viable alternative at input 'time'(line 1, pos 0) > > == SQL == > time Timestamp > ^^^ But this query may accidentally succeed in certain cases because the DataType parser sticks to the configs of the first created session in the current thread: ```scala DataType.fromDDL("time Timestamp") val newSpark = spark.newSession() newSpark.conf.set("spark.sql.ansi.enabled", "true") newSpark.sql("""select from_json('{"time":"26/10/2015"}', 'time Timestamp', map('timestampFormat', 'dd/MM/yyyy'));""").show ``` output: > +--------------------------------+ > \|from_json({"time":"26/10/2015"})\| > +--------------------------------+ > \| {2015-10-26 00:00...\| > +--------------------------------+ ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? New and updated UTs. Closes #30357 from luluorta/SPARK-33432. Authored-by: luluorta <luluorta@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-11-14 13:37:12 -08:00
artiship	34a9a77ab5	[SPARK-33396][SQL] Spark SQL CLI prints application id when process file ### What changes were proposed in this pull request? Modify SparkSQLCLIDriver.scala to call the cli.printMasterAndAppId method earlier, before processing the file. ### Why are the changes needed? SPARK-25043 already introduced printing of the application id, but the file-processing case seems not to have been covered. This small change makes spark-sql also print the application id when processing a file. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? env ``` spark version: 3.0.1 os: centos 7 ``` /tmp/tmp.sql ```sql select 1; ``` submit command: ```sh export HADOOP_USER_NAME=my-hadoop-user bin/spark-sql \ --master yarn \ --deploy-mode client \ --queue my.queue.name \ --conf spark.driver.host=$(hostname -i) \ --conf spark.app.name=spark-test \ --name "spark-test" \ -f /tmp/tmp.sql ``` execution log: ```sh 20/11/09 23:18:39 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable 20/11/09 23:18:40 WARN HiveConf: HiveConf of name hive.spark.client.rpc.server.address.use.ip does not exist 20/11/09 23:18:40 WARN HiveConf: HiveConf of name hive.spark.client.submit.timeout.interval does not exist 20/11/09 23:18:40 WARN HiveConf: HiveConf of name hive.enforce.bucketing does not exist 20/11/09 23:18:40 WARN HiveConf: HiveConf of name hive.server2.enable.impersonation does not exist 20/11/09 23:18:40 WARN HiveConf: HiveConf of name hive.run.timeout.seconds does not exist 20/11/09 23:18:40 WARN HiveConf: HiveConf of name hive.support.sql11.reserved.keywords does not exist 20/11/09 23:18:40 WARN DomainSocketFactory: The short-circuit local reads feature cannot be used because libhadoop cannot be loaded. 
20/11/09 23:18:41 WARN SparkConf: Note that spark.local.dir will be overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in mesos/standalone/kubernetes and LOCAL_DIRS in YARN). 20/11/09 23:18:42 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME. 20/11/09 23:18:52 WARN YarnSchedulerBackend$YarnSchedulerEndpoint: Attempted to request executors before the AM has registered! Spark master: yarn, Application Id: application_1567136266901_27355775 1 1 Time taken: 4.974 seconds, Fetched 1 row(s) ``` Closes #30301 from artiship/SPARK-33396. Authored-by: artiship <meilziner@gmail.com> Signed-off-by: Yuming Wang <yumwang@ebay.com>	2020-11-14 20:54:17 +08:00
Liang-Chi Hsieh	0046222a75	[SPARK-33337][SQL][FOLLOWUP] Prevent possible flakiness in SubexpressionEliminationSuite ### What changes were proposed in this pull request? This is a simple followup to prevent test flakiness in SubexpressionEliminationSuite. If `getAllEquivalentExprs` returns more than one sequence, then because it is backed by a HashMap, we should use `contains` instead of assuming the order of results. ### Why are the changes needed? Prevent test flakiness in SubexpressionEliminationSuite. ### Does this PR introduce _any_ user-facing change? No, dev only. ### How was this patch tested? Unit test. Closes #30371 from viirya/SPARK-33337-followup. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-11-13 15:10:02 -08:00
Chandni Singh	423ba5a160	[SPARK-32916][SHUFFLE][TEST-MAVEN][TEST-HADOOP2.7] Remove the newly added YarnShuffleServiceSuite.java ### What changes were proposed in this pull request? This is a follow-up fix for the failing tests in `YarnShuffleServiceSuite.java`. This java class was introduced in https://github.com/apache/spark/pull/30062. The tests in the class fail when run with hadoop-2.7 profile: ``` [ERROR] testCreateDefaultMergedShuffleFileManagerInstance(org.apache.spark.network.yarn.YarnShuffleServiceSuite) Time elapsed: 0.627 s <<< ERROR! java.lang.NoClassDefFoundError: org/apache/commons/logging/LogFactory at org.apache.spark.network.yarn.YarnShuffleServiceSuite.testCreateDefaultMergedShuffleFileManagerInstance(YarnShuffleServiceSuite.java:37) Caused by: java.lang.ClassNotFoundException: org.apache.commons.logging.LogFactory at org.apache.spark.network.yarn.YarnShuffleServiceSuite.testCreateDefaultMergedShuffleFileManagerInstance(YarnShuffleServiceSuite.java:37) [ERROR] testCreateRemoteBlockPushResolverInstance(org.apache.spark.network.yarn.YarnShuffleServiceSuite) Time elapsed: 0 s <<< ERROR! java.lang.NoClassDefFoundError: Could not initialize class org.apache.spark.network.yarn.YarnShuffleService at org.apache.spark.network.yarn.YarnShuffleServiceSuite.testCreateRemoteBlockPushResolverInstance(YarnShuffleServiceSuite.java:47) [ERROR] testInvalidClassNameOfMergeManagerWillUseNoOpInstance(org.apache.spark.network.yarn.YarnShuffleServiceSuite) Time elapsed: 0.001 s <<< ERROR! 
java.lang.NoClassDefFoundError: Could not initialize class org.apache.spark.network.yarn.YarnShuffleService at org.apache.spark.network.yarn.YarnShuffleServiceSuite.testInvalidClassNameOfMergeManagerWillUseNoOpInstance(YarnShuffleServiceSuite.java:57) ``` A test suite for `YarnShuffleService` already existed here: `resource-managers/yarn/src/test/scala/org/apache/spark/network/yarn/YarnShuffleServiceSuite.scala` I missed this when I created `common/network-yarn/src/test/java/org/apache/spark/network/yarn/YarnShuffleServiceSuite.java`. Moving all the new tests to the earlier test suite fixes the failures with hadoop-2.7, although it is not clear why they happened. ### Why are the changes needed? The newly added tests are failing when run with hadoop profile 2.7. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Ran the unit tests with the default profile as well as hadoop 2.7 profile. `build/mvn test -Dtest=none -DwildcardSuites=org.apache.spark.network.yarn.YarnShuffleServiceSuite -Phadoop-2.7 -Pyarn` ``` Run starting. Expected test count is: 11 YarnShuffleServiceSuite: - executor state kept across NM restart - removed applications should not be in registered executor file - shuffle service should be robust to corrupt registered executor file - get correct recovery path - moving recovery file from NM local dir to recovery path - service throws error if cannot start - recovery db should not be created if NM recovery is not enabled - SPARK-31646: metrics should be registered into Node Manager's metrics system - create default merged shuffle file manager instance - create remote block push resolver instance - invalid class name of merge manager will use noop instance Run completed in 2 seconds, 572 milliseconds. Total number of tests run: 11 Suites: completed 2, aborted 0 Tests: succeeded 11, failed 0, canceled 0, ignored 0, pending 0 All tests passed. ``` Closes #30349 from otterc/SPARK-32916-followup. 
Authored-by: Chandni Singh <singh.chandni@gmail.com> Signed-off-by: Thomas Graves <tgraves@apache.org>	2020-11-13 16:16:23 -06:00
Thomas Graves	acfd846753	[SPARK-33288][SPARK-32661][K8S] Stage level scheduling support for Kubernetes ### What changes were proposed in this pull request? This adds support for stage level scheduling to Kubernetes. Kubernetes can support dynamic allocation via the shuffle tracking option, which means we can support stage level scheduling by getting new executors. The main changes here are having the k8s cluster manager pass the resource profile id into the executors, and having the ExecutorsPodsAllocator request executors based on the individual resource profiles. I tried to keep code changes here to a minimum. I specifically chose to leave the ExecutorPodsSnapshot the way it was and construct the resource profile to pod states on the fly, with a fast path when not using other resource profiles, to keep the impact to a minimum. As a result, the main change required is just wrapping the allocation logic in a for loop over each profile. The other main change is in the basic feature step, where we have to look at the resources in the ResourceProfile to request pods with the correct resources. Much of the other logic, like in the executor life cycle manager, doesn't need to be resource profile aware. This also adds support for [SPARK-32661] Spark executors on K8S should request extra memory for off-heap allocations, because the stage level scheduling API has support for this and it made sense to make it consistent with YARN. This was started with PR https://github.com/apache/spark/pull/29477 but never updated, so I just did it here. To do this I moved a few functions around that were now used by both YARN and Kubernetes, so you will see some changes in Utils. ### Why are the changes needed? Add the feature to Kubernetes based on customer feedback. ### Does this PR introduce _any_ user-facing change? Yes, the feature now works with K8s, but there are no underlying API changes. ### How was this patch tested? Tested manually on a Kubernetes cluster and with unit tests. 
Closes #30204 from tgravescs/stagek8sOrigSnapshotsRebase. Lead-authored-by: Thomas Graves <tgraves@apache.org> Co-authored-by: Thomas Graves <tgraves@nvidia.com> Signed-off-by: Thomas Graves <tgraves@apache.org>	2020-11-13 16:04:13 -06:00

28541 commits