### What changes were proposed in this pull request?
- [x] Expand dictionary definitions into standalone functions.
- [x] Fix annotations for ordering functions.
### Why are the changes needed?
To simplify further maintenance of docstrings.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes #30143 from zero323/SPARK-32084.
Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR proposes to document the APIs in `Column` as well in API reference of PySpark documentation.
### Why are the changes needed?
To document common APIs in PySpark.
### Does this PR introduce _any_ user-facing change?
Yes, `Column.*` will be shown in API reference page.
### How was this patch tested?
Manually tested via `cd python` and `make clean html`.
Closes #30150 from HyukjinKwon/SPARK-32188.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
In the current Spark script transformation with Hive serde mode, the schema-less case produces a result different from Hive's.
This PR keeps the result consistent with Hive's script transform with serde.
#### Hive script transform with serde in schema-less mode
```
hive> create table t (c0 int, c1 int, c2 int);
hive> INSERT INTO t VALUES (1, 1, 1);
hive> INSERT INTO t VALUES (2, 2, 2);
hive> CREATE VIEW v AS SELECT TRANSFORM(c0, c1, c2) USING 'cat' FROM t;
hive> DESCRIBE v;
key string
value string
hive> SELECT * FROM v;
1 1 1
2 2 2
hive> SELECT key FROM v;
1
2
hive> SELECT value FROM v;
1 1
2 2
```
#### Spark script transform with Hive serde in schema-less mode
```
hive> create table t (c0 int, c1 int, c2 int);
hive> INSERT INTO t VALUES (1, 1, 1);
hive> INSERT INTO t VALUES (2, 2, 2);
hive> CREATE VIEW v AS SELECT TRANSFORM(c0, c1, c2) USING 'cat' FROM t;
hive> SELECT * FROM v;
1 1
2 2
```
**No-serde mode in Hive (ROW FORMAT DELIMITED)**
![image](https://user-images.githubusercontent.com/46485123/90088770-55841e00-dd52-11ea-92dd-7fe52d93f0b3.png)
### Why are the changes needed?
Keep the same behavior as Hive script transform.
### Does this PR introduce _any_ user-facing change?
Before this PR, with Hive serde script transform:
```
select transform(*)
USING 'cat'
from (
select 1, 2, 3, 4
) tmp
key value
1 2
```
After
```
select transform(*)
USING 'cat'
from (
select 1, 2, 3, 4
) tmp
key value
1 2 3 4
```
### How was this patch tested?
UT
Closes #29421 from AngersZhuuuu/SPARK-32388.
Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR aims to use `hadoop-3.2` profile in K8s IT Jenkins jobs.
- [x] Switch the default value of `HADOOP_PROFILE` from `hadoop-2.7` to `hadoop-3.2`.
- [x] Remove `-Phadoop2.7` from Jenkins K8s IT job.
- https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/configure
**BEFORE**
```
./dev/make-distribution.sh --name ${DATE}-${REVISION} --r --pip --tgz -DzincPort=${ZINC_PORT} \
-Phadoop-2.7 -Pkubernetes -Pkinesis-asl -Phive -Phive-thriftserver
```
**AFTER**
```
./dev/make-distribution.sh --name ${DATE}-${REVISION} --r --pip --tgz -DzincPort=${ZINC_PORT} \
-Pkubernetes -Pkinesis-asl -Phive -Phive-thriftserver
```
### Why are the changes needed?
Since Apache Spark 3.1.0, Hadoop 3 is the default.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Check the Jenkins K8s IT log and result.
- https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34899/
```
+ /home/jenkins/workspace/SparkPullRequestBuilder-K8s/build/mvn clean package -DskipTests -DzincPort=4021 -Pkubernetes -Pkinesis-asl -Phive -Phive-thriftserver
Using `mvn` from path: /home/jenkins/tools/hudson.tasks.Maven_MavenInstallation/Maven_3.6.3/bin/mvn
[INFO] Scanning for projects...
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Build Order:
[INFO]
```
Closes #30153 from dongjoon-hyun/SPARK-33237.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This reinstates the old option `spark.sql.sources.write.jobUUID` to set a unique job ID in the jobconf so that Hadoop MR committers have a unique ID which is (a) consistent across tasks and workers and (b) not brittle compared to generated-timestamp job IDs. The latter matches what `JobID` requires, but since those IDs are generated per thread, they may not always be unique within a cluster.
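A minimal sketch of how a committer might consume the reinstated property; the key name is taken from the description above, while the helper itself is hypothetical and not part of Spark:
```scala
import org.apache.hadoop.mapreduce.JobContext

// Hypothetical helper: read the job-level UUID that Spark sets in the jobconf,
// falling back to the (timestamp-derived, less unique) job ID when it is absent.
def stagingJobId(context: JobContext): String =
  Option(context.getConfiguration.get("spark.sql.sources.write.jobUUID"))
    .getOrElse(context.getJobID.toString)
```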
### Why are the changes needed?
If a committer (e.g. the S3A staging committer) uses the job attempt ID as a unique ID, then any two jobs started within the same second have the same ID and can clash.
### Does this PR introduce _any_ user-facing change?
Good question. It is "developer-facing" in the context of anyone writing a committer, but it reinstates a property which was in Spark 1.x and "went away".
### How was this patch tested?
Testing: no test here. You'd have to create a new committer which extracted the value in both the job and the task(s) and verified consistency. That is possible (with a task output whose records contained the UUID), but it would be pretty convoluted and a high maintenance cost.
Because it's trying to address a race condition, it's hard to reproduce the problem downstream and so verify a fix in a test run. I'll just look at the logs to see what temporary dir is being used in the cluster FS and verify it's a UUID.
Closes #30141 from steveloughran/SPARK-33230-jobId.
Authored-by: Steve Loughran <stevel@cloudera.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
The page returned by `/jobs` in the Spark UI stores the detail information of each job in JavaScript like this:
```javascript
{
'className': 'executor added',
'group': 'executors',
'start': new Date(1602834008978),
'content': '<div class="executor-event-content"' +
'data-toggle="tooltip" data-placement="top"' +
'data-title="Executor 3<br>' +
'Added at 2020/10/16 15:40:08"' +
'data-html="true">Executor 3 added</div>'
}
```
If an application has a failed job, the failure reason corresponding to the job is stored in the `content` field of the JavaScript. If the failure reason contains the character **'**, the JavaScript code throws an exception, which causes the event timeline URL to stop responding. The following is an example of the broken JavaScript:
```javascript
{
'className': 'executor removed',
'group': 'executors',
'start': new Date(1602925908654),
'content': '<div class="executor-event-content"' +
'data-toggle="tooltip" data-placement="top"' +
'data-title="Executor 2<br>' +
'Removed at 2020/10/17 17:11:48' +
'<br>Reason: Container from a bad node: ... 20/10/17 16:00:42 WARN ShutdownHookManager: ShutdownHook **'$anon$2'** timeout..."' +
'data-html="true">Executor 2 removed</div>'
}
```
So we need to consider this special case: if the returned job info contains the character **'**, just remove it, as sketched below.
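A minimal sketch of the idea, using a hypothetical helper; the actual fix lives in the Spark UI code that builds the timeline JavaScript:
```scala
// Strip the single-quote character from the job failure reason before it is
// embedded into the single-quoted JavaScript string literal shown above.
def sanitizeFailureReason(reason: String): String = reason.replace("'", "")

// e.g. "ShutdownHook '$anon$2' timeout" becomes "ShutdownHook $anon$2 timeout"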
### Why are the changes needed?
Ensure that the UI page can function normally
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
This PR only fixes an exception in a special case; the manual test result is as below:
![fixed](https://user-images.githubusercontent.com/52202080/96711638-74490580-13d0-11eb-93e0-b44d9ed5da5c.gif)
Closes #30119 from akiyamaneko/timeline_view_cannot_open.
Authored-by: neko <echohlne@gmail.com>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
### What changes were proposed in this pull request?
This PR enables auto bucketed table scan by default, with the exception of cached queries, for which it is disabled (similar to AQE). The reason for disabling auto scan for cached queries is that the cached query's output partitioning can be leveraged later to avoid shuffle and sort when doing joins and aggregations.
### Why are the changes needed?
Enabling auto bucketed table scan by default is useful, as it can optimize queries automatically under the hood, without user interaction.
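A hedged usage sketch; the config key below is assumed from the feature description and may differ from the exact key used by this PR:
```scala
// Opt a session out of the new default behavior if needed (assumed key name).
spark.conf.set("spark.sql.sources.bucketing.autoBucketedScan.enabled", "false")
```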
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Added unit test for cached query in `DisableUnnecessaryBucketedScanSuite.scala`. Also change a bunch of unit tests which should disable auto bucketed scan to make them work.
Closes #30138 from c21/enable-auto-bucket.
Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
This PR aims to use a pre-built image for Github Action SparkR job.
### Why are the changes needed?
This will reduce the execution time and the flakiness.
**BEFORE (21 minutes 39 seconds)**
![Screen Shot 2020-10-16 at 1 24 43 PM](https://user-images.githubusercontent.com/9700541/96305593-fbeada80-0fb2-11eb-9b8e-86d8abaad9ef.png)
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass the GitHub Action `sparkr` job in this PR.
Closes #30066 from dongjoon-hyun/SPARKR.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Make changes to `spark.sql.analyzer.maxIterations` take effect at runtime.
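A minimal usage sketch of the runtime behavior this PR enables (value illustrative):
```scala
// After this change, updating the value on a live session takes effect for
// subsequent analysis runs instead of being silently ignored.
spark.conf.set("spark.sql.analyzer.maxIterations", "200")
```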
### Why are the changes needed?
`spark.sql.analyzer.maxIterations` is not a static conf. However, before this patch, changing `spark.sql.analyzer.maxIterations` at runtime does not take effect.
### Does this PR introduce _any_ user-facing change?
Yes. Before this patch, changing `spark.sql.analyzer.maxIterations` at runtime does not take effect.
### How was this patch tested?
modified unit test
Closes #30108 from yuningzh-db/dynamic-analyzer-max-iterations.
Authored-by: Yuning Zhang <yuning.zhang@databricks.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
This is to support left semi join in stream-stream join. The implementation of left semi join is (mostly in `StreamingSymmetricHashJoinExec` and `SymmetricHashJoinStateManager`):
* For a left-side input row, check if there's a match in the right-side state store.
    * If there's a match, output the left-side row, but do not put the row in the left-side state store (there is no need to keep it in the state store).
    * If there's no match, output nothing, but put the row in the left-side state store (with the "matched" field set to false in the state store).
* For a right-side input row, check if there's a match in the left-side state store.
    * For all matching left rows in the state store, output those whose "matched" field is false, then set the "matched" field of all of them to true. Only the left-side rows matched for the first time are output, to guarantee left semi join semantics.
* State store eviction: evict rows from the left/right-side state stores below the watermark, same as inner join.
Note: a follow-up optimization could be to evict matched left-side rows from the state store earlier, even when the rows are still above the watermark. However, this needs more changes in `SymmetricHashJoinStateManager`, so it is left as a follow-up.
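A hedged sketch of what a left semi stream-stream join could look like with this change; the column names, watermarks, and time condition are illustrative, adapted from the usual stream-stream join pattern:
```scala
import org.apache.spark.sql.functions.expr

val impressions = spark.readStream.format("rate").load()
  .selectExpr("value AS adId", "timestamp AS impressionTime")
  .withWatermark("impressionTime", "10 minutes")

val clicks = spark.readStream.format("rate").load()
  .selectExpr("value AS clickAdId", "timestamp AS clickTime")
  .withWatermark("clickTime", "20 minutes")

// Emit each impression at most once if a matching click arrives in range;
// per-ad click counts are not needed (left semi semantics).
val matchedImpressions = impressions.join(
  clicks,
  expr("adId = clickAdId AND clickTime BETWEEN impressionTime AND impressionTime + interval 1 hour"),
  "left_semi")
```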
### Why are the changes needed?
Current stream-stream join supports inner, left outer and right outer join (https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamingSymmetricHashJoinExec.scala#L166 ). Internally we do see a lot of users using left semi stream-stream joins (outside Spark Structured Streaming), e.g. I want to get the ad impressions (join left side) which have a click (join right side), but I don't care how many clicks per ad (left semi semantics).
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Added unit tests in `UnsupportedOperationChecker.scala` and `StreamingJoinSuite.scala`.
Closes #30076 from c21/stream-join.
Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
### What changes were proposed in this pull request?
I am generating the SHA-512 using the standard shasum which also has a better output compared to GPG.
### Why are the changes needed?
This makes the hash much easier to verify for users that don't have GPG.
A user with GPG can check the keys, but a user without GPG will have a hard time validating the SHA-512 based on the 'pretty printed' format.
Apache Spark is the only project where I've seen this format; most other Apache projects have a one-line hash file.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
This patch assumes the build system has shasum (it should, but I can't test this).
Closes #30123 from emilianbold/master.
Authored-by: Emi <emilian.bold@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This patch adds new UTs to prevent SPARK-29438 for streaming aggregation as well as flatMapGroupsWithState, as we agree about the review comment quote here:
https://github.com/apache/spark/pull/26162#issuecomment-576929692
> LGTM for this PR. But on a additional note, this is a very subtle and easy-to-make bug with TaskContext.getPartitionId. I wonder if this bug is present in any other stateful operation. Can you please verify how partitionId is used in the other stateful operations?
For now they're not broken, but even better if we have UTs to prevent the case for the future.
### Why are the changes needed?
New UTs will prevent streaming aggregation and flatMapGroupsWithState from being broken in the future when placed on the right side of a UNION while the number of partitions changes on the left side of the UNION. Please refer to SPARK-29438 for more details.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Added UTs.
Closes #27333 from HeartSaVioR/SPARK-29438-add-regression-test.
Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
Handle executor failure with multiple containers
Added a Spark property `spark.kubernetes.executor.checkAllContainers`, with the default being false. When it's true, the executor snapshot takes all containers in the executor pod into consideration when deciding whether the executor is in the "Running" state, if the pod restart policy is "Never". Also added the new Spark property to the docs.
### What changes were proposed in this pull request?
Checking of all containers in the executor pod when reporting executor status, if the `spark.kubernetes.executor.checkAllContainers` property is set to true.
### Why are the changes needed?
Currently, a pod remains "running" as long as there is at least one running container. This prevents Spark from noticing when a container has failed in an executor pod with multiple containers. With this change, users can configure the behavior to be different. Namely, if any container in the executor pod has failed, whether the executor process or one of its sidecars, the pod is considered to be failed and will be rescheduled.
### Does this PR introduce _any_ user-facing change?
Yes, a new Spark property is added.
Users are now able to choose whether to turn on this feature using the `spark.kubernetes.executor.checkAllContainers` property, as shown below.
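A minimal usage sketch (the property name comes from the description above; the rest of the conf is illustrative):
```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.kubernetes.executor.checkAllContainers", "true") // consider sidecar containers too
```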
### How was this patch tested?
Unit test was added and all passed.
I tried to run integration test by following the instruction [here](https://spark.apache.org/developer-tools.html) (section "Testing K8S") and also [here](https://github.com/apache/spark/blob/master/resource-managers/kubernetes/integration-tests/README.md), but I wasn't able to run it smoothly as it fails to talk with minikube cluster. Maybe it's because my minikube version is too new (I'm using v1.13.1)...? Since I've been trying it for two days and still can't make it work, I decided to submit this PR and hopefully the Jenkins test will pass.
Closes #29924 from huskysun/exec-sidecar-failure.
Authored-by: Shiqi Sun <s.sun@salesforce.com>
Signed-off-by: Holden Karau <hkarau@apple.com>
### What changes were proposed in this pull request?
Add type hints guidelines to developer docs.
### Why are the changes needed?
Since it is a new and still somewhat evolving feature, we should provide clear guidelines for potential contributors.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Closes #30094 from zero323/SPARK-33003.
Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Add a test case to audit all JDBC metadata behaviors, to detect and prevent potential silent API changes from either the upstream hive-jdbc module or the Spark Thrift Server side.
Forked from my kyuubi project here https://github.com/yaooqinn/kyuubi/blob/master/externals/kyuubi-spark-sql-engine/src/test/scala/org/apache/kyuubi/engine/spark/operation/SparkOperationSuite.scala
### Why are the changes needed?
Make the SparkThriftServer safer to evolve.
### Does this PR introduce _any_ user-facing change?
dev only
### How was this patch tested?
new tests
Closes #30101 from yaooqinn/SPARK-33193.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR proposes to exclude `org.apache.hadoop:hadoop-yarn-server-resourcemanager:jar:tests` from `hadoop-yarn-server-tests` when we use Hadoop 2 profile.
For some reason, after the SBT 1.3 upgrade in SPARK-21708, SBT started to pull the dependencies of `hadoop-yarn-server-tests` with the `tests` classifier:
```
org/apache/hadoop/hadoop-common/2.7.4/hadoop-common-2.7.4-tests.jar
org/apache/hadoop/hadoop-yarn-common/2.7.4/hadoop-yarn-common-2.7.4-tests.jar
org/apache/hadoop/hadoop-yarn-server-resourcemanager/2.7.4/hadoop-yarn-server-resourcemanager-2.7.4-tests.jar
```
These were not pulled before the upgrade.
This specific `hadoop-yarn-server-resourcemanager-2.7.4-tests.jar` causes the problem (SPARK-33104):
1. When the test case creates the Hadoop configuration here,
cc06266ade/core/src/main/scala/org/apache/spark/deploy/SparkHadoopUtil.scala (L122)
2. Such jars above have higher precedence in the class path, instead of the specified custom `core-site.xml` in the test:
e93b8f02cd/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala (L1375)
3. Later, `core-site.xml` in the jar is picked instead in Hadoop's `Configuration`:
Before this fix:
```
jar:file:/.../https/maven-central.storage-download.googleapis.com/maven2/org/apache/hadoop/
hadoop-yarn-server-resourcemanager/2.7.4/hadoop-yarn-server-resourcemanager-2.7.4-tests.jar!/core-site.xml
```
After this fix:
```
file:/.../spark/resource-managers/yarn/target/org.apache.spark.deploy.yarn.YarnClusterSuite/
org.apache.spark.deploy.yarn.YarnClusterSuite-localDir-nm-0_0/
usercache/.../filecache/10/__spark_conf__.zip/__hadoop_conf__/core-site.xml
```
4. the `core-site.xml` in the jar of course does not contain:
2cfd215dc4/resource-managers/yarn/src/test/scala/org/apache/spark/deploy/yarn/YarnClusterSuite.scala (L133-L141)
and the specific test fails.
This PR uses a somewhat hacky approach: the module is excluded from `hadoop-yarn-server-tests` with the `tests` classifier, and then added back as a proper dependency (when the Hadoop 2 profile is used). In this way, SBT no longer pulls `hadoop-yarn-server-resourcemanager` with the `tests` classifier.
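A hedged, illustrative SBT-style sketch of that approach (exclude the transitive module from the `tests`-classified dependency, then add it back as a plain dependency); the real change lives in Spark's build definitions and differs in detail:
```scala
libraryDependencies ++= Seq(
  ("org.apache.hadoop" % "hadoop-yarn-server-tests" % "2.7.4" % Test)
    .classifier("tests")
    .excludeAll(ExclusionRule("org.apache.hadoop", "hadoop-yarn-server-resourcemanager")),
  // added back as a proper (non-"tests") dependency for the Hadoop 2 profile
  "org.apache.hadoop" % "hadoop-yarn-server-resourcemanager" % "2.7.4" % Test
)
```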
### Why are the changes needed?
To make the build pass. This is a blocker.
### Does this PR introduce _any_ user-facing change?
No, test-only.
### How was this patch tested?
Manually tested and debugged:
```bash
build/sbt clean "yarn/testOnly *.YarnClusterSuite -- -z SparkHadoopUtil" -Pyarn -Phadoop-2.7 -Phive -Phive-2.3
```
Closes #30133 from HyukjinKwon/SPARK-33104.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR aims to upgrade SBT from 1.4.0 to 1.4.1.
### Why are the changes needed?
SBT 1.4.1 is a maintenance release at 1.4.x line. There are many bug fixes already.
- https://github.com/sbt/sbt/releases/tag/v1.4.1 (Released on 2020-10-19)
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass the CI and check [the Jenkins log](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/130185/testReport).
```
========================================================================
Building Spark
========================================================================
[info] Building Spark using SBT with these arguments: -Phadoop-3.2 -Phive-2.3 -Phive -Pspark-ganglia-lgpl -Pkinesis-asl -Pyarn -Phadoop-cloud -Phive-thriftserver -Pkubernetes -Pmesos test:package streaming-kinesis-asl-assembly/assembly
Using /usr/java/jdk1.8.0_191 as default JAVA_HOME.
Note, this will be overridden by -java-home if it is set.
Attempting to fetch sbt
Launching sbt from build/sbt-launch-1.4.1.jar
[info] [launcher] getting org.scala-sbt sbt 1.4.1 (this may take some time)...
downloading https://repo1.maven.org/maven2/org/scala-sbt/sbt/1.4.1/sbt-1.4.1.jar ...
```
Closes #30137 from dongjoon-hyun/SBT_1.4.1.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
### What changes were proposed in this pull request?
#### Case
The case here covers the behavior of static and dynamic SQL configs in `sharedState` and `sessionState`, and the specially handled config `spark.sql.warehouse.dir`.
The case can be found here: https://github.com/yaooqinn/sugar/blob/master/src/main/scala/com/netease/mammut/spark/training/sql/WarehouseSCBeforeSS.scala
```scala
import java.lang.reflect.Field

import org.apache.spark.sql.SparkSession
import org.apache.spark.{SparkConf, SparkContext}

object WarehouseSCBeforeSS extends App {
  val wh = "spark.sql.warehouse.dir"
  val td = "spark.sql.globalTempDatabase"
  val custom = "spark.sql.custom"

  val conf = new SparkConf()
    .setMaster("local")
    .setAppName("SPARK-32991")
    .set(wh, "./data1")
    .set(td, "bob")
  val sc = new SparkContext(conf)

  val spark = SparkSession.builder()
    .config(wh, "./data2")
    .config(td, "alice")
    .config(custom, "kyao")
    .getOrCreate()

  val confField: Field = spark.sharedState.getClass.getDeclaredField("conf")
  confField.setAccessible(true)
  private val shared: SparkConf = confField.get(spark.sharedState).asInstanceOf[SparkConf]
  println()
  println(s"=====> SharedState: $wh=${shared.get(wh)}")
  println(s"=====> SharedState: $td=${shared.get(td)}")
  println(s"=====> SharedState: $custom=${shared.get(custom, "")}")
  println(s"=====> SessionState: $wh=${spark.conf.get(wh)}")
  println(s"=====> SessionState: $td=${spark.conf.get(td)}")
  println(s"=====> SessionState: $custom=${spark.conf.get(custom, "")}")

  val spark2 = SparkSession.builder().config(td, "fred").getOrCreate()
  println(s"=====> SessionState 2: $wh=${spark2.conf.get(wh)}")
  println(s"=====> SessionState 2: $td=${spark2.conf.get(td)}")
  println(s"=====> SessionState 2: $custom=${spark2.conf.get(custom, "")}")

  SparkSession.setActiveSession(spark)
  spark.sql("RESET")
  println(s"=====> SessionState RESET: $wh=${spark.conf.get(wh)}")
  println(s"=====> SessionState RESET: $td=${spark.conf.get(td)}")
  println(s"=====> SessionState RESET: $custom=${spark.conf.get(custom, "")}")

  val spark3 = SparkSession.builder().getOrCreate()
  println(s"=====> SessionState 3: $wh=${spark2.conf.get(wh)}")
  println(s"=====> SessionState 3: $td=${spark2.conf.get(td)}")
  println(s"=====> SessionState 3: $custom=${spark2.conf.get(custom, "")}")
}
```
#### Outputs and analysis
```
// 1. Make the cloned spark conf in shared state respect the warehouse dir from the 1st SparkSession
//=====> SharedState: spark.sql.warehouse.dir=./data1
// 2. ⏬
//=====> SharedState: spark.sql.globalTempDatabase=alice
//=====> SharedState: spark.sql.custom=kyao
//=====> SessionState: spark.sql.warehouse.dir=./data2
//=====> SessionState: spark.sql.globalTempDatabase=alice
//=====> SessionState: spark.sql.custom=kyao
//=====> SessionState 2: spark.sql.warehouse.dir=./data2
//=====> SessionState 2: spark.sql.globalTempDatabase=alice
//=====> SessionState 2: spark.sql.custom=kyao
// 2'.🔼 OK until here
// 3. Make the below 3 ones respect the cloned spark conf in shared state with issue 1 fixed
//=====> SessionState RESET: spark.sql.warehouse.dir=./data1
//=====> SessionState RESET: spark.sql.globalTempDatabase=bob
//=====> SessionState RESET: spark.sql.custom=
// 4. Then the SparkSessions created after RESET will be corrected.
//=====> SessionState 3: spark.sql.warehouse.dir=./data1
//=====> SessionState 3: spark.sql.globalTempDatabase=bob
//=====> SessionState 3: spark.sql.custom=
```
In this PR, we gather all valid configs into the cloned conf of `sharedState` while it is being constructed (actually, only `spark.sql.warehouse.dir` is missing). Then we use this conf as the defaults for the `RESET` command.
`SparkSession.clearActiveSession/clearDefaultSession` will make the shared state invisible and unsharable. They will become internal-only soon (confirmed with Wenchen), so cases that call them will not be a problem.
### Why are the changes needed?
Bug fix for the programming API: calling RESET when users create a SparkContext first and configure the SparkSession later.
### Does this PR introduce _any_ user-facing change?
Yes. Before this change, when you use the programming API and call RESET, all configs are reset to `SparkContext.conf`; now they are reset to `SparkSession.sharedState.conf`.
### How was this patch tested?
new tests
Closes #30045 from yaooqinn/SPARK-32991.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR cleans up the RPC message flow among the multiple decommission use cases, it includes changes:
* Keep `Worker`'s decommission status be consistent between the case where decommission starts from `Worker` and the case where decommission starts from the `MasterWebUI`: sending `DecommissionWorker` from `Master` to `Worker` in the latter case.
* Change from two-way to one-way communication when notifying decommission between driver and executor: it's obviously unnecessary for the executor to acknowledge the decommission status to the driver since the decommission request comes from the driver, and the same holds in the reverse direction.
* Only send one message instead of two(`DecommissionSelf`/`DecommissionBlockManager`) when decommission the executor: executor and `BlockManager` are in the same JVM.
* Clean up codes around here.
### Why are the changes needed?
Before:
<img width="1948" alt="WeChat56c00cc34d9785a67a544dca036d49da" src="https://user-images.githubusercontent.com/16397174/92850308-dc461c80-f41e-11ea-8ac0-287825f4e0c4.png">
After:
<img width="1968" alt="WeChat05f7afb017e3f0132394c5e54245e49e" src="https://user-images.githubusercontent.com/16397174/93189571-de88dd80-f774-11ea-9300-1943920aa27d.png">
(Note: the diagrams only count the RPC calls that need to go through the network. Local RPC calls are not counted here.)
After this change, we removed 6 of the original RPC calls and added one more RPC call to keep the Worker's decommission status consistent. The RPC flow also becomes clearer.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Updated existing tests.
Closes #29817 from Ngone51/simplify-decommission-rpc.
Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
The `PLAN_CHANGE_LOG_LEVEL` config documentation is wrong. This fixes it.
### Why are the changes needed?
Fix wrong doc.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Only doc change.
Closes #30136 from viirya/minor-sqlconf.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
- Building Spark internally in orgs where access to the outside internet is not allowed takes a long time, because unsuccessful attempts are made to download artifacts from repositories which are not accessible. The unsuccessful attempts add a significant amount of time to the build; I have seen a difference of up to 1 hour for some runs.
- Adding one environment variable that, if present at the start of the build, overrides the default repos defined in the code and scripts.
Environment variable:
- DEFAULT_ARTIFACT_REPOSITORY=https://artifacts.internal.com/libs-release/
### Why are the changes needed?
To allow orgs to build spark internally without relying on external repositories for artifact downloads.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Multiple builds with and without env variables set.
Closes #29874 from ankits/SPARK-32998.
Authored-by: Ankit Srivastava <ankit_srivastava@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
1. Replace the metadata key `org.apache.spark.int96NoRebase` by `org.apache.spark.legacyINT96`.
2. Change the condition when new key should be saved to parquet metadata: it should be saved when the SQL config `spark.sql.legacy.parquet.int96RebaseModeInWrite` is set to `LEGACY`.
3. Change the handling of the metadata key on read:
- If the key is not present in the parquet metadata, take the rebase mode from the SQL config `spark.sql.legacy.parquet.int96RebaseModeInRead`.
- If the parquet files were saved by Spark < 3.1.0, use the `LEGACY` rebasing mode for the INT96 type.
- For files written by Spark >= 3.1.0, if `org.apache.spark.legacyINT96` is present in the metadata, perform rebasing; otherwise don't.
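A hedged usage sketch of the write-side config discussed above (path and values illustrative):
```scala
// Write INT96 timestamps with the legacy (hybrid Julian) rebase; per the
// description above, the file then carries the org.apache.spark.legacyINT96
// metadata key so readers know to rebase regardless of their read-side setting.
spark.conf.set("spark.sql.legacy.parquet.int96RebaseModeInWrite", "LEGACY")
spark.range(1)
  .selectExpr("timestamp'1000-01-01 00:00:00' AS ts")
  .write.mode("overwrite").parquet("/tmp/int96-legacy")
```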
### Why are the changes needed?
- To not increase parquet size by default when `spark.sql.legacy.parquet.int96RebaseModeInWrite` is `EXCEPTION` after https://github.com/apache/spark/pull/30121.
- To have the implementation similar to `org.apache.spark.legacyDateTime`
- To minimise impact on other subsystems that are based on file sizes like gathering statistics.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Modified test in `ParquetIOSuite`
Closes #30132 from MaxGekk/int96-flip-metadata-rebase-key.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
The purpose of this PR is to resolve SPARK-32978.
The main reason for the bad case described in SPARK-32978 is that `BasicWriteTaskStatsTracker` directly reports the number of newly added partitions per task, which makes it impossible to remove duplicates on the driver side.
The main change of this PR is to report the partitionValues to the driver and remove duplicates on the driver side, to make sure the number-of-dynamic-partitions metric is correct.
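An illustrative sketch of the driver-side idea, using plain strings instead of Spark's internal types; the real logic lives in the write stats trackers:
```scala
// Each task reports the partition values it touched; the driver takes the
// distinct union instead of summing per-task "new partition" counts.
val taskPartitionValues = Seq(
  Seq("year=2020/month=1", "year=2020/month=2"), // task 0
  Seq("year=2020/month=2", "year=2020/month=3")  // task 1 (month=2 already created by task 0)
)
val numDynamicParts = taskPartitionValues.flatten.distinct.size // 3, not 4
```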
### Why are the changes needed?
The number-of-dynamic-partitions metric we display in the UI should be correct.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Add a new test case for the scenario described in SPARK-32978.
Closes #30026 from LuciferYang/SPARK-32978.
Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Support using an HDFS location for `spark.sql.hive.metastore.jars`.
When users need to use a path to set the Hive metastore jars, they should set
`spark.sql.hive.metastore.jars=path` and put the real path in `spark.sql.hive.metastore.jars.path`,
since we use `File.pathSeparator` to split the path, and `File.pathSeparator` is `:` on Unix, which would split an HDFS location like `hdfs://nameservice/xx`. So the new config `spark.sql.hive.metastore.jars.path` is added to accept comma-separated paths.
Both ways remain supported.
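A hedged usage sketch of the two configs together (the metastore version and HDFS paths are illustrative):
```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .config("spark.sql.hive.metastore.version", "2.3.7")            // illustrative version
  .config("spark.sql.hive.metastore.jars", "path")
  .config("spark.sql.hive.metastore.jars.path",
    "hdfs://nameservice/hive/lib/*.jar,hdfs://nameservice/hive/aux/*.jar") // hypothetical paths
  .enableHiveSupport()
  .getOrCreate()
```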
### Why are the changes needed?
All Spark apps can fetch the internal-version Hive jars from an HDFS location; there is no need to distribute them to every node.
### Does this PR introduce _any_ user-facing change?
Users can use an HDFS location to store the Hive metastore jars.
### How was this patch tested?
Manually tested.
Closes #29881 from AngersZhuuuu/SPARK-32852.
Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Override the default SQL strings for the following in the MySQL JDBC dialect, according to the official documentation:
- ALTER TABLE UPDATE COLUMN TYPE
- ALTER TABLE UPDATE COLUMN NULLABILITY
Also write MySQL integration tests for JDBC.
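For reference, MySQL expresses both changes with `MODIFY COLUMN`, with nullability being part of the column definition; a hedged illustration with made-up table and column names, wrapped as Scala strings:
```scala
// Illustrative MySQL DDL the dialect needs to emit (names are made up).
val updateColumnType        = "ALTER TABLE emp MODIFY COLUMN salary BIGINT"
val updateColumnNullability = "ALTER TABLE emp MODIFY COLUMN salary BIGINT NOT NULL"
```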
### Why are the changes needed?
Improved code coverage and support mysql dialect for jdbc.
### Does this PR introduce _any_ user-facing change?
Yes, Support ALTER TABLE in JDBC v2 Table Catalog: add, update type and nullability of columns (MySQL dialect)
### How was this patch tested?
Added tests.
Closes #30025 from ScrapCodes/mysql-dialect.
Authored-by: Prashant Sharma <prashsh1@in.ibm.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Support automatic query cancellation when a query runs too long on the Thrift server.
This is a rework of #28991, and the credit should go to the original author, leoluan2009.
Closes #28991
### Why are the changes needed?
In some cases, we use the Thrift server as a long-running application.
Sometimes we want no query to run longer than a given time.
In these cases, we can enable auto cancellation for time-consuming queries, which lets us release resources for other queries to run.
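A hedged usage sketch; the config key below is an assumption (the description does not name it) and the value is illustrative:
```scala
// Cancel any query running longer than 5 minutes on the Thrift server
// (assumed config name, not confirmed by the description above).
spark.conf.set("spark.sql.thriftServer.queryTimeout", "5m")
```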
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Added tests.
Closes #29933 from maropu/pr28991.
Lead-authored-by: Xuedong Luan <luanxuedong2009@gmail.com>
Co-authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Co-authored-by: Luan <luanxuedong2009@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
This updates the misleading log messages for removed shuffle block during migration.
### Why are the changes needed?
1. For deleted shuffle blocks, `IndexShuffleBlockResolver` shows users a WARN message saying `skipping migration`. However, `BlockManagerDecommissioner` inconsistently shows users an INFO message including `Migrated ShuffleBlockInfo(...)`. Technically, we didn't migrate it, so we should not show a `Migrated` message in this case.
```
INFO BlockManagerDecommissioner: Trying to migrate shuffle ShuffleBlockInfo(109,18924) to BlockManagerId(...) (2 / 3)
WARN IndexShuffleBlockResolver: Failed to resolve shuffle block ShuffleBlockInfo(109,18924), skipping migration. This is expected to occur if a block is removed after decommissioning has started.
INFO BlockManagerDecommissioner: Got migration sub-blocks List()
...
INFO BlockManagerDecommissioner: Migrated ShuffleBlockInfo(109,18924) to BlockManagerId(...)
```
2. In addition, if the shuffle file is deleted while the information is in the queue, the above messages are repeated up to `spark.storage.decommission.maxReplicationFailuresPerBlock` times. We had better use one line instead of a group of messages for that case.
```
INFO BlockManagerDecommissioner: Trying to migrate shuffle ShuffleBlockInfo(109,18924) to BlockManagerId(...) (0 / 3)
...
INFO BlockManagerDecommissioner: Trying to migrate shuffle ShuffleBlockInfo(109,18924) to BlockManagerId(...) (1 / 3)
...
INFO BlockManagerDecommissioner: Trying to migrate shuffle ShuffleBlockInfo(109,18924) to BlockManagerId(...) (2 / 3)
```
3. Deciding whether to skip is the role of the `BlockManagerDecommissioner` class. `IndexShuffleBlockResolver.getMigrationBlocks` is used in two different ways, as follows; we had better inform users once, from `BlockManagerDecommissioner`.
- At the beginning, to get the sub-blocks.
- In case of `IOException`, to determine whether ignoring it or re-throwing. And, `BlockManagerDecommissioner` shows WARN message (`Skipping block ...`) again.
### Does this PR introduce _any_ user-facing change?
No. This is an update for log message info to be consistent.
### How was this patch tested?
Manually.
Closes #30129 from dongjoon-hyun/SPARK-33218.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
`REGEXP_REPLACE` replaces all substrings of a string that match a regexp with a replacement string.
But `REGEXP_REPLACE` lacks some flexibility, for example converting camel-case strings into strings containing lower-case words separated by underscores:
AddressLine1 -> address_line_1
If we support the position parameter, we can do it like this (e.g. in Oracle):
```
WITH strings as (
SELECT 'AddressLine1' s FROM dual union all
SELECT 'ZipCode' s FROM dual union all
SELECT 'Country' s FROM dual
)
SELECT s "STRING",
lower(regexp_replace(s, '([A-Z0-9])', '_\1', 2)) "MODIFIED_STRING"
FROM strings;
```
The output:
```
STRING MODIFIED_STRING
-------------------- --------------------
AddressLine1 address_line_1
ZipCode zip_code
Country country
```
Several mainstream databases support this syntax.
**Oracle**
https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/REGEXP_REPLACE.html#GUID-EA80A33C-441A-4692-A959-273B5A224490
**Vertica**
https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/Functions/RegularExpressions/REGEXP_REPLACE.htm?zoom_highlight=regexp_replace
**Redshift**
https://docs.aws.amazon.com/redshift/latest/dg/REGEXP_REPLACE.html
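A hedged sketch of the equivalent call in Spark SQL once the position argument is supported (note that Spark's replacement string uses `$1` rather than Oracle's `\1` for group references):
```scala
spark.sql("""
  SELECT lower(regexp_replace('AddressLine1', '([A-Z0-9])', '_$1', 2)) AS modified_string
""").show()
// modified_string: address_line_1 (the leading 'A' is skipped because matching starts at position 2)
```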
### Why are the changes needed?
The position parameter for `REGEXP_REPLACE` is very useful.
### Does this PR introduce _any_ user-facing change?
'Yes'.
### How was this patch tested?
Jenkins test.
Closes #29891 from beliefer/add-position-for-regex_replace.
Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: beliefer <beliefer@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This switches Spark to use shaded Hadoop clients, namely hadoop-client-api and hadoop-client-runtime, for Hadoop 3.x. For Hadoop 2.7, we'll still use the same modules such as hadoop-client.
In order to still keep default Hadoop profile to be hadoop-3.2, this defines the following Maven properties:
```
hadoop-client-api.artifact
hadoop-client-runtime.artifact
hadoop-client-minicluster.artifact
```
which default to:
```
hadoop-client-api
hadoop-client-runtime
hadoop-client-minicluster
```
but all switch to `hadoop-client` when the Hadoop profile is hadoop-2.7. A side effect of this is that we'll import the same dependency multiple times. For this I have to disable the Maven enforcer rule `banDuplicatePomDependencyVersions`.
Besides above, there are the following changes:
- explicitly add a few dependencies which are imported via transitive dependencies from Hadoop jars, but are removed from the shaded client jars.
- removed the use of `ProxyUriUtils.getPath` from `ApplicationMaster` which is a server-side/private API.
- modified `IsolatedClientLoader` to exclude `hadoop-auth` jars when Hadoop version is 3.x. This change should only matter when we're not sharing Hadoop classes with Spark (which is _mostly_ used in tests).
### Why are the changes needed?
This serves two purposes:
- to unblock Spark from upgrading to Hadoop 3.2.2/3.3.0+. Latest Hadoop versions have upgraded to use Guava 27+ and in order to adopt the latest Hadoop versions in Spark, we'll need to resolve the Guava conflicts. This takes the approach by switching to shaded client jars provided by Hadoop.
- avoid pulling 3rd party dependencies from Hadoop and avoid potential future conflicts.
### Does this PR introduce _any_ user-facing change?
When people use Spark with `hadoop-provided` option, they should make sure class path contains `hadoop-client-api` and `hadoop-client-runtime` jars. In addition, they may need to make sure these jars appear before other Hadoop jars in the order. Otherwise, classes may be loaded from the other non-shaded Hadoop jars and cause potential conflicts.
### How was this patch tested?
Relying on existing tests.
Closes #29843 from sunchao/SPARK-29250.
Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: DB Tsai <d_tsai@apple.com>
### What changes were proposed in this pull request?
1. Set the default value for the SQL configs `spark.sql.legacy.parquet.int96RebaseModeInWrite` and `spark.sql.legacy.parquet.int96RebaseModeInRead` to `EXCEPTION`.
2. Update the SQL migration guide.
### Why are the changes needed?
The current default value `LEGACY` may lead to shifted timestamps on read or on write. We should leave the decision about rebasing to users.
### Does this PR introduce _any_ user-facing change?
Yes
### How was this patch tested?
By existing test suites like `ParquetIOSuite`.
Closes #30121 from MaxGekk/int96-exception-by-default.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Increase the tolerance for two tests that fail in some environments and pass in others (flaky? pass/fail is constant within the same environment).
### Why are the changes needed?
The tests `pyspark.ml.recommendation` and `pyspark.ml.tests.test_algorithms` fail with
```
File "/home/jenkins/python/pyspark/ml/tests/test_algorithms.py", line 96, in test_raw_and_probability_prediction
self.assertTrue(np.allclose(result.rawPrediction, expected_rawPrediction, atol=1))
AssertionError: False is not true
```
```
File "/home/jenkins/python/pyspark/ml/recommendation.py", line 256, in _main_.ALS
Failed example:
predictions[0]
Expected:
Row(user=0, item=2, newPrediction=0.6929101347923279)
Got:
Row(user=0, item=2, newPrediction=0.6929104924201965)
...
```
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
This patch changes a test target. I just executed the tests to verify they pass.
Closes #30104 from AlessandroPatti/apatti/rounding-errors.
Authored-by: Alessandro Patti <ale812@yahoo.it>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
1. Turn off/on the SQL config `spark.sql.legacy.parquet.int96RebaseModeInWrite` which was added by https://github.com/apache/spark/pull/30056 in `DateTimeRebaseBenchmark`. The parquet readers should infer correct rebasing mode automatically from metadata.
2. Regenerate benchmark results of `DateTimeRebaseBenchmark` in the environment:
| Item | Description |
| ---- | ----|
| Region | us-west-2 (Oregon) |
| Instance | r3.xlarge (spot instance) |
| AMI | ami-06f2f779464715dc5 (ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1) |
| Java | OpenJDK8/11 installed by`sudo add-apt-repository ppa:openjdk-r/ppa` & `sudo apt install openjdk-11-jdk`|
### Why are the changes needed?
To have up-to-date info about INT96 performance which is the default type for Catalyst's timestamp type.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By updating benchmark results:
```
$ SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.execution.benchmark.DateTimeRebaseBenchmark"
```
Closes #30118 from MaxGekk/int96-rebase-benchmark.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR is a small followup of https://github.com/apache/spark/pull/28793 and proposes to use `is_categorical_dtype` instead of deprecated `is_categorical`.
`is_categorical_dtype` exists from minimum pandas version we support (https://github.com/pandas-dev/pandas/blob/v0.23.2/pandas/core/dtypes/api.py), and `is_categorical` was deprecated from pandas 1.1.0 (87a1cc21ca).
### Why are the changes needed?
To avoid using deprecated APIs, and remove warnings.
### Does this PR introduce _any_ user-facing change?
Yes, it will remove warnings that says `is_categorical` is deprecated.
### How was this patch tested?
By running any pandas UDF with pandas 1.1.0+:
```python
import pandas as pd
from pyspark.sql.functions import pandas_udf
def func(x: pd.Series) -> pd.Series:
return x
spark.range(10).select(pandas_udf(func, "long")("id")).show()
```
Before:
```
/.../python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py:151: FutureWarning: is_categorical is deprecated and will be removed in a future version. Use is_categorical_dtype instead
...
```
After:
```
...
```
Closes #30114 from HyukjinKwon/replace-deprecated-is_categorical.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Bryan Cutler <cutlerb@gmail.com>
### What changes were proposed in this pull request?
This PR changes `<` into `>` in the following to fix data loss during storage migrations.
```scala
// If we found any new shuffles to migrate or otherwise have not migrated everything.
- newShufflesToMigrate.nonEmpty || migratingShuffles.size < numMigratedShuffles.get()
+ newShufflesToMigrate.nonEmpty || migratingShuffles.size > numMigratedShuffles.get()
```
### Why are the changes needed?
`refreshOffloadingShuffleBlocks` should return `true` when the migration is still on-going.
Since `migratingShuffles` is defined like the following, `migratingShuffles.size > numMigratedShuffles.get()` means the migration is not finished.
```scala
// Shuffles which are either in queue for migrations or migrated
protected[storage] val migratingShuffles = mutable.HashSet[ShuffleBlockInfo]()
```
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass the CI with the updated test cases.
Closes #30116 from dongjoon-hyun/SPARK-33202.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR intends to upgrade snappy-java from 1.1.7.5 to 1.1.8.
### Why are the changes needed?
For performance improvements; the released `snappy-java` bundles the latest `Snappy` v1.1.8 binaries with small performance improvements.
- snappy-java release note: https://github.com/xerial/snappy-java/releases/tag/1.1.8
- snappy release note: https://github.com/google/snappy/releases/tag/1.1.8
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
GA tests.
Closes #30120 from maropu/Snappy1.1.8.
Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
### What changes were proposed in this pull request?
1. Optimize `predictQuantiles` by pre-computing an auxiliary variable.
### Why are the changes needed?
In https://github.com/apache/spark/pull/30000, I optimized the `transform` method. I found that we can also optimize `predictQuantiles` by pre-computing an auxiliary variable.
It is about 56% faster than the existing implementation.
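A hedged sketch of the optimization pattern, assuming the AFT (Weibull) quantile formula q(p) = exp(prediction) * (-ln(1 - p))^scale; the names and structure are illustrative, not the actual `AFTSurvivalRegressionModel` code:
```scala
// Pre-compute the per-probability factor (-ln(1 - p))^scale once per model,
// so each call only does one exp and a few multiplications per quantile.
class AftQuantileHelper(quantileProbabilities: Array[Double], scale: Double) {
  private val baseQuantiles: Array[Double] =
    quantileProbabilities.map(p => math.exp(math.log(-math.log1p(-p)) * scale))

  /** `linearPrediction` is the linear predictor x*beta + intercept. */
  def predictQuantiles(linearPrediction: Double): Array[Double] =
    baseQuantiles.map(_ * math.exp(linearPrediction))
}
```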
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
existing testsuites
Closes #30034 from zhengruifeng/aft_quantiles_opt.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Address comments https://github.com/apache/spark/pull/29635#discussion_r507241899 to improve migration guide
### Why are the changes needed?
improve migration guide
### Does this PR introduce _any_ user-facing change?
No, only a doc update.
### How was this patch tested?
passing GitHub action
Closes #30113 from yaooqinn/SPARK-32785-F.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
Add an environment variable `PYARROW_IGNORE_TIMEZONE` to pyspark tests in run-tests.py to use legacy nested timestamp behavior. This means that when converting arrow to pandas, nested timestamps with timezones will have the timezone localized during conversion.
### Why are the changes needed?
The default behavior was changed in PyArrow 2.0.0 to propagate timezone information. Using the environment variable enables testing with newer versions of pyarrow until the issue can be fixed in SPARK-32285.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing tests
Closes #30111 from BryanCutler/arrow-enable-legacy-nested-timestamps-SPARK-33189.
Authored-by: Bryan Cutler <cutlerb@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR aims to fix `getMigrationBlocks` error handling and to add test coverage.
1. `getMigrationBlocks` should not fail in the index-file-only case.
2. `assert` throws `java.lang.AssertionError`, which is not an `Exception`.
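To illustrate point 2: `AssertionError` extends `Error`, not `Exception`, so an exception handler that only catches `Exception` lets it propagate (illustrative snippet, not the actual Spark code):
```scala
try {
  try {
    assert(false, "index file is missing")       // throws java.lang.AssertionError
  } catch {
    case e: Exception => println(s"handled: $e") // not reached: AssertionError is an Error
  }
} catch {
  case e: Throwable => println(s"escaped the Exception handler: $e")
}
```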
### Why are the changes needed?
To handle the exception correctly.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass the CI with the newly added test case.
Closes #30110 from dongjoon-hyun/SPARK-33198.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR aims to improve the log message for better analysis.
### Why are the changes needed?
Good logs are crucial always.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Manual review.
Closes #30109 from dongjoon-hyun/k8s_log.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This pull request changes the description about `to_avro` and `from_avro` functions to include Python as a supported language as the functions have been supported in Python since Apache Spark 3.0.0 [[SPARK-26856](https://issues.apache.org/jira/browse/SPARK-26856)].
### Why are the changes needed?
Same as above.
### Does this PR introduce _any_ user-facing change?
Yes. The description changed by this pull request is on https://spark.apache.org/docs/latest/sql-data-sources-avro.html#to_avro-and-from_avro.
### How was this patch tested?
Tested manually by building and checking the document in the local environment.
Closes #30105 from kjmrknsn/fix-docs-sql-data-sources-avro.
Authored-by: Keiji Yoshida <kjmrknsn@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR proposes to fix:
```
org.apache.spark.deploy.yarn.YarnClusterSuite.run Python application in yarn-client mode
org.apache.spark.deploy.yarn.YarnClusterSuite.run Python application in yarn-cluster mode
org.apache.spark.deploy.yarn.YarnClusterSuite.run Python application in yarn-cluster mode using spark.yarn.appMasterEnv to override local envvar
```
it currently fails as below:
```
20/10/16 19:20:36 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0) (amp-jenkins-worker-03.amp executor 1): org.apache.spark.SparkException:
Error from python worker:
Traceback (most recent call last):
File "/usr/lib64/python2.6/runpy.py", line 104, in _run_module_as_main
loader, code, fname = _get_module_details(mod_name)
File "/usr/lib64/python2.6/runpy.py", line 79, in _get_module_details
loader = get_loader(mod_name)
File "/usr/lib64/python2.6/pkgutil.py", line 456, in get_loader
return find_loader(fullname)
File "/usr/lib64/python2.6/pkgutil.py", line 466, in find_loader
for importer in iter_importers(fullname):
File "/usr/lib64/python2.6/pkgutil.py", line 422, in iter_importers
__import__(pkg)
File "/home/jenkins/workspace/SparkPullRequestBuilder2/python/pyspark/__init__.py", line 53, in <module>
from pyspark.rdd import RDD, RDDBarrier
File "/home/jenkins/workspace/SparkPullRequestBuilder2/python/pyspark/rdd.py", line 34, in <module>
from pyspark.java_gateway import local_connect_and_auth
File "/home/jenkins/workspace/SparkPullRequestBuilder2/python/pyspark/java_gateway.py", line 29, in <module>
from py4j.java_gateway import java_import, JavaGateway, JavaObject, GatewayParameters
File "/home/jenkins/workspace/SparkPullRequestBuilder2/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 60
PY4J_TRUE = {"yes", "y", "t", "true"}
^
SyntaxError: invalid syntax
```
I think this was broken when Python 2 was dropped but was not caught because this specific test does not run when there's no change in YARN codes. See also https://github.com/apache/spark/pull/29843#issuecomment-712540024
The root cause seems to be that the paths are different, see https://github.com/apache/spark/pull/29843#pullrequestreview-502595199. I _think_ Jenkins uses a different Python executable via Anaconda and the executor side does not know where it is for some reason.
This PR proposes to fix it just by explicitly specifying the absolute path for Python executable so the tests should pass in any environment.
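A hedged sketch of the idea using the public `spark.pyspark.*` configs (the actual test fix may set the path differently; the path below is illustrative):
```scala
import org.apache.spark.SparkConf

// Point both driver and executors at an absolute Python 3 executable instead
// of whatever a bare "python" resolves to on each node.
val conf = new SparkConf()
  .set("spark.pyspark.driver.python", "/usr/bin/python3")
  .set("spark.pyspark.python", "/usr/bin/python3")
```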
### Why are the changes needed?
To make tests pass.
### Does this PR introduce _any_ user-facing change?
No, dev-only.
### How was this patch tested?
This issue looks specific to Jenkins. It should run the tests on Jenkins.
Closes #30099 from HyukjinKwon/SPARK-33191.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
PyArrow 2.0.0 was uploaded to PyPI today (https://pypi.org/project/pyarrow/), and some tests fail with PyArrow 2.0.0+:
```
======================================================================
ERROR [0.774s]: test_grouped_over_window_with_key (pyspark.sql.tests.test_pandas_grouped_map.GroupedMapInPandasTests)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/__w/spark/spark/python/pyspark/sql/tests/test_pandas_grouped_map.py", line 595, in test_grouped_over_window_with_key
.select('id', 'result').collect()
File "/__w/spark/spark/python/pyspark/sql/dataframe.py", line 588, in collect
sock_info = self._jdf.collectToPython()
File "/__w/spark/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/__w/spark/spark/python/pyspark/sql/utils.py", line 117, in deco
raise converted from None
pyspark.sql.utils.PythonException:
An exception was thrown from the Python worker. Please see the stack trace below.
Traceback (most recent call last):
File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 601, in main
process()
File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 593, in process
serializer.dump_stream(out_iter, outfile)
File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", line 255, in dump_stream
return ArrowStreamSerializer.dump_stream(self, init_stream_yield_batches(), stream)
File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", line 81, in dump_stream
for batch in iterator:
File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/sql/pandas/serializers.py", line 248, in init_stream_yield_batches
for series in iterator:
File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 426, in mapper
return f(keys, vals)
File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 170, in <lambda>
return lambda k, v: [(wrapped(k, v), to_arrow_type(return_type))]
File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/worker.py", line 158, in wrapped
result = f(key, pd.concat(value_series, axis=1))
File "/__w/spark/spark/python/lib/pyspark.zip/pyspark/util.py", line 68, in wrapper
return f(*args, **kwargs)
File "/__w/spark/spark/python/pyspark/sql/tests/test_pandas_grouped_map.py", line 590, in f
"{} != {}".format(expected_key[i][1], window_range)
AssertionError: {'start': datetime.datetime(2018, 3, 15, 0, 0), 'end': datetime.datetime(2018, 3, 20, 0, 0)} != {'start': datetime.datetime(2018, 3, 15, 0, 0, tzinfo=<StaticTzInfo 'Etc/UTC'>), 'end': datetime.datetime(2018, 3, 20, 0, 0, tzinfo=<StaticTzInfo 'Etc/UTC'>)}
```
https://github.com/apache/spark/runs/1278917457
This PR proposes to set the upper bound of PyArrow in the GitHub Actions build. This should be removed once PyArrow 2.0.0+ is properly supported (SPARK-33189).
### Why are the changes needed?
To make build pass.
### Does this PR introduce _any_ user-facing change?
No, dev-only.
### How was this patch tested?
GitHub Actions in this build will test it out.
Closes#30098 from HyukjinKwon/hot-fix-test.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
The Postgres and MSSQL connection providers are not able to get a custom `appEntry` because, under some circumstances, the driver is wrapped with `DriverWrapper`. Such a case is not handled in the mentioned providers. In this PR I've added handling for this edge case by passing the unwrapped `Driver` from `JdbcUtils`.
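A hedged sketch of the unwrapping step (a hypothetical helper; the class visibility and the `wrapped` field name are assumptions about Spark's JDBC internals, not taken from the actual diff):
```
import java.sql.Driver
import org.apache.spark.sql.execution.datasources.jdbc.DriverWrapper

// Hypothetical sketch: hand the connection provider the underlying Driver
// rather than the DriverWrapper it may have been registered with.
def unwrapDriver(driver: Driver): Driver = driver match {
  case wrapper: DriverWrapper => wrapper.wrapped // assumes the wrapped driver is exposed as `wrapped`
  case other => other
}
```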
### Why are the changes needed?
`DriverWrapper` is not considered.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing + additional unit tests.
Closes#30024 from gaborgsomogyi/SPARK-32229.
Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
1. Add the SQL config `spark.sql.legacy.parquet.int96RebaseModeInWrite` to control timestamp rebasing when saving timestamps as INT96 (a usage sketch follows this list). It supports the same set of values as `spark.sql.legacy.parquet.datetimeRebaseModeInWrite` but the default value is `LEGACY` to preserve backward compatibility with Spark <= 3.0.
2. Write the metadata key `org.apache.spark.int96NoRebase` to parquet files if `spark.sql.legacy.parquet.int96RebaseModeInWrite` is not set to `LEGACY`.
3. Add the SQL config `spark.sql.legacy.parquet.int96RebaseModeInRead` to control loading of INT96 timestamps when the parquet metadata doesn't have enough info about the writer (the `org.apache.spark.int96NoRebase` tag), i.e. whether INT96 was written by a Proleptic Gregorian system or some Julian one.
4. Modify the vectorized and parquet-mr readers to support loading/saving INT96 timestamps without rebasing, depending on the SQL configs and the metadata tag:
- **No rebasing** in testing when the SQL config `spark.test.forceNoRebase` is set to `true`
- **No rebasing** if the parquet metadata contains the tag `org.apache.spark.int96NoRebase`. This is the case when parquet files are saved by Spark >= 3.1 with `spark.sql.legacy.parquet.int96RebaseModeInWrite` set to `CORRECTED`, or saved by other systems that set the tag `org.apache.spark.int96NoRebase`.
- **With rebasing** if parquet files were saved by Spark (any version) without the metadata tag `org.apache.spark.int96NoRebase`.
- Rebasing depends on the SQL config `spark.sql.legacy.parquet.int96RebaseModeInRead` if there are no metadata tags `org.apache.spark.version` and `org.apache.spark.int96NoRebase`.
New SQL configs are added instead of re-using the existing `spark.sql.legacy.parquet.datetimeRebaseModeInWrite` and `spark.sql.legacy.parquet.datetimeRebaseModeInRead` for the following reasons:
- To allow users to have different modes for INT96 and for TIMESTAMP_MICROS (MILLIS). For example, users might want to save INT96 as LEGACY but TIMESTAMP_MICROS as CORRECTED.
- To have different modes for INT96 and DATE on load (or on save).
- To be backward compatible with Spark 2.4. For now, `spark.sql.legacy.parquet.datetimeRebaseModeInWrite/Read` are set to `EXCEPTION` by default.
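A hedged usage sketch of the new write config (a hypothetical spark-shell snippet; the path and data are illustrative, and it assumes the default INT96 output type for parquet timestamps):
```
import spark.implicits._

// Hypothetical sketch: write INT96 timestamps without rebasing; per item 2 above,
// this should also store the org.apache.spark.int96NoRebase metadata key.
spark.conf.set("spark.sql.legacy.parquet.int96RebaseModeInWrite", "CORRECTED")
Seq(java.sql.Timestamp.valueOf("1001-01-01 00:00:00")).toDF("ts")
  .write.mode("overwrite").parquet("/tmp/int96_no_rebase")
```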
### Why are the changes needed?
1. The Parquet spec says that INT96 must be stored as Julian days (see https://github.com/apache/parquet-format/pull/49). This doesn't mean that a reader (or a writer) is based on the Julian calendar, so rebasing from the Proleptic Gregorian to the Julian calendar may not be needed.
2. Rebasing from/to the Julian calendar can lose information because dates in one calendar don't exist in the other. For example, 1582-10-04..1582-10-15 exist in the Proleptic Gregorian calendar but not in the hybrid calendar (Julian + Gregorian), and vice versa, the Julian date 1000-02-29 doesn't exist in the Proleptic Gregorian calendar. We should allow users to save timestamps without losing such dates (rebasing shifts such dates to the next valid date).
3. It would also make Spark compatible with other systems such as Impala and newer versions of Hive that write proleptic Gregorian based INT96 timestamps.
### Does this PR introduce _any_ user-facing change?
It can, when `spark.sql.legacy.parquet.int96RebaseModeInWrite` is set to a non-default value such as `CORRECTED`.
### How was this patch tested?
- Added a test to check the metadata key `org.apache.spark.int96NoRebase`
- By `ParquetIOSuite`
Closes#30056 from MaxGekk/parquet-rebase-int96.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Currently, actual non-dynamic partition pruning is executed in the optimizer phase (PruneFileSourcePartitions) if an input relation has a catalog file index. The current code assumes the same partition filters are generated again in FileSourceStrategy and passed into FileSourceScanExec. FileSourceScanExec uses the partition filters when listing files, but these non-dynamic partition filters do nothing because unnecessary partitions are already pruned in advance, so in this case the filters are mainly used for explain output. If a WHERE clause has DNF-ed predicates, FileSourceStrategy cannot extract the same filters as PruneFileSourcePartitions, and then PartitionFilters is not shown in the explain output.
This patch proposes to extract the partition filters again in FileSourceStrategy and HiveStrategy with `extractPredicatesWithinOutputSet` added in https://github.com/apache/spark/pull/29101/files#diff-6be42cfa3c62a7536b1eb1d6447c073c, so that the partially pushed-down partition filters are shown in explain().
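A hedged repro sketch (hypothetical spark-shell snippet; the table name and rows are illustrative) that builds a partitioned datasource table and runs the DNF-ed predicate used in the plans below:
```
// Hypothetical sketch: partitioned datasource table plus a DNF-ed WHERE clause,
// so explain() output can be compared before and after this change.
spark.sql("CREATE TABLE t (i INT, p STRING) USING parquet PARTITIONED BY (p)")
spark.sql("INSERT INTO t VALUES (1, '1'), (1, '2')")
spark.sql("SELECT * FROM t WHERE p = '1' OR (p = '2' AND i = 1)").explain()
```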
### Why are the changes needed?
Without the patch, the explained plan is inconsistent with what is actually executed.
<b>Without the change</b>, the explained plans of `"SELECT * FROM t WHERE p = '1' OR (p = '2' AND i = 1)"` for datasource and Hive tables are as follows, respectively (the pushed-down partition filters are missing):
```
== Physical Plan ==
*(1) Filter ((p#21 = 1) OR ((p#21 = 2) AND (i#20 = 1)))
+- *(1) ColumnarToRow
+- FileScan parquet default.t[i#20,p#21] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/Users/nanzhu/code/spark/sql/hive/target/tmp/hive_execution_test_group/war..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<i:int>
```
```
== Physical Plan ==
*(1) Filter ((p#33 = 1) OR ((p#33 = 2) AND (i#32 = 1)))
+- Scan hive default.t [i#32, p#33], HiveTableRelation [`default`.`t`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [i#32], Partition Cols: [p#33], Pruned Partitions: [(p=1), (p=2)]]
```
<b>With the change</b>, the plans look like the following (the actually executed partition filters are shown):
```
== Physical Plan ==
*(1) Filter ((p#21 = 1) OR ((p#21 = 2) AND (i#20 = 1)))
+- *(1) ColumnarToRow
+- FileScan parquet default.t[i#20,p#21] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex[file:/Users/nanzhu/code/spark/sql/hive/target/tmp/hive_execution_test_group/war..., PartitionFilters: [((p#21 = 1) OR (p#21 = 2))], PushedFilters: [], ReadSchema: struct<i:int>
```
```
== Physical Plan ==
*(1) Filter ((p#37 = 1) OR ((p#37 = 2) AND (i#36 = 1)))
+- Scan hive default.t [i#36, p#37], HiveTableRelation [`default`.`t`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [i#36], Partition Cols: [p#37], Pruned Partitions: [(p=1), (p=2)]], [((p#37 = 1) OR (p#37 = 2))]
```
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Unit test.
Closes#29831 from CodingCat/SPARK-32351.
Lead-authored-by: Nan Zhu <nanzhu@uber.com>
Co-authored-by: Nan Zhu <CodingCat@users.noreply.github.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>