Commit graph

27357 commits

Kousuke Saruta f7501ddd70 Revert "[SPARK-30119][WEBUI] Add Pagination Support to Streaming Page"
This PR reverts #28439 because that PR breaks the QA build.

Closes #28747 from sarutak/revert-SPARK-30119.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
2020-06-08 10:10:52 +09:00
iRakson e9337f505b [SPARK-30119][WEBUI] Add Pagination Support to Streaming Page
### What changes were proposed in this pull request?
* Pagination support is added to all tables of the Streaming page in the Spark Web UI.
For adding pagination support, existing classes from #7399 were used.
* Earlier the Streaming page had two tables, `Active Batches` and `Completed Batches`. Now we will have three tables: `Running Batches`, `Waiting Batches` and `Completed Batches`. If there is a large number of waiting and running batches, keeping track of them in a single table is difficult; other pages also use different tables for different types of data.
* Earlier empty tables were shown; now only non-empty tables will be shown.
The `Active Batches` table used to show details of waiting batches followed by running batches.

### Why are the changes needed?
Pagination will allow users to analyse the tables much more easily. All Spark Web UI pages support pagination apart from the streaming pages, so this also adds consistency. It might also fix the potential OOM errors that can arise.

### Does this PR introduce _any_ user-facing change?
Yes. The `Active Batches` table is split into two tables, `Running Batches` and `Waiting Batches`. Pagination support is added to all the tables. Every other functionality is unchanged.

### How was this patch tested?
Manually.

Before changes:
<img width="1667" alt="Screenshot 2020-05-03 at 7 07 14 PM" src="https://user-images.githubusercontent.com/15366835/80915680-8fb44b80-8d71-11ea-9957-c4a3769b8b67.png">

After Changes:
<img width="1669" alt="Screenshot 2020-05-03 at 6 51 22 PM" src="https://user-images.githubusercontent.com/15366835/80915694-a9ee2980-8d71-11ea-8fc5-246413a4951d.png">

Closes #28439 from iRakson/streamingPagination.

Authored-by: iRakson <raksonrakesh@gmail.com>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
2020-06-07 13:08:50 +09:00
Gabor Somogyi 04f66bfd4e [MINOR][SS][DOCS] fileNameOnly parameter description re-unite
### What changes were proposed in this pull request?
The `fileNameOnly` parameter description was split into two pieces in [this](dbb8143501) commit. This PR re-unites it.

### Why are the changes needed?
The parameter description is split in the doc.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
```
cd docs/
SKIP_API=1 jekyll build
```
Manual webpage check.

Closes #28739 from gaborgsomogyi/datasettxtfix.

Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-06-06 16:49:48 +09:00
LantaoJin 5079831106 [SPARK-31904][SQL] Fix case sensitive problem of char and varchar partition columns
### What changes were proposed in this pull request?
```sql
CREATE TABLE t1(a STRING, B VARCHAR(10), C CHAR(10)) STORED AS parquet;
CREATE TABLE t2 USING parquet PARTITIONED BY (b, c) AS SELECT * FROM t1;
SELECT * FROM t2 WHERE b = 'A';
```
The SQL above throws a MetaException:

```
Caused by: java.lang.reflect.InvocationTargetException
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.spark.sql.hive.client.Shim_v0_13.getPartitionsByFilter(HiveShim.scala:810)
	... 114 more
Caused by: MetaException(message:Filtering is supported only on partition keys of type string, or integral types)
	at org.apache.hadoop.hive.metastore.parser.ExpressionTree$FilterBuilder.setError(ExpressionTree.java:184)
	at org.apache.hadoop.hive.metastore.parser.ExpressionTree$LeafNode.getJdoFilterPushdownParam(ExpressionTree.java:439)
	at org.apache.hadoop.hive.metastore.parser.ExpressionTree$LeafNode.generateJDOFilterOverPartitions(ExpressionTree.java:356)
	at org.apache.hadoop.hive.metastore.parser.ExpressionTree$LeafNode.generateJDOFilter(ExpressionTree.java:278)
	at org.apache.hadoop.hive.metastore.parser.ExpressionTree.generateJDOFilterFragment(ExpressionTree.java:583)
	at org.apache.hadoop.hive.metastore.ObjectStore.makeQueryFilterString(ObjectStore.java:3315)
	at org.apache.hadoop.hive.metastore.ObjectStore.getPartitionsViaOrmFilter(ObjectStore.java:2768)
	at org.apache.hadoop.hive.metastore.ObjectStore.access$500(ObjectStore.java:182)
	at org.apache.hadoop.hive.metastore.ObjectStore$7.getJdoResult(ObjectStore.java:3248)
	at org.apache.hadoop.hive.metastore.ObjectStore$7.getJdoResult(ObjectStore.java:3232)
	at org.apache.hadoop.hive.metastore.ObjectStore$GetHelper.run(ObjectStore.java:2974)
	at org.apache.hadoop.hive.metastore.ObjectStore.getPartitionsByFilterInternal(ObjectStore.java:3250)
	at org.apache.hadoop.hive.metastore.ObjectStore.getPartitionsByFilter(ObjectStore.java:2906)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.hadoop.hive.metastore.RawStoreProxy.invoke(RawStoreProxy.java:101)
	at com.sun.proxy.$Proxy25.getPartitionsByFilter(Unknown Source)
	at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.get_partitions_by_filter(HiveMetaStore.java:5093)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.hadoop.hive.metastore.RetryingHMSHandler.invokeInternal(RetryingHMSHandler.java:148)
	at org.apache.hadoop.hive.metastore.RetryingHMSHandler.invoke(RetryingHMSHandler.java:107)
	at com.sun.proxy.$Proxy26.get_partitions_by_filter(Unknown Source)
	at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.listPartitionsByFilter(HiveMetaStoreClient.java:1232)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:173)
	at com.sun.proxy.$Proxy27.listPartitionsByFilter(Unknown Source)
	at org.apache.hadoop.hive.ql.metadata.Hive.getPartitionsByFilter(Hive.java:2679)
	... 119 more
```

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Add a unit test.

Closes #28724 from LantaoJin/SPARK-31904.

Authored-by: LantaoJin <jinlantao@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-06-06 07:35:25 +09:00
Kent Yao fc6af9d900 [SPARK-31867][SQL][FOLLOWUP] Check result differences for datetime formatting
### What changes were proposed in this pull request?

In this PR, we throw `SparkUpgradeException` when getting `DateTimeException` for datetime formatting in the `EXCEPTION` legacy Time Parser Policy.

### Why are the changes needed?
`DateTimeException` is also declared by `java.time.format.DateTimeFormatter#format`, but in Spark it can barely occur. So far we have suspected only one case, caused by a JDK bug; see https://bugs.openjdk.java.net/browse/JDK-8079628.

For the `from_unixtime` function, we would otherwise suppress the DateTimeException caused by `DD` and return `NULL`. That is a silent data change that should be avoided on Java 8.

### Does this PR introduce _any_ user-facing change?

Yes. When running on Java 8 and using the `from_unixtime` function with pattern `DD` to format datetimes, if the day of year is >= 100, a `SparkUpgradeException` will alert users instead of silently returning null. For `date_format`, `SparkUpgradeException` takes the place of `DateTimeException`.
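
A hedged illustration based on the description above (the epoch value is arbitrary; 1589472520 falls around 2020-05-14, i.e. a day-of-year well above 100):

```scala
spark.sql("SELECT from_unixtime(1589472520, 'DD')").show()
// Legacy behavior: silently returns NULL (a silent data change).
// With this PR: throws org.apache.spark.SparkUpgradeException, pointing at the 'DD' pattern.
```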

### How was this patch tested?

add unit tests.

Closes #28736 from yaooqinn/SPARK-31867-F.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-06-05 16:44:16 +00:00
Max Gekk 2c9988eaf3 [SPARK-31910][SQL] Enable Java 8 time API in Thrift server
### What changes were proposed in this pull request?
Set `spark.sql.datetime.java8API.enabled` to `true` in:
1. `SparkSQLEnv.init()` of Thrift server, and
2. `SparkSQLSessionManager.openSession()`

### Why are the changes needed?
1. Date and timestamp string literals are parsed by using the Java 8 time API and Spark's session time zone. Before the changes, date/timestamp values were collected as the legacy types `java.sql.Date`/`java.sql.Timestamp`, and values of such types didn't respect the config `spark.sql.session.timeZone`. To have a consistent view, users had to keep the JVM time zone and Spark's session time zone in sync.
2. After the changes, formatting of date values doesn't depend on JVM time zone.
3. While returning dates/timestamps of Java 8 type, we can avoid dates/timestamps rebasing from Proleptic Gregorian calendar to the hybrid calendar (Julian + Gregorian), and the issues related to calendar switching.
4. Properly handle negative years (BCE).
5. Consistent conversion of date/timestamp strings to/from internal Catalyst types in both directions, to and from Spark.

### Does this PR introduce any user-facing change?
Yes. Before:
```sql
spark-sql> select make_date(-44, 3, 15);
0045-03-15
```
After:
```sql
spark-sql> select make_date(-44, 3, 15);
-0044-03-15
```

### How was this patch tested?
Manually via `bin/spark-sql`.

Closes #28729 from MaxGekk/enable-java8-time-api-in-thrift-server.

Lead-authored-by: Max Gekk <max.gekk@gmail.com>
Co-authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-06-05 14:18:16 +00:00
Juliusz Sompolski ea010138e9 [SPARK-31859][SPARK-31861][FOLLOWUP] Fix typo in tests
### What changes were proposed in this pull request?

It appears I have unintentionally used nested JDBC statements in the two tests I added.

### Why are the changes needed?

Clean up a typo. Please merge to master/branch-3.0.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit tests.

Closes #28735 from juliuszsompolski/SPARK-31859-fixup.

Authored-by: Juliusz Sompolski <julek@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-06-05 10:01:00 +00:00
Kent Yao 9d5b5d0a58 [SPARK-31879][SQL][TEST-JAVA11] Make week-based pattern invalid for formatting too
### What changes were proposed in this pull request?

After all these attempts (https://github.com/apache/spark/pull/28692, https://github.com/apache/spark/pull/28719 and https://github.com/apache/spark/pull/28727), they all have limitations as mentioned in their discussions.

Maybe the only way is to forbid them all.

### Why are the changes needed?

These week-based fields need a Locale to express their semantics; the first day of the week varies from country to country.

From the Java doc of WeekFields
```java
    /**
     * Gets the first day-of-week.
     * <p>
     * The first day-of-week varies by culture.
     * For example, the US uses Sunday, while France and the ISO-8601 standard use Monday.
     * This method returns the first day using the standard {@code DayOfWeek} enum.
     *
     * @return the first day-of-week, not null
     */
    public DayOfWeek getFirstDayOfWeek() {
        return firstDayOfWeek;
    }
```

But for the SimpleDateFormat, the day-of-week is not localized

```
u	Day number of week (1 = Monday, ..., 7 = Sunday)	Number	1
```

Currently, the default locale we use is the US, so the result can move backward by a day, a week, or even a year.

e.g.

For the date `2019-12-29` (a Sunday): in a Sunday-start system (e.g. en-US) it belongs to 2020 in the week-based year, while in a Monday-start system (en-GB) it goes to 2019. The week-of-week-based-year (`w`) field is affected too.

```sql
spark-sql> SELECT to_csv(named_struct('time', to_timestamp('2019-12-29', 'yyyy-MM-dd')), map('timestampFormat', 'YYYY', 'locale', 'en-US'));
2020
spark-sql> SELECT to_csv(named_struct('time', to_timestamp('2019-12-29', 'yyyy-MM-dd')), map('timestampFormat', 'YYYY', 'locale', 'en-GB'));
2019

spark-sql> SELECT to_csv(named_struct('time', to_timestamp('2019-12-29', 'yyyy-MM-dd')), map('timestampFormat', 'YYYY-ww-uu', 'locale', 'en-US'));
2020-01-01
spark-sql> SELECT to_csv(named_struct('time', to_timestamp('2019-12-29', 'yyyy-MM-dd')), map('timestampFormat', 'YYYY-ww-uu', 'locale', 'en-GB'));
2019-52-07

spark-sql> SELECT to_csv(named_struct('time', to_timestamp('2020-01-05', 'yyyy-MM-dd')), map('timestampFormat', 'YYYY-ww-uu', 'locale', 'en-US'));
2020-02-01
spark-sql> SELECT to_csv(named_struct('time', to_timestamp('2020-01-05', 'yyyy-MM-dd')), map('timestampFormat', 'YYYY-ww-uu', 'locale', 'en-GB'));
2020-01-07
```

For other countries, please refer to [First Day of the Week in Different Countries](http://chartsbin.com/view/41671)

### Does this PR introduce _any_ user-facing change?
With this change, users can no longer use 'Y', 'w', 'u' or 'W', but can use 'e' instead of 'u'. This at least prevents a silent data change.

### How was this patch tested?

add unit tests

Closes #28728 from yaooqinn/SPARK-31879-NEW2.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-06-05 08:14:01 +00:00
HyukjinKwon a8266e44d4 Revert "[SPARK-28624][SQL][TESTS] Run date.sql via Thrift Server"
This reverts commit a4195d28ae.
2020-06-05 15:43:11 +09:00
HyukjinKwon 53ce58da34 [MINOR][PYTHON] Add one more newline between JVM and Python tracebacks
### What changes were proposed in this pull request?

This PR proposes to add one more newline to clearly separate JVM and Python tracebacks:

Before:

```
Traceback (most recent call last):
  ...
pyspark.sql.utils.AnalysisException: Reference 'column' is ambiguous, could be: column, column.;
JVM stacktrace:
org.apache.spark.sql.AnalysisException: Reference 'column' is ambiguous, could be: column, column.;
  ...
```

After:

```
Traceback (most recent call last):
  ...
pyspark.sql.utils.AnalysisException: Reference 'column' is ambiguous, could be: column, column.;

JVM stacktrace:
org.apache.spark.sql.AnalysisException: Reference 'column' is ambiguous, could be: column, column.;
  ...
```

This is kind of a followup of e69466056f (SPARK-31849).

### Why are the changes needed?

To make it easier to read.

### Does this PR introduce _any_ user-facing change?

It's in the unreleased branches.

### How was this patch tested?

Manually tested.

Closes #28732 from HyukjinKwon/python-minor.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-06-05 13:31:35 +09:00
Takuya UESHIN 632b5bce23 [SPARK-31903][SQL][PYSPARK][R] Fix toPandas with Arrow enabled to show metrics in Query UI
### What changes were proposed in this pull request?

In `Dataset.collectAsArrowToR` and `Dataset.collectAsArrowToPython`, since the code block for `serveToStream` is run in a separate thread, `withAction` finishes as soon as it starts the thread. As a result, it doesn't collect the metrics of the actual action and the Query UI shows the plan graph without metrics.

We should call `serveToStream` first, then `withAction` inside it.

### Why are the changes needed?

When calling toPandas, the Query UI usually shows each plan node's metrics and the corresponding Stage ID and Task ID:

```py
>>> df = spark.createDataFrame([(1, 10, 'abc'), (2, 20, 'def')], schema=['x', 'y', 'z'])
>>> df.toPandas()
   x   y    z
0  1  10  abc
1  2  20  def
```

![Screen Shot 2020-06-03 at 4 47 07 PM](https://user-images.githubusercontent.com/506656/83815735-bec22380-a675-11ea-8ecc-bf2954731f35.png)

but if Arrow execution is enabled, it shows only plan nodes and the duration is not correct:

```py
>>> spark.conf.set('spark.sql.execution.arrow.pyspark.enabled', True)
>>> df.toPandas()
   x   y    z
0  1  10  abc
1  2  20  def
```

![Screen Shot 2020-06-03 at 4 47 27 PM](https://user-images.githubusercontent.com/506656/83815804-de594c00-a675-11ea-933a-d0ffc0f534dd.png)

### Does this PR introduce _any_ user-facing change?

Yes, the Query UI will show the plan with the correct metrics.

### How was this patch tested?

I checked it manually in my local environment.

![Screen Shot 2020-06-04 at 3 19 41 PM](https://user-images.githubusercontent.com/506656/83816265-d77f0900-a676-11ea-84b8-2a8d80428bc6.png)

Closes #28730 from ueshin/issues/SPARK-31903/to_pandas_with_arrow_query_ui.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-06-05 12:53:58 +09:00
Max Gekk a4195d28ae [SPARK-28624][SQL][TESTS] Run date.sql via Thrift Server
### What changes were proposed in this pull request?
Enable `date.sql` and run it via Thrift Server in `ThriftServerQueryTestSuite`.

### Why are the changes needed?
To improve test coverage.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By running the enabled tests via:
```
$ build/sbt -Phive-thriftserver "hive-thriftserver/test-only *ThriftServerQueryTestSuite -- -z date.sql"
```

Closes #28721 from MaxGekk/enable-date.sql-for-thrift.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-06-04 10:45:36 +09:00
Enrico Minack 4bbe3c2bb4 [SPARK-31853][DOCS] Mention removal of params mixins setter in migration guide
### What changes were proposed in this pull request?
The PySpark Migration Guide needs to mention a breaking change in the PySpark ML API.

### Why are the changes needed?
In SPARK-29093, all setters have been removed from `Params` mixins in `pyspark.ml.param.shared`. Those setters had been part of the public pyspark ML API, hence this is a breaking change.

### Does this PR introduce _any_ user-facing change?
Only documentation.

### How was this patch tested?
Visually.

Closes #28663 from EnricoMi/branch-pyspark-migration-guide-setters.

Authored-by: Enrico Minack <github@enrico.minack.dev>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-06-03 18:06:13 -05:00
Wenchen Fan dc0709fa0c [SPARK-29947][SQL][FOLLOWUP] ResolveRelations should return relations with fresh attribute IDs
### What changes were proposed in this pull request?

This is a followup of https://github.com/apache/spark/pull/26589, which caches the table relations to speed up the table lookup. However, it brings some side effects: the rule `ResolveRelations` may return exactly the same relations, while before it always returns relations with fresh attribute IDs.

This PR is to eliminate this side effect.

### Why are the changes needed?

There is no bug report yet, but this side effect may impact things like self-join. It's better to restore the 2.4 behavior and always return fresh relations.
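
A hedged illustration (not the PR's test) of why fresh attribute IDs matter: both legs of the self-join below resolve the same table, so if `ResolveRelations` returned the very same relation object, both legs would carry identical attribute IDs and the join condition could not tell them apart.

```scala
spark.range(10).write.saveAsTable("t")                    // hypothetical table with an `id` column
spark.table("t").as("l")
  .join(spark.table("t").as("r"), $"l.id" === $"r.id")
  .explain(true)                                          // each leg should show a distinct id#N
```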

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

N/A

Closes #28717 from cloud-fan/fix.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-06-03 19:08:36 +00:00
Wenchen Fan e61d0de11f Revert "[SPARK-31879][SQL] Using GB as default Locale for datetime formatters"
This reverts commit c59f51bcc2.
2020-06-04 01:54:22 +08:00
Kent Yao afcc14c6d2 [SPARK-31896][SQL] Handle am-pm timestamp parsing when hour is missing
### What changes were proposed in this pull request?

This PR sets the hour to 12/0 when the AMPM_OF_DAY field exists.

### Why are the changes needed?

When the hour is absent but the am-pm marker is present, the parsed time is incorrect for pm.

### Does this PR introduce _any_ user-facing change?
Yes, the change is user-facing, but it changes the behavior back to 2.4's to keep backward compatibility.

e.g.
```sql
spark-sql> select to_timestamp('33:33 PM', 'mm:ss a');
1970-01-01 12:33:33
spark-sql> select to_timestamp('33:33 AM', 'mm:ss a');
1970-01-01 00:33:33

```

otherwise, the results are all `1970-01-01 00:33:33`

### How was this patch tested?

add unit tests

Closes #28713 from yaooqinn/SPARK-31896.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-06-03 13:30:22 +00:00
Wenchen Fan 349015dce0 fix compilation 2020-06-03 20:11:41 +08:00
Max Gekk 125a89ce08 [SPARK-31878][SQL] Create date formatter only once in HiveResult
### What changes were proposed in this pull request?
1. Replace `def dateFormatter` with `val dateFormatter`.
2. Modify the `date formatting in hive result` test in `HiveResultSuite` to check modified code on various time zones.

### Why are the changes needed?
To avoid creating a `DateFormatter` for every incoming date in `HiveResult.toHiveString`. This should eliminate unnecessary creation of `SimpleDateFormat` instances and compilation of the default pattern `yyyy-MM-dd`. The changes can speed up processing of legacy date values of the `java.sql.Date` type, which is collected by default.
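
A minimal, standalone sketch of the `def` vs `val` difference described above (this is not the `HiveResult` code itself):

```scala
import java.time.format.DateTimeFormatter

object PerCall { def dateFormatter = DateTimeFormatter.ofPattern("yyyy-MM-dd") } // new instance per access
object Once    { val dateFormatter = DateTimeFormatter.ofPattern("yyyy-MM-dd") } // created exactly once

println(PerCall.dateFormatter eq PerCall.dateFormatter) // false: a fresh formatter every time
println(Once.dateFormatter eq Once.dateFormatter)       // true: the same cached instance
```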

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Modified a test in `HiveResultSuite`.

Closes #28687 from MaxGekk/HiveResult-val-dateFormatter.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-06-03 12:00:20 +00:00
Dongjoon Hyun e5b9b862e6 [SPARK-31881][K8S][TESTS][FOLLOWUP] Activate hadoop-2.7 by default in K8S IT
### What changes were proposed in this pull request?

This PR aims to activate `hadoop-2.7` profile by default in Kubernetes IT module.

### Why are the changes needed?

While SPARK-31881 added Hadoop 3.2 support, one default test dependency was moved to the `hadoop-2.7` profile. It works when we give either `hadoop-2.7` or `hadoop-3.2`, but it fails when we don't give any profile.

**BEFORE**
```
$ mvn test-compile -pl resource-managers/kubernetes/integration-tests -Pkubernetes-integration-tests
...
[ERROR] [Error] /APACHE/spark-merge/resource-managers/kubernetes/integration-tests/src/test/scala/org/apache/spark/deploy/k8s/integrationtest/DepsTestsSuite.scala:23:
object amazonaws is not a member of package com
```

**AFTER**
```
$ mvn test-compile -pl resource-managers/kubernetes/integration-tests -Pkubernetes-integration-tests
..
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
```

The default activated profile will be overridden when we give `hadoop-3.2`.
```
$ mvn help:active-profiles -Pkubernetes-integration-tests
...
Active Profiles for Project 'org.apache.spark:spark-kubernetes-integration-tests_2.12:jar:3.1.0-SNAPSHOT':

The following profiles are active:

 - hadoop-2.7 (source: org.apache.spark:spark-kubernetes-integration-tests_2.12:3.1.0-SNAPSHOT)
 - kubernetes-integration-tests (source: org.apache.spark:spark-parent_2.12:3.1.0-SNAPSHOT)
 - test-java-home (source: org.apache.spark:spark-parent_2.12:3.1.0-SNAPSHOT)
```
```
$ mvn help:active-profiles -Pkubernetes-integration-tests -Phadoop-3.2
...
Active Profiles for Project 'org.apache.spark:spark-kubernetes-integration-tests_2.12:jar:3.1.0-SNAPSHOT':

The following profiles are active:

 - hadoop-3.2 (source: org.apache.spark:spark-kubernetes-integration-tests_2.12:3.1.0-SNAPSHOT)
 - hadoop-3.2 (source: org.apache.spark:spark-parent_2.12:3.1.0-SNAPSHOT)
 - kubernetes-integration-tests (source: org.apache.spark:spark-parent_2.12:3.1.0-SNAPSHOT)
 - test-java-home (source: org.apache.spark:spark-parent_2.12:3.1.0-SNAPSHOT)
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the Jenkins UT and IT.

Currently, all Jenkins builds and tests (UT & IT) pass without this patch. This should be tested manually with the above command.

`hadoop-3.2` K8s IT also passed like the following.
```
KubernetesSuite:
- Run SparkPi with no resources
- Run SparkPi with a very long application name.
- Use SparkLauncher.NO_RESOURCE
- Run SparkPi with a master URL without a scheme.
- Run SparkPi with an argument.
- Run SparkPi with custom labels, annotations, and environment variables.
- All pods have the same service account by default
- Run extraJVMOptions check on driver
- Run SparkRemoteFileTest using a remote data file
- Run SparkPi with env and mount secrets.
- Run PySpark on simple pi.py example
- Run PySpark with Python2 to test a pyfiles example
- Run PySpark with Python3 to test a pyfiles example
- Run PySpark with memory customization
- Run in client mode.
- Start pod creation from template
- PVs with local storage
- Launcher client dependencies
- Test basic decommissioning
Run completed in 8 minutes, 33 seconds.
Total number of tests run: 19
Suites: completed 2, aborted 0
Tests: succeeded 19, failed 0, canceled 0, ignored 0, pending 0
All tests passed.
```

Closes #28716 from dongjoon-hyun/SPARK-31881-2.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-06-03 02:17:25 -07:00
Kousuke Saruta 8ed93c9355 [SPARK-31886][WEBUI] Fix the wrong coloring of nodes in DAG-viz
### What changes were proposed in this pull request?

This PR fixes a wrong coloring issue in the DAG-viz.
In the Job Page and Stage Page, nodes which are associated with "barrier mode" in the DAG-viz will be colored pale green.
But with some types of jobs, nodes which are not associated with the mode are also colored.
You can reproduce with the following operation.
```
sc.parallelize(1 to 10).barrier.mapPartitions(identity).repartition(1).collect()
```
<img width="376" alt="wrong-coloring" src="https://user-images.githubusercontent.com/4736016/83403670-1711df00-a444-11ea-9457-c683f75bc566.png">

In the screenshot above, `repartition` in `Stage 1` is not associated with barrier mode, so the corresponding node should not be colored pale green.

The cause of this issue is that the logic which chooses the HTML elements to be colored is wrong.
The logic chooses such elements based on whether each element is associated with a style class (`clusterId` in the code).
But when an operation crosses over a shuffle (like `repartition` above), a `clusterId` can be duplicated, and the non-barrier-mode node is also associated with the same `clusterId`.

### Why are the changes needed?

This is a bug.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Newly added test case with the following command.
```
build/sbt -Dtest.default.exclude.tags= -Dspark.test.webdriver.chrome.driver=/path/to/chromedriver "testOnly org.apache.spark.ui.ChromeUISeleniumSuite -- -z SPARK-31886"
```

Closes #28694 from sarutak/fix-wrong-barrier-color.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
2020-06-03 01:15:36 -07:00
Kent Yao afe95bd9ad [SPARK-31892][SQL] Disable week-based date filed for parsing
### What changes were proposed in this pull request?

This PR disables week-based date fields for parsing.

closes #28674
### Why are the changes needed?

1. It's an un-fixable behavior change to fill the gap between SimpleDateFormat and DateTimeFormatter and keep backward compatibility across different JDKs. A lot of effort has been made to prove it at https://github.com/apache/spark/pull/28674

2. The existing behavior itself in 2.4 is confusing, e.g.

```sql
spark-sql> select to_timestamp('1', 'w');
1969-12-28 00:00:00
spark-sql> select to_timestamp('1', 'u');
1970-01-05 00:00:00
```
  The 'u' here does not seem to go to the Monday of the first week in the week-based form, or to the first day of the year in the non-week-based form, but to the Monday of the second week in the week-based form.

And, e.g.
```sql
spark-sql> select to_timestamp('2020 2020', 'YYYY yyyy');
2020-01-01 00:00:00
spark-sql> select to_timestamp('2020 2020', 'yyyy YYYY');
2019-12-29 00:00:00
spark-sql> select to_timestamp('2020 2020 1', 'YYYY yyyy w');
NULL
spark-sql> select to_timestamp('2020 2020 1', 'yyyy YYYY w');
2019-12-29 00:00:00
```

  I think we don't need to introduce all the weird behavior from Java.

3. The current test coverage for week-based date fields is almost 0%, which indicates that we've never imagined using it.

4. Avoiding JDK bugs

https://issues.apache.org/jira/browse/SPARK-31880

### Does this PR introduce _any_ user-facing change?

Yes, the 'Y'/'W'/'w'/'u'/'F'/'E' patterns cannot be used in datetime parsing functions.

### How was this patch tested?

more tests added

Closes #28706 from yaooqinn/SPARK-31892.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-06-03 06:49:58 +00:00
Kent Yao c59f51bcc2 [SPARK-31879][SQL] Using GB as default Locale for datetime formatters
### What changes were proposed in this pull request?

This PR switches the default Locale from `US` to `GB` to change the behavior of the first day of the week from Sunday-started to Monday-started, the same as in v2.4.

### Why are the changes needed?

#### cases
```sql
spark-sql> select to_timestamp('2020-1-1', 'YYYY-w-u');
2019-12-29 00:00:00
spark-sql> set spark.sql.legacy.timeParserPolicy=legacy;
spark.sql.legacy.timeParserPolicy	legacy
spark-sql> select to_timestamp('2020-1-1', 'YYYY-w-u');
2019-12-30 00:00:00
```

#### reasons

These week-based fields need a Locale to express their semantics; the first day of the week varies from country to country.

From the Java doc of WeekFields
```java
    /**
     * Gets the first day-of-week.
     * <p>
     * The first day-of-week varies by culture.
     * For example, the US uses Sunday, while France and the ISO-8601 standard use Monday.
     * This method returns the first day using the standard {@code DayOfWeek} enum.
     *
     * @return the first day-of-week, not null
     */
    public DayOfWeek getFirstDayOfWeek() {
        return firstDayOfWeek;
    }
```

But for the SimpleDateFormat, the day-of-week is not localized

```
u	Day number of week (1 = Monday, ..., 7 = Sunday)	Number	1
```

Currently, the default locale we use is the US, so the result moves a day backward.

For other countries, please refer to [First Day of the Week in Different Countries](http://chartsbin.com/view/41671)

With this change, it restores the first-day-of-week calculation for these functions when using the default locale.

### Does this PR introduce _any_ user-facing change?

Yes, but the behavior change restores the old behavior of v2.4.

### How was this patch tested?

add unit tests

Closes #28692 from yaooqinn/SPARK-31879.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-06-03 06:07:53 +00:00
HyukjinKwon baafd4386c Revert "[SPARK-31765][WEBUI] Upgrade HtmlUnit >= 2.37.0"
This reverts commit e5c3463910.
2020-06-03 14:15:30 +09:00
HyukjinKwon e1d5201140 [SPARK-31895][PYTHON][SQL] Support DataFrame.explain(extended: str) case to be consistent with Scala side
### What changes were proposed in this pull request?

Scala:

```scala
scala> spark.range(10).explain("cost")
```
```
== Optimized Logical Plan ==
Range (0, 10, step=1, splits=Some(12)), Statistics(sizeInBytes=80.0 B)

== Physical Plan ==
*(1) Range (0, 10, step=1, splits=12)
```

PySpark:

```python
>>> spark.range(10).explain("cost")
```
```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../spark/python/pyspark/sql/dataframe.py", line 333, in explain
    raise TypeError(err_msg)
TypeError: extended (optional) should be provided as bool, got <class 'str'>
```

In addition, it is consistent with other code too; for example, `DataFrame.sample` also supports both `DataFrame.sample(1.0)` and `DataFrame.sample(False)`.

### Why are the changes needed?

To provide the consistent API support across APIs.

### Does this PR introduce _any_ user-facing change?

Nope, it's only a change in unreleased branches.
If this lands in master only, then yes, users will be able to set the `mode` via `df.explain("...")` in Spark 3.1.

After this PR:

```python
>>> spark.range(10).explain("cost")
```
```
== Optimized Logical Plan ==
Range (0, 10, step=1, splits=Some(12)), Statistics(sizeInBytes=80.0 B)

== Physical Plan ==
*(1) Range (0, 10, step=1, splits=12)
```

### How was this patch tested?

Unittest was added and manually tested as well to make sure:

```python
spark.range(10).explain(True)
spark.range(10).explain(False)
spark.range(10).explain("cost")
spark.range(10).explain(extended="cost")
spark.range(10).explain(mode="cost")
spark.range(10).explain()
spark.range(10).explain(True, "cost")
spark.range(10).explain(1.0)
```

Closes #28711 from HyukjinKwon/SPARK-31895.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-06-03 12:07:05 +09:00
Pablo Langa e4db3b5b17 [SPARK-29431][WEBUI] Improve Web UI / Sql tab visualization with cached dataframes
### What changes were proposed in this pull request?
With this pull request I want to improve the Web UI / SQL tab visualization. The principal problem that I found is that when you have a cache in your plan, the SQL visualization doesn't show any information about the part of the plan that has been cached.

Before the change
![image](https://user-images.githubusercontent.com/12819544/66587418-aa7f6280-eb8a-11e9-80cf-bf10d6c0abea.png)
After the change
![image](https://user-images.githubusercontent.com/12819544/66587526-ddc1f180-eb8a-11e9-92de-c3b3f5657b66.png)

### Why are the changes needed?
When we have a SQL plan with cached dataframes, we lose the graphical information of those dataframes in the SQL tab.
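
A hedged repro sketch (not taken from the PR) of a query whose plan contains a cached relation, so the SQL tab has a cached subtree to render:

```scala
val cached = spark.range(100).selectExpr("id", "id % 10 AS bucket").cache()
cached.count()                              // materialize the cache
cached.groupBy("bucket").count().collect()  // this query's plan includes the cached relation
// Before this change, the cached part of the plan showed no detail in the SQL tab;
// after it, the cached subtree is visualized as well.
```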

### Does this PR introduce any user-facing change?
Yes, in the SQL tab.

### How was this patch tested?
Unit testing and manual tests through the Spark shell.

Closes #26082 from planga82/feature/SPARK-29431_SQL_Cache_webUI.

Lead-authored-by: Pablo Langa <soypab@gmail.com>
Co-authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Co-authored-by: Unknown <soypab@gmail.com>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
2020-06-02 17:26:43 -07:00
Kousuke Saruta 271eb26c02 [SPARK-31882][WEBUI] DAG-viz is not rendered correctly with pagination
### What changes were proposed in this pull request?

This PR fixes an issue related to the DAG-viz.
Because the DAG-viz for a job fetches link URLs for each stage from the stage table, rendering can fail with pagination.
You can reproduce this issue with the following operation.
```
sc.parallelize(1 to 10).map(value => (value ,value)).repartition(1).repartition(1).repartition(1).reduceByKey(_ + _).collect
```
And then, visit the corresponding job page.
There are 5 stages, so show fewer than 5 stages per page in the paged table.
<img width="1440" alt="dag-rendering-issue1" src="https://user-images.githubusercontent.com/4736016/83376286-c29f3d00-a40c-11ea-891b-eb8f42afbb27.png">
<img width="1439" alt="dag-rendering-issue2" src="https://user-images.githubusercontent.com/4736016/83376288-c3d06a00-a40c-11ea-8bb2-38542e5010c1.png">

### Why are the changes needed?

This is a bug.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Newly added test case with the following command.
`build/sbt -Dtest.default.exclude.tags= -Dspark.test.webdriver.chrome.driver=/path/to/chromedriver "testOnly org.apache.spark.ui.ChromeUISeleniumSuite -- -z SPARK-31882"`

Closes #28690 from sarutak/fix-dag-rendering-issue.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
2020-06-02 16:45:16 -07:00
William Hyun 367d94a30d [SPARK-31876][BUILD] Upgrade to Zstd 1.4.5
### What changes were proposed in this pull request?

This PR aims to upgrade to Zstd 1.4.5.

### Why are the changes needed?

Zstd 1.4.5 improves performance.

https://github.com/facebook/zstd/releases/tag/v1.4.5

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Passed the Jenkins.

Closes #28682 from williamhyun/zstd.

Authored-by: William Hyun <williamhyun3@gmail.com>
Signed-off-by: DB Tsai <d_tsai@apple.com>
2020-06-02 21:58:33 +00:00
Eren Avsarogullari 979593d708 [SPARK-31566][SQL][DOCS] Add SQL Rest API Documentation
### What changes were proposed in this pull request?
The SQL REST API exposes query execution details and metrics as a public API. Its documentation will be useful for end users.

### Why are the changes needed?
Documentation for the SQL REST API does not yet exist under the Spark REST API docs.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Manually built and checked.

Closes #28354 from erenavsarogullari/SPARK-31566.

Lead-authored-by: Eren Avsarogullari <eren.avsarogullari@gmail.com>
Co-authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Co-authored-by: Eren Avsarogullari <erenavsarogullari@gmail.com>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
2020-06-02 12:46:12 -07:00
Holden Karau 25702281dc [SPARK-31778][K8S][BUILD] Support cross-building docker images
### What changes were proposed in this pull request?

Add cross-build support to our Docker image build script using the new Docker buildx extension.

### Why are the changes needed?

We have a CI for Spark on ARM, we should support building images for ARM and AMD64.

### Does this PR introduce _any_ user-facing change?

Yes, a new flag is added to the docker image build script to cross-build

### How was this patch tested?
Manually ran build script & pushed to https://hub.docker.com/repository/registry-1.docker.io/holdenk/spark/tags?page=1 verified amd64 & arm64 listed.

Closes #28615 from holdenk/cross-build.

Lead-authored-by: Holden Karau <hkarau@apple.com>
Co-authored-by: Holden Karau <holden@pigscanfly.ca>
Signed-off-by: Holden Karau <hkarau@apple.com>
2020-06-02 11:11:23 -07:00
Holden Karau 69ba9b662e [SPARK-31860][BUILD] only push release tags on succes
### What changes were proposed in this pull request?

Only push the release tag after the build has finished.

### Why are the changes needed?

If the build fails we don't need a release tag.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Running locally with a fake user in https://github.com/apache/spark/pull/28667

Closes #28700 from holdenk/SPARK-31860-build-master-only-push-tags-on-success.

Authored-by: Holden Karau <hkarau@apple.com>
Signed-off-by: Holden Karau <hkarau@apple.com>
2020-06-02 11:04:07 -07:00
Kousuke Saruta e5c3463910 [SPARK-31765][WEBUI] Upgrade HtmlUnit >= 2.37.0
### What changes were proposed in this pull request?

This PR upgrades HtmlUnit.
Selenium and Jetty are also upgraded because of dependencies.
### Why are the changes needed?

Recently, a security issue which affects HtmlUnit was reported:
https://nvd.nist.gov/vuln/detail/CVE-2020-5529
According to the report, arbitrary code can be run by malicious users.
HtmlUnit is used for tests, so the impact might not be large, but it's better to upgrade it just in case.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing testcases.

Closes #28585 from sarutak/upgrade-htmlunit.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-06-02 08:29:07 -05:00
lipzhu d79a8a88b1 [SPARK-31834][SQL] Improve error message for incompatible data types
### What changes were proposed in this pull request?
We should use `dataType.catalogString` to unify the data type mismatch message.
Before:
```sql
spark-sql> create table SPARK_31834(a int) using parquet;
spark-sql> insert into SPARK_31834 select '1';
Error in query: Cannot write incompatible data to table '`default`.`spark_31834`':
- Cannot safely cast 'a': StringType to IntegerType;
```

After:
```sql
spark-sql> create table SPARK_31834(a int) using parquet;
spark-sql> insert into SPARK_31834 select '1';
Error in query: Cannot write incompatible data to table '`default`.`spark_31834`':
- Cannot safely cast 'a': string to int;
```

### How was this patch tested?
UT.

Closes #28654 from lipzhu/SPARK-31834.

Authored-by: lipzhu <lipzhu@ebay.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-06-02 21:07:10 +09:00
Max Gekk 00b355b97b [SPARK-31888][SQL] Support java.time.Instant in Parquet filter pushdown
### What changes were proposed in this pull request?
1. Modified `ParquetFilters.valueCanMakeFilterOn()` to accept filters with `java.time.Instant` attributes.
2. Added `ParquetFilters.timestampToMicros()` to support both types `java.sql.Timestamp` and `java.time.Instant` in conversions to microseconds.
3. Re-used `timestampToMicros` in constructing of Parquet filters.

### Why are the changes needed?
To support pushed-down filters with `java.time.Instant` attributes. Before the changes, such filters were not pushed down to the Parquet datasource when `spark.sql.datetime.java8API.enabled` is `true`.
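
A hedged illustration of the scenario described above (the path and column name are made up for the example):

```scala
import java.time.Instant

spark.conf.set("spark.sql.datetime.java8API.enabled", true)
val df = spark.read.parquet("/tmp/events")              // hypothetical Parquet table with a `ts` column
df.filter($"ts" === Instant.parse("2020-06-01T00:00:00Z"))
  .explain()                                            // PushedFilters should now include the `ts` predicate
```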

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Modified tests to `ParquetFilterSuite` to check the case when Java 8 API is enabled.

Closes #28696 from MaxGekk/support-instant-parquet-filters.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-06-02 11:53:58 +00:00
Sunitha Kambhampati 4161c62429 [SPARK-28067][SQL] Fix incorrect results for decimal aggregate sum by returning null on decimal overflow
### What changes were proposed in this pull request?

JIRA SPARK-28067:  Wrong results are returned for aggregate sum with decimals with whole stage codegen enabled

**Repro:**
WholeStage codegen enabled -> Wrong results
WholeStage codegen disabled -> Returns exception: Decimal precision 39 exceeds max precision 38

**Issues:**
1. Wrong results are returned, which is bad.
2. Inconsistency between whole stage enabled and disabled.

**Cause:**
Sum does not take care of the possibility of overflow in the intermediate steps, i.e. the updateExpressions and mergeExpressions.

This PR makes the following changes:
- Add changes to check if overflow occurs for decimals in aggregate Sum; if there is an overflow, the Sum operation will return null when spark.sql.ansi.enabled is false.
- When spark.sql.ansi.enabled is true, the Sum operation will throw an exception if an overflow occurs for the decimal operation in Sum.
- This keeps it consistent with the behavior defined by the spark.sql.ansi.enabled property.

**Before the fix:  Scenario 1:** - WRONG RESULTS
```
scala> val df = Seq(
     |  (BigDecimal("10000000000000000000"), 1),
     |  (BigDecimal("10000000000000000000"), 1),
     |  (BigDecimal("10000000000000000000"), 2),
     |  (BigDecimal("10000000000000000000"), 2),
     |  (BigDecimal("10000000000000000000"), 2),
     |  (BigDecimal("10000000000000000000"), 2),
     |  (BigDecimal("10000000000000000000"), 2),
     |  (BigDecimal("10000000000000000000"), 2),
     |  (BigDecimal("10000000000000000000"), 2),
     |  (BigDecimal("10000000000000000000"), 2),
     |  (BigDecimal("10000000000000000000"), 2),
     |  (BigDecimal("10000000000000000000"), 2)).toDF("decNum", "intNum")
df: org.apache.spark.sql.DataFrame = [decNum: decimal(38,18), intNum: int]

scala> val df2 = df.withColumnRenamed("decNum", "decNum2").join(df, "intNum").agg(sum("decNum"))
df2: org.apache.spark.sql.DataFrame = [sum(decNum): decimal(38,18)]

scala> df2.show(40,false)
+---------------------------------------+
|sum(decNum)                            |
+---------------------------------------+
|20000000000000000000.000000000000000000|
+---------------------------------------+
```

--
**Before fix: Scenario2:  Setting spark.sql.ansi.enabled to true** - WRONG RESULTS
```
scala> spark.conf.set("spark.sql.ansi.enabled", "true")

scala> val df = Seq(
     |  (BigDecimal("10000000000000000000"), 1),
     |  (BigDecimal("10000000000000000000"), 1),
     |  (BigDecimal("10000000000000000000"), 2),
     |  (BigDecimal("10000000000000000000"), 2),
     |  (BigDecimal("10000000000000000000"), 2),
     |  (BigDecimal("10000000000000000000"), 2),
     |  (BigDecimal("10000000000000000000"), 2),
     |  (BigDecimal("10000000000000000000"), 2),
     |  (BigDecimal("10000000000000000000"), 2),
     |  (BigDecimal("10000000000000000000"), 2),
     |  (BigDecimal("10000000000000000000"), 2),
     |  (BigDecimal("10000000000000000000"), 2)).toDF("decNum", "intNum")
df: org.apache.spark.sql.DataFrame = [decNum: decimal(38,18), intNum: int]

scala> val df2 = df.withColumnRenamed("decNum", "decNum2").join(df, "intNum").agg(sum("decNum"))
df2: org.apache.spark.sql.DataFrame = [sum(decNum): decimal(38,18)]

scala> df2.show(40,false)
+---------------------------------------+
|sum(decNum)                            |
+---------------------------------------+
|20000000000000000000.000000000000000000|
+---------------------------------------+

```

**After the fix: Scenario1:**
```
scala> val df = Seq(
     |  (BigDecimal("10000000000000000000"), 1),
     |  (BigDecimal("10000000000000000000"), 1),
     |  (BigDecimal("10000000000000000000"), 2),
     |  (BigDecimal("10000000000000000000"), 2),
     |  (BigDecimal("10000000000000000000"), 2),
     |  (BigDecimal("10000000000000000000"), 2),
     |  (BigDecimal("10000000000000000000"), 2),
     |  (BigDecimal("10000000000000000000"), 2),
     |  (BigDecimal("10000000000000000000"), 2),
     |  (BigDecimal("10000000000000000000"), 2),
     |  (BigDecimal("10000000000000000000"), 2),
     |  (BigDecimal("10000000000000000000"), 2)).toDF("decNum", "intNum")
df: org.apache.spark.sql.DataFrame = [decNum: decimal(38,18), intNum: int]

scala> val df2 = df.withColumnRenamed("decNum", "decNum2").join(df, "intNum").agg(sum("decNum"))
df2: org.apache.spark.sql.DataFrame = [sum(decNum): decimal(38,18)]

scala>  df2.show(40,false)
+-----------+
|sum(decNum)|
+-----------+
|null       |
+-----------+

```

**After fix:  Scenario2:  Setting the spark.sql.ansi.enabled to true:**
```
scala> spark.conf.set("spark.sql.ansi.enabled", "true")

scala> val df = Seq(
     |  (BigDecimal("10000000000000000000"), 1),
     |  (BigDecimal("10000000000000000000"), 1),
     |  (BigDecimal("10000000000000000000"), 2),
     |  (BigDecimal("10000000000000000000"), 2),
     |  (BigDecimal("10000000000000000000"), 2),
     |  (BigDecimal("10000000000000000000"), 2),
     |  (BigDecimal("10000000000000000000"), 2),
     |  (BigDecimal("10000000000000000000"), 2),
     |  (BigDecimal("10000000000000000000"), 2),
     |  (BigDecimal("10000000000000000000"), 2),
     |  (BigDecimal("10000000000000000000"), 2),
     |  (BigDecimal("10000000000000000000"), 2)).toDF("decNum", "intNum")
df: org.apache.spark.sql.DataFrame = [decNum: decimal(38,18), intNum: int]

scala> val df2 = df.withColumnRenamed("decNum", "decNum2").join(df, "intNum").agg(sum("decNum"))
df2: org.apache.spark.sql.DataFrame = [sum(decNum): decimal(38,18)]

scala>  df2.show(40,false)
20/02/18 10:57:43 ERROR Executor: Exception in task 5.0 in stage 4.0 (TID 30)
java.lang.ArithmeticException: Decimal(expanded,100000000000000000000.000000000000000000,39,18}) cannot be represented as Decimal(38, 18).

```

### Why are the changes needed?
The changes are needed in order to fix the wrong results that are returned for decimal aggregate sum.

### Does this PR introduce any user-facing change?
Prior to this change, users would see wrong results for an aggregate sum that involved decimal overflow; now they will see null. But if the user sets the spark.sql.ansi.enabled flag to true, they will see an exception instead of incorrect results.

### How was this patch tested?
New test has been added and existing tests for sql, catalyst and hive suites were run ok.

Closes #27627 from skambha/decaggfixwrongresults.

Lead-authored-by: Sunitha Kambhampati <skambha@us.ibm.com>
Co-authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-06-02 11:30:30 +00:00
HyukjinKwon 64cb6f7066 [SPARK-29137][ML][PYTHON][TESTS] Increase the timeout for StreamingLinearRegressionWithTests.test_train_prediction
### What changes were proposed in this pull request?

It increases the timeout for `StreamingLinearRegressionWithTests.test_train_prediction`

```
Traceback (most recent call last):
  File "/home/jenkins/workspace/SparkPullRequestBuilder3/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 503, in test_train_prediction
    self._eventually(condition)
  File "/home/jenkins/workspace/SparkPullRequestBuilder3/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 69, in _eventually
    lastValue = condition()
  File "/home/jenkins/workspace/SparkPullRequestBuilder3/python/pyspark/mllib/tests/test_streaming_algorithms.py", line 498, in condition
    self.assertGreater(errors[1] - errors[-1], 2)
AssertionError: 1.672640157855923 not greater than 2
```

This can likely happen when the PySpark tests run in parallel and become slow.

### Why are the changes needed?

To make the tests less flaky. Seems it's being reported multiple times:

https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123144/consoleFull
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123146/testReport/
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123141/testReport/
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/123142/testReport/

### Does this PR introduce _any_ user-facing change?

No, test-only.

### How was this patch tested?

Jenkins will test it out.

Closes #28701 from HyukjinKwon/SPARK-29137.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-06-01 22:03:47 -07:00
Thomas Graves ff4a97d4b6 [SPARK-29150][CORE] Update RDD API for Stage level scheduling to be public
### What changes were proposed in this pull request?

This PR makes the RDD API for stage level scheduling public. All the other JIRAs for the functionality are in, so now we can make it public for people to use and tag it properly.
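
A hedged usage sketch of the now-public API (the resource name and amounts are example values, not taken from this PR):

```scala
import org.apache.spark.resource.{ExecutorResourceRequests, ResourceProfileBuilder, TaskResourceRequests}

val execReqs = new ExecutorResourceRequests().cores(4).resource("gpu", 1)
val taskReqs = new TaskResourceRequests().resource("gpu", 1)
val profile  = new ResourceProfileBuilder().require(execReqs).require(taskReqs).build()

val rdd = sc.parallelize(1 to 100).withResources(profile)  // stages computing this RDD use the profile
rdd.map(_ * 2).count()
```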

### Why are the changes needed?

To make it usable.

### Does this PR introduce _any_ user-facing change?

Yes, the API is now public.

### How was this patch tested?

Unit tests and manually tested.

Closes #28697 from tgravescs/SPARK-29150.

Lead-authored-by: Thomas Graves <tgraves@apache.org>
Co-authored-by: Thomas Graves <tgraves@nvidia.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-06-02 12:57:15 +09:00
manuzhang 283814a426 [SPARK-31870][SQL][TESTS] Fix "Do not optimize skew join if additional shuffle" test having no skew join
### What changes were proposed in this pull request?
Fix configurations and ensure there is skew join in the test "Do not optimize skew join if additional shuffle".

### Why are the changes needed?
The existing "Do not optimize skew join if additional shuffle" test has no skew join at all.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Fixed existing test.

Closes #28679 from manuzhang/spark-31870.

Authored-by: manuzhang <owenzhang1990@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-06-02 02:00:58 +00:00
HyukjinKwon ea45fc5192 [SPARK-28344][SQL][FOLLOW-UP] Check the ambiguous self-join only if there is a join in the plan
### What changes were proposed in this pull request?

This PR proposes to run the `DetectAmbiguousSelfJoin` check only if there is a `Join` in the plan. Currently, the checking is too strict even for non-join queries.

For example, the code below doesn't have a join at all but it fails the ambiguous self-join check:

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum
val df = Seq(1, 1, 2, 2).toDF("A")
val w = Window.partitionBy(df("A"))
df.select(df("A").alias("X"), sum(df("A")).over(w)).explain(true)
```

It is because `ExtractWindowExpressions` can create an `AttributeReference` with the same metadata but a different expression ID, see:

0fd98abd85/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala (L2679)
71c73d58f6/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala (L63)
5945d46c11/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/namedExpressions.scala (L180)

Before:

```
'Project [A#19 AS X#21, sum(A#19) windowspecdefinition(A#19, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS sum(A) OVER (PARTITION BY A unspecifiedframe$())#23L]
+- Relation[A#19] parquet
```

After:

```
Project [X#21, sum(A) OVER (PARTITION BY A unspecifiedframe$())#23L]
+- Project [X#21, A#19, sum(A) OVER (PARTITION BY A unspecifiedframe$())#23L, sum(A) OVER (PARTITION BY A unspecifiedframe$())#23L]
   +- Window [sum(A#19) windowspecdefinition(A#19, specifiedwindowframe(RowFrame, unboundedpreceding$(), unboundedfollowing$())) AS sum(A) OVER (PARTITION BY A unspecifiedframe$())#23L], [A#19]
      +- Project [A#19 AS X#21, A#19]
         +- Relation[A#19] parquet
```

`X#21` holds the same metadata (DataFrame ID and column position) as `A#19`, but it has a different expression ID, which makes the check fail.

### Why are the changes needed?

To loosen the check and avoid surprising users.

### Does this PR introduce _any_ user-facing change?

It's the changes in unreleased branches only.

### How was this patch tested?

Manually tested and unittest was added.

Closes #28695 from HyukjinKwon/SPARK-28344-followup.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-06-01 16:31:39 -07:00
Holden Karau ab9e5a2fe9 [SPARK-31889][BUILD] Docker release script does not allocate enough memory to reliably publish
### What changes were proposed in this pull request?
Allow overriding the zinc options in the Docker release and set a higher value so the publish step can succeed consistently.

### Why are the changes needed?

The publish step experiences memory pressure.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?
Running the test locally with a fake user to see if the publish step (besides the svn part) succeeds.

Closes #28698 from holdenk/SPARK-31889-docker-release-script-does-not-allocate-enough-memory-to-reliably-publish.

Authored-by: Holden Karau <hkarau@apple.com>
Signed-off-by: Holden Karau <hkarau@apple.com>
2020-06-01 15:49:17 -07:00
Dongjoon Hyun 17586f9ed2 [SPARK-31881][K8S][TESTS] Support Hadoop 3.2 K8s integration tests
### What changes were proposed in this pull request?

This PR aims to support Hadoop 3.2 K8s integration tests.

### Why are the changes needed?

Currently, the K8s integration suite assumes Hadoop 2.7 and has hard-coded parts.

### Does this PR introduce _any_ user-facing change?

No. This is a dev-only change.

### How was this patch tested?

Pass the Jenkins K8s IT (with Hadoop 2.7) and do the manual testing for Hadoop 3.2 as described in `README.md`.

```
./dev/dev-run-integration-tests.sh --hadoop-profile hadoop-3.2
```

I verified this manually like the following.
```
$ resource-managers/kubernetes/integration-tests/dev/dev-run-integration-tests.sh \
--spark-tgz .../spark-3.1.0-SNAPSHOT-bin-3.2.0.tgz \
--exclude-tags r \
--hadoop-profile hadoop-3.2
...
KubernetesSuite:
- Run SparkPi with no resources
- Run SparkPi with a very long application name.
- Use SparkLauncher.NO_RESOURCE
- Run SparkPi with a master URL without a scheme.
- Run SparkPi with an argument.
- Run SparkPi with custom labels, annotations, and environment variables.
- All pods have the same service account by default
- Run extraJVMOptions check on driver
- Run SparkRemoteFileTest using a remote data file
- Run SparkPi with env and mount secrets.
- Run PySpark on simple pi.py example
- Run PySpark with Python2 to test a pyfiles example
- Run PySpark with Python3 to test a pyfiles example
- Run PySpark with memory customization
- Run in client mode.
- Start pod creation from template
- PVs with local storage
- Launcher client dependencies
- Test basic decommissioning
Run completed in 8 minutes, 49 seconds.
Total number of tests run: 19
Suites: completed 2, aborted 0
Tests: succeeded 19, failed 0, canceled 0, ignored 0, pending 0
All tests passed.
```

Closes #28689 from dongjoon-hyun/SPARK-31881.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-06-01 11:19:42 -07:00
Max Gekk 9c0dc28a6c [SPARK-31885][SQL] Fix filter push down for old millis timestamps to Parquet
### What changes were proposed in this pull request?
Fixed conversions of `java.sql.Timestamp` to milliseconds in `ParquetFilter` by using existing functions from `DateTimeUtils` `fromJavaTimestamp()` and `microsToMillis()`.

### Why are the changes needed?
The changes fix the bug:
```scala
scala> spark.conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MILLIS")
scala> spark.conf.set("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "CORRECTED")
scala> Seq(java.sql.Timestamp.valueOf("1000-06-14 08:28:53.123")).toDF("ts").write.mode("overwrite").parquet("/Users/maximgekk/tmp/ts_millis_old_filter")
scala> spark.read.parquet("/Users/maximgekk/tmp/ts_millis_old_filter").filter($"ts" === "1000-06-14 08:28:53.123").show(false)
+---+
|ts |
+---+
+---+
```

### Does this PR introduce _any_ user-facing change?
Yes, after the changes (for the example above):
```scala
scala> spark.read.parquet("/Users/maximgekk/tmp/ts_millis_old_filter").filter($"ts" === "1000-06-14 08:28:53.123").show(false)
+-----------------------+
|ts                     |
+-----------------------+
|1000-06-14 08:28:53.123|
+-----------------------+
```

### How was this patch tested?
Modified tests in `ParquetFilterSuite` to check old timestamps.

Closes #28693 from MaxGekk/parquet-ts-millis-filter.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-06-01 15:13:44 +00:00
Kousuke Saruta 6a895d0771 [SPARK-31804][WEBUI] Add real headless browser support for HistoryServer tests
### What changes were proposed in this pull request?

This PR adds two things.

1. Real headless browser support for HistoryServer tests.
2. A test suite using headless Chrome as one instance of those browsers.

### Why are the changes needed?

The motivation is same as #28578 .
In the current master, there is a test case for HistoryServer which uses Ajax, so we need this support for HistoryServer tests as well.

Also this change is necessary to upgrade HtmlUnit (See #28585)

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

I tested with the following patterns. Both Chrome and ChromeDriver need to be installed to run the tests.
1. sbt / with default excluded tags (ChromeUIHistoryServerSuite is expected to be skipped and SQLQueryTestSuite is expected to succeed)
`build/sbt -Dspark.test.webdriver.chrome.driver=/path/to/chromedriver "testOnly org.apache.spark.deploy.history.ChromeUIHistoryServerSuite org.apache.spark.sql.SQLQueryTestSuite"`

2. sbt / overwrite default excluded tags as empty string (Both suites are expected to succeed)
`build/sbt -Dtest.default.exclude.tags= -Dspark.test.webdriver.chrome.driver=/path/to/chromedriver "testOnly org.apache.spark.deploy.history.ChromeUIHistoryServerSuite org.apache.spark.sql.SQLQueryTestSuite"`

3. sbt / set `test.exclude.tags` to `org.apache.spark.tags.ExtendedSQLTest` (Both suites are expected to be skipped)
`build/sbt -Dtest.exclude.tags=org.apache.spark.tags.ExtendedSQLTest -Dspark.test.webdriver.chrome.driver=/path/to/chromedriver "testOnly org.apache.spark.deploy.history.ChromeUIHistoryServerSuite org.apache.spark.sql.SQLQueryTestSuite"`

4. Maven / with default excluded tags (ChromeUIHistoryServerSuite is expected to be skipped and SQLQueryTestSuite is expected to succeed)
`build/mvn -Dspark.test.webdriver.chrome.driver=/path/to/chromedriver -Dtest=none -DwildcardSuites=org.apache.spark.deploy.history.ChromeUIHistoryServerSuite,org.apache.spark.sql.SQLQueryTestSuite test`

5. Maven / overwrite default excluded tags as empty string (Both suites are expected to succeed)
`build/mvn -Dtest.default.exclude.tags= -Dspark.test.webdriver.chrome.driver=/path/to/chromedriver -Dtest=none -DwildcardSuites=org.apache.spark.deploy.history.ChromeUIHistoryServerSuite,org.apache.spark.sql.SQLQueryTestSuite test`

6. Maven / set `test.exclude.tags` to `org.apache.spark.tags.ExtendedSQLTest` (Both suites are expected to be skipped)
`build/mvn -Dtest.exclude.tags=org.apache.spark.tags.ExtendedSQLTest  -Dspark.test.webdriver.chrome.driver=/path/to/chromedriver -Dtest=none -DwildcardSuites=org.apache.spark.deploy.history.ChromeUIHistoryServerSuite,org.apache.spark.sql.SQLQueryTestSuite test`

Closes #28622 from sarutak/headless-browser-support-for-historyserver.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-06-01 10:00:10 -05:00
Yuexin Zhang e70df2cea4 [SPARK-29683][YARN] False report isAllNodeBlacklisted when RM is having issue
### What changes were proposed in this pull request?

Improve the check logic on whether all node managers are really being blacklisted.

### Why are the changes needed?

I observed that when the AM is out of sync with the ResourceManager, or the RM has an issue reporting back the current number of available NMs, something like the following happens:
...
20/05/13 09:01:21 INFO RetryInvocationHandler: java.io.EOFException: End of File Exception between local host is: "client.zyx.com/x.x.x.124"; destination host is: "rm.zyx.com":8030; : java.io.EOFException; For more details see:  http://wiki.apache.org/hadoop/EOFException, while invoking ApplicationMasterProtocolPBClientImpl.allocate over rm543. Trying to failover immediately.
...
20/05/13 09:01:28 WARN AMRMClientImpl: ApplicationMaster is out of sync with ResourceManager, hence resyncing.
...

then the Spark job suddenly runs into the AllNodeBlacklisted state:
...
20/05/13 09:01:31 INFO ApplicationMaster: Final app status: FAILED, exitCode: 11, (reason: Due to executor failures all available nodes are blacklisted)
...

but actually there are no blacklisted nodes in currentBlacklistedYarnNodes, and I do not see any blacklisting message from:
https://github.com/apache/spark/blob/master/resource-managers/yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocatorBlacklistTracker.scala#L119

We should only return `isAllNodeBlacklisted = true` when `numClusterNodes > 0` AND `currentBlacklistedYarnNodes.size >= numClusterNodes`.
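
In code form, a minimal sketch of the intended check (hypothetical method, not the exact `YarnAllocatorBlacklistTracker` implementation):

```scala
// Only report that all nodes are blacklisted when the cluster size is actually known (> 0)
// and the blacklist covers the whole cluster; a transient numClusterNodes == 0 from a flaky
// RM connection should not fail the application.
def isAllNodeBlacklisted(numClusterNodes: Int, currentBlacklistedYarnNodes: Set[String]): Boolean =
  numClusterNodes > 0 && currentBlacklistedYarnNodes.size >= numClusterNodes
```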

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

A minor change. No changes to tests.

Closes #28606 from cnZach/false_AllNodeBlacklisted_when_RM_is_having_issue.

Authored-by: Yuexin Zhang <zach.yx.zhang@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-06-01 09:46:18 -05:00
yi.wu bc24c99af4 [SPARK-31837][CORE] Shift to the new highest locality level if there is when recomputeLocality
### What changes were proposed in this pull request?

This PR proposes to shift to the new most local locality level if any new, more local locality levels are added during `TaskSetManager.recomputeLocality`.

### Why are the changes needed?

There's a race condition between `resourceOffers` and `submitTasks`. If `submitTasks` happens before `resourceOffers`, especially when there are no executors added to `TaskSchedulerImpl` at all, the `TaskSetManager`'s `myLocalityLevels` will only have the ANY locality level, see:

8b4862953a/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala (L218)

Then, `resourceOffers` is called with new executors added to `TaskSchedulerImpl`, and `recomputeLocality` is called because of `executorAdded`. During `recomputeLocality`, the `TaskSetManager`'s `myLocalityLevels` might have PROCESS_LOCAL, NODE_LOCAL, ANY (because at this time we can find alive executors from `TaskSchedulerImpl`). But the `TaskSetManager` will stick to the previous locality level, which is ANY, see:

8b4862953a/core/src/main/scala/org/apache/spark/scheduler/TaskSetManager.scala (L1107-L1114)

As a result, in the first round of `resourceOffers`, the new version of delay scheduling won't take effect, and `TaskSetManager` can only schedule tasks at the ANY level.

Please note that the problem also exists in the old version of delay scheduling, but the impact is minor because we always reset the locality level after successfully launching a task, a reset that no longer happens in the new version of delay scheduling.
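
To make the intended behavior concrete, here is a self-contained sketch (toy types and names, not the actual `TaskSetManager` code) of shifting to the most local level after recomputation:

```scala
// Toy model of locality levels; lower id = more local, mirroring PROCESS_LOCAL < ... < ANY.
object LocalityShiftSketch {
  object Locality extends Enumeration {
    val PROCESS_LOCAL, NODE_LOCAL, RACK_LOCAL, ANY = Value
  }
  import Locality._

  // After the valid levels are recomputed, start from the new most local level if one more
  // local than the current level appeared; otherwise keep pointing at the current level.
  def newLocalityIndex(currentLevel: Locality.Value, recomputedLevels: Seq[Locality.Value]): Int =
    if (recomputedLevels.head < currentLevel) 0 else recomputedLevels.indexOf(currentLevel)

  def main(args: Array[String]): Unit = {
    // Before any executor registered we only had ANY; after executorAdded more levels exist.
    println(newLocalityIndex(ANY, Seq(PROCESS_LOCAL, NODE_LOCAL, ANY))) // 0, i.e. PROCESS_LOCAL
  }
}
```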

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Updated and added new test.

Closes #28656 from Ngone51/impr_recompute.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-06-01 13:06:18 +00:00
Takeshi Yamamuro b806fc4582 [SPARK-31854][SQL] Invoke in MapElementsExec should not propagate null
### What changes were proposed in this pull request?

This PR intends to fix a bug in `Dataset.map` below when whole-stage codegen is enabled:
```
scala> val ds = Seq(1.asInstanceOf[Integer], null.asInstanceOf[Integer]).toDS()

scala> sql("SET spark.sql.codegen.wholeStage=true")

scala> ds.map(v=>(v,v)).explain
== Physical Plan ==
*(1) SerializeFromObject [assertnotnull(input[0, scala.Tuple2, true])._1.intValue AS _1#69, assertnotnull(input[0, scala.Tuple2, true])._2.intValue AS _2#70]
+- *(1) MapElements <function1>, obj#68: scala.Tuple2
   +- *(1) DeserializeToObject staticinvoke(class java.lang.Integer, ObjectType(class java.lang.Integer), valueOf, value#1, true, false), obj#67: java.lang.Integer
      +- LocalTableScan [value#1]

// `AssertNotNull` in `SerializeFromObject` will fail;
scala> ds.map(v => (v, v)).show()
java.lang.NullPointerException: Null value appeared in non-nullable fails:
top level Product input object
If the schema is inferred from a Scala tuple/case class, or a Java bean, please try to use scala.Option[_] or other nullable types (e.g. java.lang.Integer instead of int/scala.Int).

// When the whole-stage codegen disabled, the query works well;
scala> sql("SET spark.sql.codegen.wholeStage=false")
scala> ds.map(v=>(v,v)).show()
+----+----+
|  _1|  _2|
+----+----+
|   1|   1|
|null|null|
+----+----+
```
The root cause is that the `Invoke` used in `MapElementsExec` propagates an input null, and then [AssertNotNull](1b780f364b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/encoders/ExpressionEncoder.scala (L253-L255)) in `SerializeFromObject` fails because the top-level row becomes null. So, `MapElementsExec` should not return `null` but `(null, null)`.
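
In plain Scala terms (an illustration of the desired semantics, not the catalyst fix itself), mapping a null element must yield a non-null tuple of nulls:

```scala
// The lambda passed to Dataset.map above; applying it to a null Integer should give a tuple
// that wraps nulls, not a null tuple, so the top-level non-null check can still pass.
val f: Integer => (Integer, Integer) = v => (v, v)
val result = f(null)
println(result)        // (null,null)
assert(result != null) // the top-level object itself is not null
```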

NOTE: the generated code for the query above in the current master:
```
/* 033 */   private void mapelements_doConsume_0(java.lang.Integer mapelements_expr_0_0, boolean mapelements_exprIsNull_0_0) throws java.io.IOException {
/* 034 */     boolean mapelements_isNull_1 = true;
/* 035 */     scala.Tuple2 mapelements_value_1 = null;
/* 036 */     if (!false) {
/* 037 */       mapelements_resultIsNull_0 = false;
/* 038 */
/* 039 */       if (!mapelements_resultIsNull_0) {
/* 040 */         mapelements_resultIsNull_0 = mapelements_exprIsNull_0_0;
/* 041 */         mapelements_mutableStateArray_0[0] = mapelements_expr_0_0;
/* 042 */       }
/* 043 */
/* 044 */       mapelements_isNull_1 = mapelements_resultIsNull_0;
/* 045 */       if (!mapelements_isNull_1) {
/* 046 */         Object mapelements_funcResult_0 = null;
/* 047 */         mapelements_funcResult_0 = ((scala.Function1) references[1] /* literal */).apply(mapelements_mutableStateArray_0[0]);
/* 048 */
/* 049 */         if (mapelements_funcResult_0 != null) {
/* 050 */           mapelements_value_1 = (scala.Tuple2) mapelements_funcResult_0;
/* 051 */         } else {
/* 052 */           mapelements_isNull_1 = true;
/* 053 */         }
/* 054 */
/* 055 */       }
/* 056 */     }
/* 057 */
/* 058 */     serializefromobject_doConsume_0(mapelements_value_1, mapelements_isNull_1);
/* 059 */
/* 060 */   }
```

The generated code with this fix:
```
/* 032 */   private void mapelements_doConsume_0(java.lang.Integer mapelements_expr_0_0, boolean mapelements_exprIsNull_0_0) throws java.io.IOException {
/* 033 */     boolean mapelements_isNull_1 = true;
/* 034 */     scala.Tuple2 mapelements_value_1 = null;
/* 035 */     if (!false) {
/* 036 */       mapelements_mutableStateArray_0[0] = mapelements_expr_0_0;
/* 037 */
/* 038 */       mapelements_isNull_1 = false;
/* 039 */       if (!mapelements_isNull_1) {
/* 040 */         Object mapelements_funcResult_0 = null;
/* 041 */         mapelements_funcResult_0 = ((scala.Function1) references[1] /* literal */).apply(mapelements_mutableStateArray_0[0]);
/* 042 */
/* 043 */         if (mapelements_funcResult_0 != null) {
/* 044 */           mapelements_value_1 = (scala.Tuple2) mapelements_funcResult_0;
/* 045 */           mapelements_isNull_1 = false;
/* 046 */         } else {
/* 047 */           mapelements_isNull_1 = true;
/* 048 */         }
/* 049 */
/* 050 */       }
/* 051 */     }
/* 052 */
/* 053 */     serializefromobject_doConsume_0(mapelements_value_1, mapelements_isNull_1);
/* 054 */
/* 055 */   }
```

### Why are the changes needed?

Bugfix.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added tests.

Closes #28681 from maropu/SPARK-31854.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-06-01 04:50:00 +00:00
HyukjinKwon e69466056f [SPARK-31849][PYTHON][SQL] Make PySpark SQL exceptions more Pythonic
### What changes were proposed in this pull request?

This PR proposes to make PySpark exceptions more Pythonic by hiding the JVM stacktrace by default. The JVM stacktrace can be re-enabled by turning on the `spark.sql.pyspark.jvmStacktrace.enabled` configuration.

```
Traceback (most recent call last):
  ...
pyspark.sql.utils.PythonException:
  An exception was thrown from Python worker in the executor. The below is the Python worker stacktrace.
Traceback (most recent call last):
  ...
```

If this `spark.sql.pyspark.jvmStacktrace.enabled` is enabled, it appends:

```
JVM stacktrace:
org.apache.spark.Exception: ...
  ...
```

For example, the code below:

```python
from pyspark.sql.functions import udf
@udf
def divide_by_zero(v):
    raise v / 0

spark.range(1).select(divide_by_zero("id")).show()
```

will show an error message that looks like a Python exception thrown locally.

<details>
<summary>Python exception message when <code>spark.sql.pyspark.jvmStacktrace.enabled</code> is off (default)</summary>

```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../spark/python/pyspark/sql/dataframe.py", line 427, in show
    print(self._jdf.showString(n, 20, vertical))
  File "/.../spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
  File "/.../spark/python/pyspark/sql/utils.py", line 131, in deco
    raise_from(converted)
  File "<string>", line 3, in raise_from
pyspark.sql.utils.PythonException:
  An exception was thrown from Python worker in the executor. The below is the Python worker stacktrace.
Traceback (most recent call last):
  File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 605, in main
    process()
  File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 597, in process
    serializer.dump_stream(out_iter, outfile)
  File "/.../spark/python/lib/pyspark.zip/pyspark/serializers.py", line 223, in dump_stream
    self.serializer.dump_stream(self._batched(iterator), stream)
  File "/.../spark/python/lib/pyspark.zip/pyspark/serializers.py", line 141, in dump_stream
    for obj in iterator:
  File "/.../spark/python/lib/pyspark.zip/pyspark/serializers.py", line 212, in _batched
    for item in iterator:
  File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 450, in mapper
    result = tuple(f(*[a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs)
  File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 450, in <genexpr>
    result = tuple(f(*[a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs)
  File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 90, in <lambda>
    return lambda *a: f(*a)
  File "/.../spark/python/lib/pyspark.zip/pyspark/util.py", line 107, in wrapper
    return f(*args, **kwargs)
  File "<stdin>", line 3, in divide_by_zero
ZeroDivisionError: division by zero
```

</details>

<details>
<summary>Python exception message when <code>spark.sql.pyspark.jvmStacktrace.enabled</code> is on</summary>

```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../spark/python/pyspark/sql/dataframe.py", line 427, in show
    print(self._jdf.showString(n, 20, vertical))
  File "/.../spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
  File "/.../spark/python/pyspark/sql/utils.py", line 137, in deco
    raise_from(converted)
  File "<string>", line 3, in raise_from
pyspark.sql.utils.PythonException:
  An exception was thrown from Python worker in the executor. The below is the Python worker stacktrace.
Traceback (most recent call last):
  File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 605, in main
    process()
  File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 597, in process
    serializer.dump_stream(out_iter, outfile)
  File "/.../spark/python/lib/pyspark.zip/pyspark/serializers.py", line 223, in dump_stream
    self.serializer.dump_stream(self._batched(iterator), stream)
  File "/.../spark/python/lib/pyspark.zip/pyspark/serializers.py", line 141, in dump_stream
    for obj in iterator:
  File "/.../spark/python/lib/pyspark.zip/pyspark/serializers.py", line 212, in _batched
    for item in iterator:
  File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 450, in mapper
    result = tuple(f(*[a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs)
  File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 450, in <genexpr>
    result = tuple(f(*[a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs)
  File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 90, in <lambda>
    return lambda *a: f(*a)
  File "/.../spark/python/lib/pyspark.zip/pyspark/util.py", line 107, in wrapper
    return f(*args, **kwargs)
  File "<stdin>", line 3, in divide_by_zero
ZeroDivisionError: division by zero

JVM stacktrace:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 4, 192.168.35.193, executor 0): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 605, in main
    process()
  File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 597, in process
    serializer.dump_stream(out_iter, outfile)
  File "/.../spark/python/lib/pyspark.zip/pyspark/serializers.py", line 223, in dump_stream
    self.serializer.dump_stream(self._batched(iterator), stream)
  File "/.../spark/python/lib/pyspark.zip/pyspark/serializers.py", line 141, in dump_stream
    for obj in iterator:
  File "/.../spark/python/lib/pyspark.zip/pyspark/serializers.py", line 212, in _batched
    for item in iterator:
  File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 450, in mapper
    result = tuple(f(*[a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs)
  File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 450, in <genexpr>
    result = tuple(f(*[a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs)
  File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 90, in <lambda>
    return lambda *a: f(*a)
  File "/.../spark/python/lib/pyspark.zip/pyspark/util.py", line 107, in wrapper
    return f(*args, **kwargs)
  File "<stdin>", line 3, in divide_by_zero
ZeroDivisionError: division by zero

	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:516)
	at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$2.read(PythonUDFRunner.scala:81)
	at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$2.read(PythonUDFRunner.scala:64)
	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:469)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:489)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:753)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:340)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:127)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:469)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:472)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2117)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2066)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2065)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2065)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1021)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1021)
	at scala.Option.foreach(Option.scala:407)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1021)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2297)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2246)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2235)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:823)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2108)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2129)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2148)
	at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:467)
	at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:420)
	at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:47)
	at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3653)
	at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2695)
	at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3644)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:763)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
	at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3642)
	at org.apache.spark.sql.Dataset.head(Dataset.scala:2695)
	at org.apache.spark.sql.Dataset.take(Dataset.scala:2902)
	at org.apache.spark.sql.Dataset.getRows(Dataset.scala:300)
	at org.apache.spark.sql.Dataset.showString(Dataset.scala:337)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 605, in main
    process()
  File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 597, in process
    serializer.dump_stream(out_iter, outfile)
  File "/.../spark/python/lib/pyspark.zip/pyspark/serializers.py", line 223, in dump_stream
    self.serializer.dump_stream(self._batched(iterator), stream)
  File "/.../spark/python/lib/pyspark.zip/pyspark/serializers.py", line 141, in dump_stream
    for obj in iterator:
  File "/.../spark/python/lib/pyspark.zip/pyspark/serializers.py", line 212, in _batched
    for item in iterator:
  File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 450, in mapper
    result = tuple(f(*[a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs)
  File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 450, in <genexpr>
    result = tuple(f(*[a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs)
  File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 90, in <lambda>
    return lambda *a: f(*a)
  File "/.../spark/python/lib/pyspark.zip/pyspark/util.py", line 107, in wrapper
    return f(*args, **kwargs)
  File "<stdin>", line 3, in divide_by_zero
ZeroDivisionError: division by zero

	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:516)
	at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$2.read(PythonUDFRunner.scala:81)
	at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$2.read(PythonUDFRunner.scala:64)
	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:469)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:489)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:753)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:340)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:127)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:469)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:472)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	... 1 more
```

</details>

<details>
<summary>Python exception message without this change</summary>

```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../spark/python/pyspark/sql/dataframe.py", line 427, in show
    print(self._jdf.showString(n, 20, vertical))
  File "/.../spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
  File "/.../spark/python/pyspark/sql/utils.py", line 98, in deco
    return f(*a, **kw)
  File "/.../spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o160.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 10 in stage 5.0 failed 4 times, most recent failure: Lost task 10.3 in stage 5.0 (TID 37, 192.168.35.193, executor 3): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 605, in main
    process()
  File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 597, in process
    serializer.dump_stream(out_iter, outfile)
  File "/.../spark/python/lib/pyspark.zip/pyspark/serializers.py", line 223, in dump_stream
    self.serializer.dump_stream(self._batched(iterator), stream)
  File "/.../spark/python/lib/pyspark.zip/pyspark/serializers.py", line 141, in dump_stream
    for obj in iterator:
  File "/.../spark/python/lib/pyspark.zip/pyspark/serializers.py", line 212, in _batched
    for item in iterator:
  File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 450, in mapper
    result = tuple(f(*[a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs)
  File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 450, in <genexpr>
    result = tuple(f(*[a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs)
  File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 90, in <lambda>
    return lambda *a: f(*a)
  File "/.../spark/python/lib/pyspark.zip/pyspark/util.py", line 107, in wrapper
    return f(*args, **kwargs)
  File "<stdin>", line 3, in divide_by_zero
ZeroDivisionError: division by zero

	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:516)
	at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$2.read(PythonUDFRunner.scala:81)
	at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$2.read(PythonUDFRunner.scala:64)
	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:469)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:489)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:753)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:340)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:127)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:469)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:472)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2117)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2066)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2065)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2065)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1021)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1021)
	at scala.Option.foreach(Option.scala:407)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1021)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2297)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2246)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2235)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:823)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2108)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2129)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2148)
	at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:467)
	at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:420)
	at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:47)
	at org.apache.spark.sql.Dataset.collectFromPlan(Dataset.scala:3653)
	at org.apache.spark.sql.Dataset.$anonfun$head$1(Dataset.scala:2695)
	at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3644)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:763)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
	at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3642)
	at org.apache.spark.sql.Dataset.head(Dataset.scala:2695)
	at org.apache.spark.sql.Dataset.take(Dataset.scala:2902)
	at org.apache.spark.sql.Dataset.getRows(Dataset.scala:300)
	at org.apache.spark.sql.Dataset.showString(Dataset.scala:337)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 605, in main
    process()
  File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 597, in process
    serializer.dump_stream(out_iter, outfile)
  File "/.../spark/python/lib/pyspark.zip/pyspark/serializers.py", line 223, in dump_stream
    self.serializer.dump_stream(self._batched(iterator), stream)
  File "/.../spark/python/lib/pyspark.zip/pyspark/serializers.py", line 141, in dump_stream
    for obj in iterator:
  File "/.../spark/python/lib/pyspark.zip/pyspark/serializers.py", line 212, in _batched
    for item in iterator:
  File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 450, in mapper
    result = tuple(f(*[a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs)
  File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 450, in <genexpr>
    result = tuple(f(*[a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs)
  File "/.../spark/python/lib/pyspark.zip/pyspark/worker.py", line 90, in <lambda>
    return lambda *a: f(*a)
  File "/.../spark/python/lib/pyspark.zip/pyspark/util.py", line 107, in wrapper
    return f(*args, **kwargs)
  File "<stdin>", line 3, in divide_by_zero
ZeroDivisionError: division by zero

	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:516)
	at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$2.read(PythonUDFRunner.scala:81)
	at org.apache.spark.sql.execution.python.PythonUDFRunner$$anon$2.read(PythonUDFRunner.scala:64)
	at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:469)
	at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:489)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
	at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:753)
	at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:340)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:898)
	at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:898)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:127)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:469)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:472)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	... 1 more
```

</details>

<br/>

Another example with Python 3.7:

```python
sql("a")
```

<details>
<summary>Python exception message when <code>spark.sql.pyspark.jvmStacktrace.enabled</code> is off (default)</summary>

```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../spark/python/pyspark/sql/session.py", line 646, in sql
    return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
  File "/.../spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
  File "/.../spark/python/pyspark/sql/utils.py", line 131, in deco
    raise_from(converted)
  File "<string>", line 3, in raise_from
pyspark.sql.utils.ParseException:
mismatched input 'a' expecting {'(', 'ADD', 'ALTER', 'ANALYZE', 'CACHE', 'CLEAR', 'COMMENT', 'COMMIT', 'CREATE', 'DELETE', 'DESC', 'DESCRIBE', 'DFS', 'DROP', 'EXPLAIN', 'EXPORT', 'FROM', 'GRANT', 'IMPORT', 'INSERT', 'LIST', 'LOAD', 'LOCK', 'MAP', 'MERGE', 'MSCK', 'REDUCE', 'REFRESH', 'REPLACE', 'RESET', 'REVOKE', 'ROLLBACK', 'SELECT', 'SET', 'SHOW', 'START', 'TABLE', 'TRUNCATE', 'UNCACHE', 'UNLOCK', 'UPDATE', 'USE', 'VALUES', 'WITH'}(line 1, pos 0)

== SQL ==
a
^^^
```

</details>

<details>
<summary>Python exception message when <code>spark.sql.pyspark.jvmStacktrace.enabled</code> is on</summary>

```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../spark/python/pyspark/sql/session.py", line 646, in sql
    return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
  File "/.../spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
  File "/.../spark/python/pyspark/sql/utils.py", line 131, in deco
    raise_from(converted)
  File "<string>", line 3, in raise_from
pyspark.sql.utils.ParseException:
mismatched input 'a' expecting {'(', 'ADD', 'ALTER', 'ANALYZE', 'CACHE', 'CLEAR', 'COMMENT', 'COMMIT', 'CREATE', 'DELETE', 'DESC', 'DESCRIBE', 'DFS', 'DROP', 'EXPLAIN', 'EXPORT', 'FROM', 'GRANT', 'IMPORT', 'INSERT', 'LIST', 'LOAD', 'LOCK', 'MAP', 'MERGE', 'MSCK', 'REDUCE', 'REFRESH', 'REPLACE', 'RESET', 'REVOKE', 'ROLLBACK', 'SELECT', 'SET', 'SHOW', 'START', 'TABLE', 'TRUNCATE', 'UNCACHE', 'UNLOCK', 'UPDATE', 'USE', 'VALUES', 'WITH'}(line 1, pos 0)

== SQL ==
a
^^^

JVM stacktrace:
org.apache.spark.sql.catalyst.parser.ParseException:
mismatched input 'a' expecting {'(', 'ADD', 'ALTER', 'ANALYZE', 'CACHE', 'CLEAR', 'COMMENT', 'COMMIT', 'CREATE', 'DELETE', 'DESC', 'DESCRIBE', 'DFS', 'DROP', 'EXPLAIN', 'EXPORT', 'FROM', 'GRANT', 'IMPORT', 'INSERT', 'LIST', 'LOAD', 'LOCK', 'MAP', 'MERGE', 'MSCK', 'REDUCE', 'REFRESH', 'REPLACE', 'RESET', 'REVOKE', 'ROLLBACK', 'SELECT', 'SET', 'SHOW', 'START', 'TABLE', 'TRUNCATE', 'UNCACHE', 'UNLOCK', 'UPDATE', 'USE', 'VALUES', 'WITH'}(line 1, pos 0)

== SQL ==
a
^^^

	at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:266)
	at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:133)
	at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:49)
	at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:81)
	at org.apache.spark.sql.SparkSession.$anonfun$sql$2(SparkSession.scala:604)
	at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111)
	at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:604)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:763)
	at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:601)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)
```

</details>

<details>
<summary>Python exception message without this change</summary>

```
Traceback (most recent call last):
  File "/.../spark/python/pyspark/sql/utils.py", line 98, in deco
    return f(*a, **kw)
  File "/.../spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o26.sql.
: org.apache.spark.sql.catalyst.parser.ParseException:
mismatched input 'a' expecting {'(', 'ADD', 'ALTER', 'ANALYZE', 'CACHE', 'CLEAR', 'COMMENT', 'COMMIT', 'CREATE', 'DELETE', 'DESC', 'DESCRIBE', 'DFS', 'DROP', 'EXPLAIN', 'EXPORT', 'FROM', 'GRANT', 'IMPORT', 'INSERT', 'LIST', 'LOAD', 'LOCK', 'MAP', 'MERGE', 'MSCK', 'REDUCE', 'REFRESH', 'REPLACE', 'RESET', 'REVOKE', 'ROLLBACK', 'SELECT', 'SET', 'SHOW', 'START', 'TABLE', 'TRUNCATE', 'UNCACHE', 'UNLOCK', 'UPDATE', 'USE', 'VALUES', 'WITH'}(line 1, pos 0)

== SQL ==
a
^^^

	at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:266)
	at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:133)
	at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:49)
	at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:81)
	at org.apache.spark.sql.SparkSession.$anonfun$sql$2(SparkSession.scala:604)
	at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111)
	at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:604)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:763)
	at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:601)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../spark/python/pyspark/sql/session.py", line 646, in sql
    return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped)
  File "/.../spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
  File "/.../spark/python/pyspark/sql/utils.py", line 102, in deco
    raise converted
pyspark.sql.utils.ParseException:
mismatched input 'a' expecting {'(', 'ADD', 'ALTER', 'ANALYZE', 'CACHE', 'CLEAR', 'COMMENT', 'COMMIT', 'CREATE', 'DELETE', 'DESC', 'DESCRIBE', 'DFS', 'DROP', 'EXPLAIN', 'EXPORT', 'FROM', 'GRANT', 'IMPORT', 'INSERT', 'LIST', 'LOAD', 'LOCK', 'MAP', 'MERGE', 'MSCK', 'REDUCE', 'REFRESH', 'REPLACE', 'RESET', 'REVOKE', 'ROLLBACK', 'SELECT', 'SET', 'SHOW', 'START', 'TABLE', 'TRUNCATE', 'UNCACHE', 'UNLOCK', 'UPDATE', 'USE', 'VALUES', 'WITH'}(line 1, pos 0)

== SQL ==
a
^^^
```

</details>

### Why are the changes needed?

Currently, PySpark exceptions are very unfriendly to Python users, showing a bunch of JVM stacktraces. See "Python exception message without this change" above.

### Does this PR introduce _any_ user-facing change?

Yes, it will change the exception message. See the examples above.

### How was this patch tested?

Manually tested by

```bash
./bin/pyspark --conf spark.sql.pyspark.jvmStacktrace.enabled=true
```

and running the examples above.

Closes #28661 from HyukjinKwon/python-debug.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-06-01 09:45:21 +09:00
HyukjinKwon 29c51d682b [SPARK-31788][CORE][DSTREAM][PYTHON] Recover the support of union for different types of RDD and DStreams
### What changes were proposed in this pull request?

This PR manually specifies the class for the input array being used in `(SparkContext|StreamingContext).union`. It fixes a regression introduced by SPARK-25737.

```python
rdd1 = sc.parallelize([1,2,3,4,5])
rdd2 = sc.parallelize([6,7,8,9,10])
pairRDD1 = rdd1.zip(rdd2)
sc.union([pairRDD1, pairRDD1]).collect()
```

in the current master and `branch-3.0`:

```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../spark/python/pyspark/context.py", line 870, in union
    jrdds[i] = rdds[i]._jrdd
  File "/.../spark/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py", line 238, in __setitem__
  File "/.../spark/python/lib/py4j-0.10.9-src.zip/py4j/java_collections.py", line 221, in __set_item
  File "/.../spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py", line 332, in get_return_value
py4j.protocol.Py4JError: An error occurred while calling None.None. Trace:
py4j.Py4JException: Cannot convert org.apache.spark.api.java.JavaPairRDD to org.apache.spark.api.java.JavaRDD
	at py4j.commands.ArrayCommand.convertArgument(ArrayCommand.java:166)
	at py4j.commands.ArrayCommand.setArray(ArrayCommand.java:144)
	at py4j.commands.ArrayCommand.execute(ArrayCommand.java:97)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)
```

which works in Spark 2.4.5:

```
[(1, 6), (2, 7), (3, 8), (4, 9), (5, 10), (1, 6), (2, 7), (3, 8), (4, 9), (5, 10)]
```

It assumed the class of the input array is always `JavaRDD` or `JavaDStream`; however, it can be different, such as `JavaPairRDD`.

This fix is based on redsanket's initial approach, and will be co-authored.

### Why are the changes needed?

To fix a regression from Spark 2.4.5.

### Does this PR introduce _any_ user-facing change?

No, it's only in unreleased branches. This is to fix a regression.

### How was this patch tested?

Manually tested, and a unit test was added.

Closes #28648 from HyukjinKwon/SPARK-31788.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-06-01 09:43:03 +09:00
Huaxin Gao 45cf5e9950 [SPARK-31840][ML] Add instance weight support in LogisticRegressionSummary
### What changes were proposed in this pull request?
Add instance weight support in LogisticRegressionSummary

### Why are the changes needed?
LogisticRegression, MulticlassClassificationEvaluator and BinaryClassificationEvaluator support instance weight. We should support instance weight in LogisticRegressionSummary too.
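
As a quick illustration (a minimal sketch with made-up data, not code from this PR), the weight column flows into the summary like this:

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.linalg.Vectors
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[1]").appName("weighted-summary-sketch").getOrCreate()
import spark.implicits._

// Tiny weighted dataset: (label, weight, features).
val training = Seq(
  (0.0, 1.0, Vectors.dense(0.0, 1.0)),
  (1.0, 2.0, Vectors.dense(1.0, 0.0)),
  (1.0, 0.5, Vectors.dense(1.5, 0.2))
).toDF("label", "weight", "features")

// Train with a weight column; the training summary metrics then take instance weights into account.
val model = new LogisticRegression().setWeightCol("weight").fit(training)
println(model.summary.accuracy)
```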

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Add new tests

Closes #28657 from huaxingao/weighted_summary.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-05-31 10:24:20 -05:00
Max Gekk 47dc332258 [SPARK-31874][SQL] Use FastDateFormat as the legacy fractional formatter
### What changes were proposed in this pull request?
1. Replace `SimpleDateFormat` by `FastDateFormat` as the legacy formatter of `FractionTimestampFormatter`.
2. Optimise `LegacyFastTimestampFormatter` for `java.sql.Timestamp` w/o fractional part.

### Why are the changes needed?
1. By default, `HiveResult.hiveResultString` retrieves timestamp values as instances of `java.sql.Timestamp` and uses the legacy parser `SimpleDateFormat` to convert the timestamps to strings. After the fix https://github.com/apache/spark/pull/28024, the fractional formatter and its companion legacy formatter `SimpleDateFormat` are created for every value. By switching from `LegacySimpleTimestampFormatter` to `LegacyFastTimestampFormatter`, we can utilize the internal cache of `FastDateFormat` and avoid re-parsing the default pattern `yyyy-MM-dd HH:mm:ss` (see the sketch after this list).
2. The second change in the method `def format(ts: Timestamp): String` of `LegacyFastTimestampFormatter` is needed to optimize the formatter for patterns without the fractional part and avoid conversions to microseconds.
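
As a rough illustration of why this helps (not the Spark formatter code itself), `FastDateFormat` caches its instances per pattern, so repeated lookups are cheap:

```scala
import org.apache.commons.lang3.time.FastDateFormat

// getInstance returns a cached, thread-safe formatter for this pattern, so obtaining it per
// value does not re-compile the pattern the way `new SimpleDateFormat(...)` would.
val fmt = FastDateFormat.getInstance("yyyy-MM-dd HH:mm:ss")
println(fmt.format(java.sql.Timestamp.valueOf("2020-05-31 13:05:00")))
```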

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
By existing tests in `TimestampFormatter`.

Closes #28678 from MaxGekk/fastdateformat-as-legacy-frac-formatter.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-05-31 13:05:00 +00:00
Kent Yao 547c5bf552 [SPARK-31867][SQL] Disable year type datetime patterns which are longer than 10
### What changes were proposed in this pull request?

As mentioned in https://github.com/apache/spark/pull/28673 and suggested by cloud-fan at https://github.com/apache/spark/pull/28673#discussion_r432817075.

In this PR, we disable datetime patterns in the form of `y..y` and `Y..Y` whose lengths are greater than 10, to avoid a JDK bug described below.

The new datetime formatter introduces a silent data change like:

```sql
spark-sql> select from_unixtime(1, 'yyyyyyyyyyy-MM-dd');
NULL
spark-sql> set spark.sql.legacy.timeParserPolicy=legacy;
spark.sql.legacy.timeParserPolicy	legacy
spark-sql> select from_unixtime(1, 'yyyyyyyyyyy-MM-dd');
00000001970-01-01
spark-sql>
```

For patterns that support `SignStyle.EXCEEDS_PAD`, e.g. `y..y` (len >= 4), the `NumberPrinterParser` formats them as follows:

```java
switch (signStyle) {
  case EXCEEDS_PAD:
    if (minWidth < 19 && value >= EXCEED_POINTS[minWidth]) {
      buf.append(decimalStyle.getPositiveSign());
    }
    break;

           ....
```
where `minWidth` == `len(y..y)` and `EXCEED_POINTS` is:

```java
/**
 * Array of 10 to the power of n.
 */
static final long[] EXCEED_POINTS = new long[] {
    0L,
    10L,
    100L,
    1000L,
    10000L,
    100000L,
    1000000L,
    10000000L,
    100000000L,
    1000000000L,
    10000000000L,
};
```

So when `len(y..y)` is greater than 10, an `ArrayIndexOutOfBoundsException` will be raised.

At the caller side, for `from_unixtime` the exception is suppressed and a silent data change occurs; for `date_format`, the `ArrayIndexOutOfBoundsException` propagates to the caller.
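
As a standalone illustration of the JDK behavior described above (plain `java.time` code, not Spark's formatter):

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import scala.util.Try

// With 11 'y' letters, minWidth = 11, so NumberPrinterParser indexes EXCEED_POINTS(11),
// one past the end of the 11-element array, and formatting fails on the affected JDKs.
val formatter = DateTimeFormatter.ofPattern("y" * 11 + "-MM-dd")
println(Try(formatter.format(LocalDate.of(1970, 1, 1))))
// Expected: Failure(java.lang.ArrayIndexOutOfBoundsException: ...)
```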

### Why are the changes needed?
Fix the silent data change.

### Does this PR introduce _any_ user-facing change?

Yes, a `SparkUpgradeException` will take the place of the `null` result when the pattern contains 10 or more consecutive 'y' or 'Y' characters.

### How was this patch tested?

new tests

Closes #28684 from yaooqinn/SPARK-31867-2.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-05-31 12:34:39 +00:00