### What changes were proposed in this pull request?
Currently in JDBCTableCatalog, we ignore the table options when creating a table:
```
// TODO (SPARK-32405): Apply table options while creating tables in JDBC Table Catalog
if (!properties.isEmpty) {
logWarning("Cannot create JDBC table with properties, these properties will be " +
"ignored: " + properties.asScala.map { case (k, v) => s"$k=$v" }.mkString("[", ", ", "]"))
}
```
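As a hedged illustration (the catalog name, connection settings, and the `ENGINE` option below are placeholders, and the exact DDL form is only illustrative), this is the kind of statement whose OPTIONS should now be applied when the table is created, instead of only triggering the warning above:
```scala
// Register a JDBC v2 catalog (values are illustrative only).
spark.conf.set("spark.sql.catalog.mysql",
  "org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCTableCatalog")
spark.conf.set("spark.sql.catalog.mysql.url", "jdbc:mysql://localhost:3306/db")
spark.conf.set("spark.sql.catalog.mysql.driver", "com.mysql.jdbc.Driver")

// With this change, the OPTIONS are forwarded to the underlying database
// when the table is created, rather than being ignored.
spark.sql("CREATE TABLE mysql.db.users (id INT, name STRING) OPTIONS (ENGINE 'InnoDB')")
```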
### Why are the changes needed?
We need to apply the table options when creating a table.
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
Added a new test.
Closes#30154 from huaxingao/table_options.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR aims to support sorted shuffle block migration.
### Why are the changes needed?
Since the current shuffle block migration works in a random order, a failure during worker decommission affects all shuffles. It is better to finish the shuffles one by one to minimize the number of affected shuffles.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass the CIs with the newly added test case.
Closes#30293 from dongjoon-hyun/SPARK-33387.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
In `RenameBasedFSDataOutputStream.cancel`, we do two things in a single try/catch block: close the underlying stream and delete the temporary file. Closing the `OutputStream` can throw an `IOException`, so we could miss deleting the temporary file.
This patch proposes to delete the temporary file even when closing the underlying stream throws an error.
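A minimal sketch of the idea (method and variable names are placeholders, not the actual Spark code): separate the close and the delete so a failure in `close()` no longer skips the cleanup.
```scala
import java.io.{IOException, OutputStream}
import org.apache.hadoop.fs.{FileSystem, Path}

// Sketch only: delete the temporary file even if closing the stream throws.
def cancelWrite(underlying: OutputStream, fs: FileSystem, tempPath: Path): Unit = {
  try {
    underlying.close()
  } finally {
    try {
      fs.delete(tempPath, false)
    } catch {
      case e: IOException =>
        // Best effort: report and continue rather than silently leaking the temp file.
        println(s"Failed to delete temporary file $tempPath: ${e.getMessage}")
    }
  }
}
```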
### Why are the changes needed?
To avoid leaving temporary files during canceling writing in `RenameBasedFSDataOutputStream`.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Unit test.
Closes#30290 from viirya/SPARK-33384.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
There are two similar compilation warnings about procedure-like declaration in Scala 2.13:
```
[WARNING] [Warn] /spark/core/src/main/scala/org/apache/spark/HeartbeatReceiver.scala:70: procedure syntax is deprecated for constructors: add `=`, as in method definition
```
and
```
[WARNING] [Warn] /spark/core/src/main/scala/org/apache/spark/storage/BlockManagerDecommissioner.scala:211: procedure syntax is deprecated: instead, add `: Unit =` to explicitly declare `run`'s return type
```
This PR is the first part of resolving SPARK-33352:
- For constructor definitions, add `=` to convert to function syntax
- For method definitions without a return type, add `: Unit =` to convert to function syntax (see the before/after sketch below)
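A minimal before/after sketch (the class and method names are made up for illustration):
```scala
// Before: procedure-like syntax, which Scala 2.13 warns about.
class ReceiverBefore(name: String) {
  def this() { this("default") }        // constructor defined without '='
  def run() { println(s"run $name") }   // method defined without ': Unit ='
}

// After: explicit function syntax, valid in both Scala 2.12 and 2.13.
class ReceiverAfter(name: String) {
  def this() = this("default")
  def run(): Unit = println(s"run $name")
}
```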
### Why are the changes needed?
Eliminate compilation warnings in Scala 2.13; this change should also be compatible with Scala 2.12.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Pass the Jenkins or GitHub Action
Closes#30255 from LuciferYang/SPARK-29392-FOLLOWUP.1.
Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Javadocs updated for the encoder to include maps as a collection type
### Why are the changes needed?
The javadocs were not updated with the fix for SPARK-16706.
### Does this PR introduce _any_ user-facing change?
Yes, the javadocs are updated
### How was this patch tested?
sbt was run to ensure the changes meet scalastyle.
Closes#30274 from hannahkamundson/SPARK-32860.
Lead-authored-by: Hannah Amundson <amundson.hannah@heb.com>
Co-authored-by: Hannah <48397717+hannahkamundson@users.noreply.github.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
There are 4 fields in `MutableApplicationInfo` that seem useless:
- `coresGranted`
- `maxCores`
- `coresPerExecutor`
- `memoryPerExecutorMB`
They are always `None` and not reassigned.
So the main change of this PR is to clean up these useless fields in `MutableApplicationInfo`.
### Why are the changes needed?
Clean up useless variables.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Pass the Jenkins or GitHub Action
Closes#30251 from LuciferYang/SPARK-33347.
Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
### What changes were proposed in this pull request?
The changes in [SPARK-32501 Inconsistent NULL conversions to strings](https://issues.apache.org/jira/browse/SPARK-32501) introduced some behavior that I'd like to clean up a bit.
Here's sample code to illustrate the behavior I'd like to clean up:
```scala
val rows = Seq[String](null)
.toDF("value")
.withColumn("struct1", struct('value as "value1"))
.withColumn("struct2", struct('value as "value1", 'value as "value2"))
.withColumn("array1", array('value))
.withColumn("array2", array('value, 'value))
// Show the DataFrame using the "first" codepath.
rows.show(truncate=false)
+-----+-------+-------------+------+--------+
|value|struct1|struct2 |array1|array2 |
+-----+-------+-------------+------+--------+
|null |{ null}|{ null, null}|[] |[, null]|
+-----+-------+-------------+------+--------+
// Write the DataFrame to disk, then read it back and show it to trigger the "codegen" code path:
rows.write.parquet("rows")
spark.read.parquet("rows").show(truncate=false)
+-----+-------+-------------+-------+-------------+
|value|struct1|struct2 |array1 |array2 |
+-----+-------+-------------+-------+-------------+
|null |{ null}|{ null, null}|[ null]|[ null, null]|
+-----+-------+-------------+-------+-------------+
```
Notice:
1. If the first element of a struct is null, it is printed with a leading space (e.g. "\{ null\}"). I think it's preferable to print it without the leading space (e.g. "\{null\}"). This is consistent with how non-null values are printed inside a struct.
2. If the first element of an array is null, it is not printed at all in the first code path, and the "codegen" code path prints it with a leading space. I think both code paths should be consistent and print it without a leading space (e.g. "[null]").
The desired result of this PR is to produce the following output via both code paths:
```
+-----+-------+------------+------+------------+
|value|struct1|struct2 |array1|array2 |
+-----+-------+------------+------+------------+
|null |{null} |{null, null}|[null]|[null, null]|
+-----+-------+------------+------+------------+
```
This contribution is my original work and I license the work to the project under the project’s open source license.
### Why are the changes needed?
To correct errors and inconsistencies in how DataFrame.show() displays nulls inside arrays and structs.
### Does this PR introduce _any_ user-facing change?
Yes. This PR changes what is printed out by DataFrame.show().
### How was this patch tested?
I added new test cases in CastSuite.scala to cover the cases addressed by this PR.
Closes#30189 from stwhit/show_nulls.
Authored-by: Stuart White <stuart.white1@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
### What changes were proposed in this pull request?
Add executor peak JVM memory metrics to the executors page.
![image](https://user-images.githubusercontent.com/1633312/97767765-9121bf00-1adb-11eb-93c7-7912d9fe7826.png)
### Why are the changes needed?
Users can see executor peak JVM memory metrics on the executors page.
### Does this PR introduce _any_ user-facing change?
Yes. Users can see executor peak JVM memory metrics on the executors page.
### How was this patch tested?
Manually tested
Closes#30186 from warrenzhu25/23432.
Authored-by: Warren Zhu <warren.zhu25@gmail.com>
Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
### What changes were proposed in this pull request?
This PR proposes to introduce the `purge` option in `TableCatalog.dropTable` so that v2 catalogs can use the option if needed.
Related discussion: https://github.com/apache/spark/pull/30079#discussion_r510594110
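A minimal usage sketch (the catalog, namespace, table name, and provider are hypothetical): Spark's DDL already accepts the `PURGE` keyword, and with this change a v2 catalog can receive and act on that flag.
```scala
// Assumes a v2 catalog named "testcat" is configured; names are illustrative only.
spark.sql("CREATE TABLE testcat.ns.tbl (id INT) USING foo")
// The purge flag is now passed down to the v2 catalog instead of being ignored.
spark.sql("DROP TABLE testcat.ns.tbl PURGE")
```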
### Why are the changes needed?
Spark DDL supports passing the purge option to `DROP TABLE` command. However, the option is not used (ignored) for v2 catalogs.
### Does this PR introduce _any_ user-facing change?
This PR introduces a new API in `TableCatalog`.
### How was this patch tested?
Added a test.
Closes#30267 from imback82/purge_table.
Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Override the default SQL strings in the MsSQLServer JDBC dialect, according to the official documentation, for:
- ALTER TABLE RENAME COLUMN
- ALTER TABLE UPDATE COLUMN NULLABILITY

Also write MsSqlServer integration tests for JDBC (a rough sketch of the emitted SQL follows below).
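As a rough, non-authoritative sketch of the SQL Server syntax the dialect needs to emit (table, column, and type names are placeholders, and this is not the actual Spark dialect code):
```scala
// Hypothetical helper showing the statements the MsSqlServer dialect can generate.
object MsSqlServerDialectSketch {
  // SQL Server renames columns via the sp_rename stored procedure.
  def renameColumn(table: String, oldCol: String, newCol: String): String =
    s"EXEC sp_rename '$table.$oldCol', '$newCol', 'COLUMN'"

  // Changing nullability uses ALTER COLUMN and must restate the column type.
  def updateColumnNullability(table: String, col: String, colType: String, nullable: Boolean): String = {
    val nullability = if (nullable) "NULL" else "NOT NULL"
    s"ALTER TABLE $table ALTER COLUMN $col $colType $nullability"
  }
}
```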
### Why are the changes needed?
To add the support for alter table when interacting with MSSql Server.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Added tests.
Closes#30038 from ScrapCodes/mssql-dialect.
Authored-by: Prashant Sharma <prashsh1@in.ibm.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Fix the wrong URL and display name of the blocking thread on the threadDump page.
The blockingThreadId variable passed to the page should be of String type instead of Option type.
### Why are the changes needed?
The blocking thread ID on the UI page is not displayed correctly, and the corresponding URL does not redirect properly.
### Does this PR introduce _any_ user-facing change?
NO
### How was this patch tested?
This PR only involves minor changes to the page and does not affect other functions.
The manual test results are as follows: the thread name displayed on the page is correct, and clicking the link jumps to the corresponding URL.
![shows_ok](https://user-images.githubusercontent.com/52202080/98108177-89488d00-1ed6-11eb-9488-8446c3f38bad.gif)
Closes#30249 from akiyamaneko/thread-dump-improve.
Authored-by: neko <echohlne@gmail.com>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
### What changes were proposed in this pull request?
1. Rename `OffsetWindowSpec` to `OffsetWindowFunction`, as it's the base class for all offset based window functions.
2. Refine and add more comments.
3. Remove `isRelative` as it's useless.
### Why are the changes needed?
code refinement
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
existing tests
Closes#30261 from cloud-fan/window.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR fixes incorrect JIRA ids in `Analyzer.scala` introduced by SPARK-31670 (https://github.com/apache/spark/pull/28490)
```scala
- // SPARK-31607: Resolve Struct field in selectedGroupByExprs/groupByExprs and aggregations
+ // SPARK-31670: Resolve Struct field in selectedGroupByExprs/groupByExprs and aggregations
```
### Why are the changes needed?
Fix the wrong information.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
This is a comment change. Manually review.
Closes#30269 from dongjoon-hyun/SPARK-31670-MINOR.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR aims to update SBT from 1.4.1 to 1.4.2.
### Why are the changes needed?
This will bring the latest bug fixes.
- https://github.com/sbt/sbt/releases/tag/v1.4.2
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass the CIs.
Closes#30268 from williamhyun/sbt.
Authored-by: William Hyun <williamhyun3@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Remove duplicated code in `ResolveOutputRelation` by adding `V2WriteCommand.withNewQuery`.
### Why are the changes needed?
code cleanup
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
existing tests
Closes#30264 from cloud-fan/ds-minor.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Currently when run in `cluster` mode on YARN, the Spark `yarn.Client` will print out the application report into the logs, to be easily viewed by users. For example:
```
INFO yarn.Client:
client token: Token { kind: YARN_CLIENT_TOKEN, service: }
diagnostics: N/A
ApplicationMaster host: X.X.X.X
ApplicationMaster RPC port: 0
queue: default
start time: 1602782566027
final status: UNDEFINED
tracking URL: http://hostname:8888/proxy/application_<id>/
user: xkrogen
```
I propose adding, alongside the application report, some additional lines like:
```
Driver Logs (stdout): http://hostname:8042/node/containerlogs/container_<id>/xkrogen/stdout?start=-4096
Driver Logs (stderr): http://hostname:8042/node/containerlogs/container_<id>/xkrogen/stderr?start=-4096
```
This information isn't contained in the `ApplicationReport`, so it's necessary to query the ResourceManager REST API. For now I have added this as an always-on feature, but if there is any concern about adding this REST dependency, I think hiding this feature behind an off-by-default flag is reasonable.
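A rough sketch of the REST lookup described above (host names, the application id, and the JSON handling are placeholders; the real implementation lives in `yarn.Client`):
```scala
import scala.io.Source

// The RM "appattempts" endpoint exposes a log link for the AM container, from
// which the stdout/stderr URLs (with a ?start=-4096 tail) can be derived.
val rmBase = "http://resourcemanager.example.com:8088"
val appId = "application_1604529046091_0001" // placeholder application id
val attemptsJson =
  Source.fromURL(s"$rmBase/ws/v1/cluster/apps/$appId/appattempts").mkString
println(attemptsJson) // parse the "logsLink" field of the latest attempt from here
```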
### Why are the changes needed?
Typically, the tracking URL can be used to find the logs of the ApplicationMaster/driver while the application is running. Later, the Spark History Server can be used to track this information down, using the stdout/stderr links on the Executors page.
However, in the situation when the driver crashed _before_ writing out a history file, the SHS may not be aware of this application, and thus does not contain links to the driver logs. When this situation arises, it can be difficult for users to debug further, since they can't easily find their driver logs.
It is possible to reach the logs by using the `yarn logs` commands, but the average Spark user isn't aware of this and shouldn't have to be.
With this information readily available in the logs, users can quickly jump to their driver logs, even if it crashed before the SHS became aware of the application. This has the additional benefit of providing a quick way to access driver logs, which often contain useful information, in a single click (instead of navigating through the Spark UI).
### Does this PR introduce _any_ user-facing change?
Yes, some additional print statements will be created in the application report when using YARN in cluster mode.
### How was this patch tested?
Added unit tests for the parsing logic in `yarn.ClientSuite`. Also tested against a live cluster. When the driver is running:
```
INFO Client: Application report for application_XXXXXXXXX_YYYYYY (state: RUNNING)
INFO Client:
client token: Token { kind: YARN_CLIENT_TOKEN, service: }
diagnostics: N/A
ApplicationMaster host: host.example.com
ApplicationMaster RPC port: ######
queue: queue_name
start time: 1604529046091
final status: UNDEFINED
tracking URL: http://host.example.com:8080/proxy/application_XXXXXXXXX_YYYYYY/
user: xkrogen
Driver Logs (stdout): http://host.example.com:8042/node/containerlogs/container_e07_XXXXXXXXX_YYYYYY_01_000001/xkrogen/stdout?start=-4096
Driver Logs (stderr): http://host.example.com:8042/node/containerlogs/container_e07_XXXXXXXXX_YYYYYY_01_000001/xkrogen/stderr?start=-4096
INFO Client: Application report for application_XXXXXXXXX_YYYYYY (state: RUNNING)
```
I confirmed that when the driver has not yet launched, the report does not include the two Driver Logs items. Will omit the output here for brevity since it looks the same.
Closes#30096 from xkrogen/xkrogen-SPARK-33185-yarn-client-print.
Authored-by: Erik Krogen <xkrogen@apache.org>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
### What changes were proposed in this pull request?
Update SQL migration guide for SPARK-33290
### Why are the changes needed?
Make the change better documented.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
N/A
Closes#30256 from sunchao/SPARK-33290-2.
Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR changes the behavior of the GitHub Actions job that caches dependencies.
SPARK-33226 upgraded sbt to 1.4.1.
As of 1.3.0, sbt uses Coursier as the dependency resolver / fetcher.
So let's change the dependency cache configuration for the GitHub Actions job.
### Why are the changes needed?
To make build faster with Coursier for the GitHub Actions job.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Should be done by GitHub Actions itself.
Closes#30259 from sarutak/coursier-cache.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Fix a small bug in `V2WriteCommand.resolved`. It should always require the `table` and `query` to be resolved.
### Why are the changes needed?
To prevent potential bugs where we skip resolving the input query.
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
a new test
Closes#30265 from cloud-fan/ds-minor-2.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
There's a concept of 'read-only' and 'read+write' state stores in Spark which is only defined "implicitly". Spark doesn't prevent writes to a 'read-only' state store; Spark just assumes a read-only stateful operator will not modify the state store. Given it's not defined explicitly, the state store instance has to be implemented as 'read+write' even when it's being used as 'read-only', which sometimes brings confusion.
For example, abort() in HDFSBackedStateStore - d38f816748/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/state/HDFSBackedStateStoreProvider.scala (L143-L155)
The comment sounds as if the statement works differently between 'read-only' and 'read+write', but that's not true, as both state stores have state initialized as UPDATING (no difference). So the 'read-only' state also creates the temporary file, initializes output streams to write to the temporary file, closes the output streams, and finally deletes the temporary file. These unnecessary operations are done per batch/partition.
This patch explicitly defines 'read-only' StateStore, and enables state store provider to create 'read-only' StateStore instance if requested. Relevant code paths are modified, as well as 'read-only' StateStore implementation for HDFSBackedStateStore is introduced. The new implementation gets rid of unnecessary operations explained above.
In terms of backward compatibility, the only thing changed on the public API side is `StateStoreProvider`. The trait `StateStoreProvider` has to be changed to allow requesting a 'read-only' StateStore; this patch adds a default implementation which leverages the 'read+write' StateStore but wraps it with a 'write-protected' StateStore instance, so that custom providers don't need to change their code to reflect the change. But if the providers can optimize for a read-only workload, they'll be happy to make a change.
Please note that this patch makes ReadOnlyStateStore extend StateStore and be referred to as StateStore, as StateStore is used in so many places and it's not easy to support both traits if we differentiate them. So unfortunately the write methods are still exposed for read-only state; they just throw UnsupportedOperationException.
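A conceptual sketch of the 'write-protected' wrapper idea with deliberately simplified traits (this is not Spark's actual `StateStore`/`StateStoreProvider` API):
```scala
// Simplified stand-ins for the real traits, for illustration only.
trait SimpleStateStore {
  def get(key: String): Option[String]
  def put(key: String, value: String): Unit
}

// Read-only view over an existing read+write store: writes throw.
class WriteProtectedStore(delegate: SimpleStateStore) extends SimpleStateStore {
  def get(key: String): Option[String] = delegate.get(key)
  def put(key: String, value: String): Unit =
    throw new UnsupportedOperationException("Cannot mutate a read-only state store")
}

trait SimpleProvider {
  def getStore(): SimpleStateStore
  // Default implementation: custom providers keep working unchanged, but can
  // override this to build a genuinely cheaper read-only store.
  def getReadStore(): SimpleStateStore = new WriteProtectedStore(getStore())
}
```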
### Why are the changes needed?
The new API opens the chance to optimize a read-only state store instance compared with a read+write state store instance. HDFSBackedStateStoreProvider is modified to provide a read-only version of the state store which doesn't deal with the temporary file or the state machine.
### Does this PR introduce _any_ user-facing change?
Clearly "no" for most end users, and also "no" for custom state store providers as it doesn't touch trait `StateStore` as well as provides default implementation for added method in trait `StateStoreProvider`.
### How was this patch tested?
Modified UT. Existing UTs ensure the change doesn't break anything.
Closes#26935 from HeartSaVioR/SPARK-30294.
Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
### What changes were proposed in this pull request?
Update join type in stream static joins code examples in structured streaming programming guide.
1) Scala, Java and Python examples have a common issue.
The join keyword is "right_join"; it should be "left_outer".
_Reasons:_
a) This code snippet is an example of a "left outer join", as the streaming df is on the left and the static df is on the right. Also, a right outer join between a streaming df (left) and a static df (right) is not supported.
b) The keyword "right_join/left_join" is unsupported and it should be "right_outer/left_outer".
So, all of these code snippets have been updated to "left_outer".
2) The R example is correct, but it shows a "right_outer" join with the static df (left) and the streaming df (right).
It is changed to "left_outer" to make it consistent with the other three examples in Scala, Java and Python (see the hedged snippet below).
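For reference, a hedged version of the corrected snippet (the DataFrame definitions and the join column are placeholders, not the guide's exact code):
```scala
val staticDf = spark.read.parquet("/path/to/static")
val streamingDf = spark.readStream.schema(staticDf.schema).parquet("/path/to/stream")

// Streaming DataFrame on the left, static DataFrame on the right:
// "left_outer" is a supported join type here, while "right_outer" (and the
// non-existent "right_join"/"left_join" keywords) are not.
streamingDf.join(staticDf, Seq("type"), "left_outer")
```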
### Why are the changes needed?
To fix the mistake in example code of documentation.
### Does this PR introduce _any_ user-facing change?
Yes, it is a user-facing change (but documentation update only).
**Screenshots 1: Scala/Java/python example (similar issue)**
_Before:_
<img width="941" alt="Screenshot 2020-11-05 at 12 16 09 AM" src="https://user-images.githubusercontent.com/62717942/98155351-19e59400-1efc-11eb-8142-e6a25a5e6497.png">
_After:_
<img width="922" alt="Screenshot 2020-11-05 at 12 17 12 AM" src="https://user-images.githubusercontent.com/62717942/98155503-5d400280-1efc-11eb-96e1-5ba0f3c35c82.png">
**Screenshots 2: R example (Make it consistent with above change)**
_Before:_
<img width="896" alt="Screenshot 2020-11-05 at 12 19 57 AM" src="https://user-images.githubusercontent.com/62717942/98155685-ac863300-1efc-11eb-93bc-b7ca4dd34634.png">
_After:_
<img width="919" alt="Screenshot 2020-11-05 at 12 20 51 AM" src="https://user-images.githubusercontent.com/62717942/98155739-c0ca3000-1efc-11eb-8f95-a7538fa784b7.png">
### How was this patch tested?
The change was tested locally.
1) cd docs/
SKIP_API=1 jekyll build
2) Verify docs/_site/structured-streaming-programming-guide.html file in browser.
Closes#30252 from sarveshdave1/doc-update-stream-static-joins.
Authored-by: Sarvesh Dave <sarveshdave1@gmail.com>
Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
### What changes were proposed in this pull request?
This PR removes the old Probot Autolabeler labeling configuration, as the probot autolabeler has been deprecated. I've updated the configs in Iceberg and in Avro, and we also need to update here. This PR adds an additional workflow for labeling PRs and migrates the old probot config to the new format. Unfortunately, because certain features have not been released upstream, we will not get the _exact_ behavior as before. I have documented where that is and what changes are needed, and in the associated ticket I've also discussed other options and why I think this is the best way to go. A follow-up ticket is definitely needed to get the original behavior back in these few cases, but PRs have not been labeled for almost a month, so it's probably best to get it right 95% of the time and occasionally have some UI-related PRs labeled as `CORE` while the issue is resolved upstream and/or further investigated.
### Why are the changes needed?
The probot autolabeler is dead and will not be maintained going forward. This has been confirmed with github user [at]mithro in an issue in their repository.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
To test this PR, I first merged the config into my local fork. I then edited it several times and ran tests on that.
Unfortunately, I've overwritten my fork with the apache repo in order to create a proper PR. However, I've also added the config for the same thing in the Iceberg repo as well as the Avro repo.
I have now merged this PR into my local repo and will be running some tests on edge cases there and for validating in general:
- [Check that the SQL label is applied for changes directly below repo root's sql directory](https://github.com/kbendick/spark/pull/16) ✅
- [Check that the structured streaming label is applied](https://github.com/kbendick/spark/pull/20) ✅
- [Check that a wildcard at the end of a pattern will match nested files](https://github.com/kbendick/spark/pull/19) ✅
- [Check that the rule **/*pom.xml will match the root pom.xml file](https://github.com/kbendick/spark/pull/25) ✅
I've also discovered that we're likely not killing GitHub Actions runs (like large tests etc.) when users push to their PR. In most cases, I see that a user has to mark something as "OK to test", but it still seems like we might want to discuss whether or not we should add a cancellation step in order to save time / capacity on the runners. If so desired, we would add an action in each workflow that cancels old runs when a `push` event occurs on a PR. This will likely make waiting for test runners much faster if tests are automatically rerun on push by anybody (such as PMCs, PRs that have been marked OK to test, etc.). We could potentially free a large number of resources if a cancellation step were added to all of the workflows in the Apache account (as GitHub Actions API limits are set at the account level).
Admittedly, the fact that the "old" workflow runs weren't cancelled could be because I was working in a fork, but given that there are explicit actions that can be added to the start of workflows to cancel old PR workflows, and given that we don't have them configured, this is likely the case in this repo (and in most `apache` repos as well), at least under certain circumstances (e.g. repos that don't have "Ok to test"-like webhooks, as one example).
This is a separate issue though, which I can bring up on the mailing list once I'm done with this PR. Unfortunately I've been very busy the past two weeks, but if somebody else wanted to work on that I would be happy to support with any knowledge I have.
The last Apache repo to still have the probot autolabeler in it is Beam, at which point we can have Gavin from ASF Infra remove the permissions for the probot autolabeler entirely. See the associated JIRA ticket for the links to other tickets, like the one for ASF Infra to remove the dead probot autolabeler's read and write permissions to our PRs in the Apache organization.
Closes#30244 from kbendick/begin-migration-to-github-labeler-action.
Authored-by: Kyle Bendickson <kjbendickson@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This change is to support user provided nullable Avro schema for data with non-nullable catalyst schema in Avro writing.
Without this change, when users try to use a nullable Avro schema to write data with a non-nullable catalyst schema, it will throw an `IncompatibleSchemaException` with a message like `Cannot convert Catalyst type StringType to Avro type ["null","string"]`. With this change it will assume that the data is non-nullable, log a warning message for the nullability difference and serialize the data to Avro format with the nullable Avro schema provided.
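A hedged example of the scenario (the output path and the Avro schema contents are illustrative): the catalyst schema of `id` below is non-nullable, while the user-provided Avro schema declares the field as a nullable union.
```scala
import spark.implicits._

val nullableAvroSchema =
  """{
    |  "type": "record",
    |  "name": "topLevelRecord",
    |  "fields": [
    |    {"name": "id", "type": ["null", "long"], "default": null}
    |  ]
    |}""".stripMargin

// LongType from a Seq[Long] is non-nullable; before this change the write below
// failed with IncompatibleSchemaException, now it succeeds and logs a warning.
val df = Seq(1L, 2L, 3L).toDF("id")
df.write.format("avro").option("avroSchema", nullableAvroSchema).save("/tmp/avro_nullable_out")
```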
### Why are the changes needed?
This change is needed because sometimes our users do not have full control over the nullability of the Avro schemas they use, and this change provides them with the flexibility.
### Does this PR introduce _any_ user-facing change?
Yes. Users are allowed to use nullable Avro schemas for data with non-nullable catalyst schemas in Avro writing after the change.
### How was this patch tested?
Added unit tests.
Closes#30224 from bozhang2820/avro-nullable.
Authored-by: Bo Zhang <bo.zhang@databricks.com>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
### What changes were proposed in this pull request?
This PR adds a check to RowReader#hasNextRow such that multiple calls to RowReader#hasNextRow with no intervening call to RowReader#nextRow will avoid consuming more than 1 record.
This PR also modifies RowReader#nextRow such that consecutive calls will return new rows (previously consecutive calls would return the same row).
### Why are the changes needed?
SPARK-32346 slightly refactored the AvroFileFormat and AvroPartitionReaderFactory to use a new iterator-like trait called AvroUtils#RowReader. RowReader#hasNextRow consumes a raw input record and stores the deserialized row for the next call to RowReader#nextRow. Unfortunately, sometimes hasNextRow is called twice before nextRow is called, resulting in a lost row.
For example (which assumes V1 Avro reader):
```scala
val df = spark.range(0, 25).toDF("index")
df.write.mode("overwrite").format("avro").save("index_avro")
val loaded = spark.read.format("avro").load("index_avro")
// The following will give the expected size
loaded.collect.size
// The following will give the wrong size
loaded.orderBy("index").collect.size
```
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Added tests, which fail without the fix.
Closes#30221 from bersprockets/avro_iterator_play.
Authored-by: Bruce Robbins <bersprockets@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR fixes the issue that spark-shell doesn't work if it's built with `sbt package` (without any profiles specified).
It's because hadoop-client-runtime.jar isn't copied to assembly/target/scala-2.12/jars.
```
$ bin/spark-shell
Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/hadoop/shaded/com/ctc/wstx/io/InputBootstrapper
at org.apache.spark.deploy.SparkHadoopUtil$.newConfiguration(SparkHadoopUtil.scala:426)
at org.apache.spark.deploy.SparkSubmit.$anonfun$prepareSubmitEnvironment$2(SparkSubmit.scala:342)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.deploy.SparkSubmit.prepareSubmitEnvironment(SparkSubmit.scala:342)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:877)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1013)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1022)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: org.apache.hadoop.shaded.com.ctc.wstx.io.InputBootstrapper
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
```
### Why are the changes needed?
This is a bug.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Ran spark-shell and confirmed it works.
Closes#30250 from sarutak/copy-runtime-sbt.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR proposes to register the executor source with the Spark metrics system when running in local mode.
### Why are the changes needed?
The Apache Spark metrics system provides many useful insights on the Spark workload.
In particular, the [executor source metrics](https://github.com/apache/spark/blob/master/docs/monitoring.md#component-instance--executor) provide detailed info, including the number of active tasks, I/O metrics, and several task metrics details. The executor source metrics, contrary to other sources (for example the ExecutorMetrics source), are not available when running in local mode.
Having executor metrics in local mode can be useful when testing and troubleshooting Spark workloads in a development environment. The metrics can be fed to a dashboard to see the evolution of resource usage and can be used to troubleshoot performance,
as [in this example](https://github.com/cerndb/spark-dashboard).
Currently users will have to deploy on a cluster to be able to collect executor source metrics, while the possibility of having them in local mode is handy for testing.
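A small, hedged illustration of how this can be exercised (the application name and the query are placeholders):
```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("local-executor-metrics-demo")
  .getOrCreate()

spark.range(0, 1000000L).selectExpr("sum(id)").show()
// While the application is running, the executor source metrics should now be
// visible at http://localhost:4040/metrics/json/ alongside the other sources.
```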
### Does this PR introduce _any_ user-facing change?
- This PR exposes executor source metrics data when running in local mode.
### How was this patch tested?
- Manually tested by running in local mode and inspecting the metrics listed in http://localhost:4040/metrics/json/
- Also added a test in `SourceConfigSuite`
Closes#28528 from LucaCanali/metricsWithLocalMode.
Authored-by: Luca Canali <luca.canali@cern.ch>
Signed-off-by: Thomas Graves <tgraves@apache.org>
### What changes were proposed in this pull request?
This PR aims to fix `semanticEquals` so that it works correctly on `GetMapValue` expressions having literal maps with `ArrayBasedMapData` and `GenericArrayData`.
### Why are the changes needed?
This is a regression from Apache Spark 1.6.x.
```scala
scala> sc.version
res1: String = 1.6.3
scala> sqlContext.sql("SELECT map('k1', 'v1')[k] FROM t GROUP BY map('k1', 'v1')[k]").show
+---+
|_c0|
+---+
| v1|
+---+
```
Apache Spark 2.x ~ 3.0.1 raise `RuntimeException` for the following queries.
```sql
CREATE TABLE t USING ORC AS SELECT map('k1', 'v1') m, 'k1' k
SELECT map('k1', 'v1')[k] FROM t GROUP BY 1
SELECT map('k1', 'v1')[k] FROM t GROUP BY map('k1', 'v1')[k]
SELECT map('k1', 'v1')[k] a FROM t GROUP BY a
```
**BEFORE**
```scala
Caused by: java.lang.RuntimeException: Couldn't find k#3 in [keys: [k1], values: [v1][k#3]#6]
at scala.sys.package$.error(package.scala:27)
at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1$$anonfun$applyOrElse$1.apply(BoundAttribute.scala:85)
at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1$$anonfun$applyOrElse$1.apply(BoundAttribute.scala:79)
at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:52)
```
**AFTER**
```sql
spark-sql> SELECT map('k1', 'v1')[k] FROM t GROUP BY 1;
v1
Time taken: 1.278 seconds, Fetched 1 row(s)
spark-sql> SELECT map('k1', 'v1')[k] FROM t GROUP BY map('k1', 'v1')[k];
v1
Time taken: 0.313 seconds, Fetched 1 row(s)
spark-sql> SELECT map('k1', 'v1')[k] a FROM t GROUP BY a;
v1
Time taken: 0.265 seconds, Fetched 1 row(s)
```
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass the CIs with the newly added test case.
Closes#30246 from dongjoon-hyun/SPARK-33338.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR changes some usages of `Seq` in `PostgresIntegrationSuite` to `scala.collection.Seq`.
When I run `docker-integration-test`, I noticed that `PostgresIntegrationSuite` failed due to `ClassCastException`.
The reason is the same as what is resolved in SPARK-29292.
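For context, a hedged illustration of the Scala 2.13 pitfall this works around (the column name is a placeholder): the unqualified `Seq` alias is `immutable.Seq` in 2.13, so a runtime cast of a generic sequence can fail, while declaring `scala.collection.Seq` stays compatible with both 2.12 and 2.13.
```scala
import org.apache.spark.sql.Row

// Row.getSeq returns a scala.collection.Seq; asking for the 2.13 default
// (immutable) Seq instead can end in a ClassCastException at runtime.
def readValues(row: Row): scala.collection.Seq[Int] =
  row.getAs[scala.collection.Seq[Int]]("values")
```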
### Why are the changes needed?
To pass `docker-integration-test` for Scala 2.13.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Ran the fixed `PostgresIntegrationSuite` and confirmed it finished successfully.
Closes#30166 from sarutak/fix-toseq-postgresql.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR changes `HiveExternalCatalogVersionsSuite` to, by default, use a standard temporary directory to store the Spark binaries that it localizes. It additionally adds a new System property, `spark.test.cache-dir`, which can be used to define a static location into which the Spark binary will be localized to allow for sharing between test executions. If the System property is used, the downloaded binaries won't be deleted after the test runs.
### Why are the changes needed?
In SPARK-22356 (PR #19579), the `sparkTestingDir` used by `HiveExternalCatalogVersionsSuite` became hard-coded to enable re-use of the downloaded Spark tarball between test executions:
```
// For local test, you can set `sparkTestingDir` to a static value like `/tmp/test-spark`, to
// avoid downloading Spark of different versions in each run.
private val sparkTestingDir = new File("/tmp/test-spark")
```
However this doesn't work, since it gets deleted every time:
```
override def afterAll(): Unit = {
  try {
    Utils.deleteRecursively(wareHousePath)
    Utils.deleteRecursively(tmpDataDir)
    Utils.deleteRecursively(sparkTestingDir)
  } finally {
    super.afterAll()
  }
}
```
It's bad that we're hard-coding to a `/tmp` directory, as in some cases this is not the proper place to store temporary files. We're not currently making any good use of it.
### Does this PR introduce _any_ user-facing change?
Developer-facing changes only, as this is in a test.
### How was this patch tested?
The test continues to execute as expected.
Closes#30122 from xkrogen/xkrogen-SPARK-33214-hiveexternalversioncatalogsuite-fix.
Authored-by: Erik Krogen <xkrogen@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR proposes to migrate `ANALYZE TABLE` and `ANALYZE TABLE ... FOR COLUMNS` to use `UnresolvedTableOrView` to resolve the table/view identifier. This allows consistent resolution rules (temp view first, etc.) to be applied for both v1/v2 commands. More info about the consistent resolution rule proposal can be found in [JIRA](https://issues.apache.org/jira/browse/SPARK-29900) or [proposal doc](https://docs.google.com/document/d/1hvLjGA8y_W_hhilpngXVub1Ebv8RsMap986nENCFnrg/edit?usp=sharing).
Note that `ANALYZE TABLE` is not supported for v2 tables.
### Why are the changes needed?
The changes allow consistent resolution behavior when resolving the table/view identifier. For example, the following is the current behavior:
```scala
sql("create temporary view t as select 1")
sql("create database db")
sql("create table db.t using csv as select 1")
sql("use db")
sql("ANALYZE TABLE t compute statistics") // Succeeds
```
With this change, ANALYZE TABLE above fails with the following:
```
org.apache.spark.sql.AnalysisException: t is a temp view not table or permanent view.; line 1 pos 0
at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveTempViews$$anonfun$apply$7.$anonfun$applyOrElse$40(Analyzer.scala:872)
at scala.Option.map(Option.scala:230)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveTempViews$$anonfun$apply$7.applyOrElse(Analyzer.scala:870)
at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveTempViews$$anonfun$apply$7.applyOrElse(Analyzer.scala:856)
```
This is expected, since the temporary view is resolved first and `ANALYZE TABLE` doesn't support a temporary view.
### Does this PR introduce _any_ user-facing change?
After this PR, `ANALYZE TABLE t` is resolved to a temp view `t` instead of table `db.t`.
### How was this patch tested?
Updated existing tests.
Closes#30229 from imback82/parse_v1table.
Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Add a `query.resolved` check before converting the Hive relation.
### Why are the changes needed?
For a better error message.
```
CREATE TABLE t STORED AS PARQUET AS
SELECT * FROM (
SELECT c3 FROM (
SELECT c1, c2 from values(1,2) t(c1, c2)
)
)
```
Before this PR, we get the following error message:
```
org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to toAttribute on unresolved object, tree: *
at org.apache.spark.sql.catalyst.analysis.Star.toAttribute(unresolved.scala:244)
at org.apache.spark.sql.catalyst.plans.logical.Project$$anonfun$output$1.apply(basicLogicalOperators.scala:52)
at org.apache.spark.sql.catalyst.plans.logical.Project$$anonfun$output$1.apply(basicLogicalOperators.scala:52)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
at scala.collection.immutable.List.foreach(List.scala:392)
```
### Does this PR introduce _any_ user-facing change?
Yes, the error message changed.
### How was this patch tested?
Add test.
Closes#30230 from ulysses-you/SPARK-33323.
Authored-by: ulysses <youxiduo@weidian.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Rename `TableWriteExec` in `WriteToDataSourceV2Exec.scala` to `TableWriteExecHelper`.
### Why are the changes needed?
See [discussion](https://github.com/apache/spark/pull/30193#discussion_r516412653). The former is too general.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
N/A
Closes#30235 from sunchao/SPARK-33293-2.
Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
In older versions of Spark, the sorting function on the storage UI page works normally, but sorting in the new version causes the header content to be lost, so this PR fixes the bug.
### Why are the changes needed?
The header of the table on the page is similar to the following; **note that each `th` contains a `span` element**:
```html
<thead>
<tr>
....
<th width="" class="">
<span data-toggle="tooltip" title="" data-original-title="StorageLevel displays where the persisted RDD is stored, format of the persisted RDD (serialized or de-serialized) andreplication factor of the persisted RDD">
Storage Level
</span>
</th>
.....
</tr>
</thead>
```
Since [PR#26136](https://github.com/apache/spark/pull/26136), if the `th` in the table itself contains a `span` element, the `span` will be deleted directly after clicking to sort, and the original header content will be lost.
There are three problems in `sorttable.js`:
1. `sortrevind.class = "sorttable_sortrevind"` in [sorttab.js#107](9d5e48ea95/core/src/main/resources/org/apache/spark/ui/static/sorttable.js (L107)) and `sortfwdind.class = "sorttable_sortfwdind"` in [sorttab.js#125](9d5e48ea95/core/src/main/resources/org/apache/spark/ui/static/sorttable.js (L125))
The sorttable_xx value should be assigned to `className` instead of `class`, as the JavaScript uses `rowlists[j].className.search` rather than `rowlists[j].class.search` to determine whether the component has a sorting flag or not.
2. `rowlists[j].className.search(/\sorttable_sortrevind\b/)` in [sorttab.js#120](9d5e48ea95/core/src/main/resources/org/apache/spark/ui/static/sorttable.js (L120)) was wrong. The original intention is to search whether `className` contains the word `sorttable_sortrevind` , but the expression is wrong, it should be `\bsorttable_sortrevind\b` instead of `\sorttable_sortrevind\b`
3. The if-check statement in the following code snippet ([sorttab.js#141](9d5e48ea95/core/src/main/resources/org/apache/spark/ui/static/sorttable.js (L141))) was wrong. **If the `search` function does not find the target, it returns -1, but Boolean(-1) actually evaluates to true**. This statement will cause the span to be deleted even if it does not contain `sorttable_sortfwdind` or `sorttable_sortrevind`.
```javascript
rowlists = this.parentNode.getElementsByTagName("span");
for (var j=0; j < rowlists.length; j++) {
if (rowlists[j].className.search(/\bsorttable_sortfwdind\b/)
|| rowlists[j].className.search(/\sorttable_sortrevind\b/) ) {
rowlists[j].parentNode.removeChild(rowlists[j]);
}
}
```
### Does this PR introduce _any_ user-facing change?
NO.
### How was this patch tested?
The manual test result of the ui page is as below:
![fix sorted](https://user-images.githubusercontent.com/52202080/97543194-daeaa680-1a02-11eb-8b11-8109c3e4e9a3.gif)
Closes#30182 from akiyamaneko/ui_storage_sort_error.
Authored-by: neko <echohlne@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
This PR adds support for passing `Column`s as input to PySpark sorting functions.
### Why are the changes needed?
According to SPARK-26979, PySpark functions should support both Column and str arguments, when possible.
### Does this PR introduce _any_ user-facing change?
PySpark users can now provide both `Column` and `str` as an argument for `asc*` and `desc*` functions.
### How was this patch tested?
New unit tests.
Closes#30227 from zero323/SPARK-33257.
Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR aims to upgrade `Kubernetes-client` from 4.10.3 to 4.11.1.
### Why are the changes needed?
This upgrades the dependency for Apache Spark 3.1.0.
Since 4.12.0 is still new and has breaking API changes, this PR chooses the latest compatible one.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass the all CIs including K8s IT.
Closes#30233 from dongjoon-hyun/SPARK-33324.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR proposes to migrate to [NumPy documentation style](https://numpydoc.readthedocs.io/en/latest/format.html), see also SPARK-33243.
While I am migrating, I also fixed some Python type hints accordingly.
### Why are the changes needed?
For better documentation as text itself, and generated HTMLs
### Does this PR introduce _any_ user-facing change?
Yes, they will see a better format of HTMLs, and better text format. See SPARK-33243.
### How was this patch tested?
Manually tested via running `./dev/lint-python`.
Closes#30181 from HyukjinKwon/SPARK-33250.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Remove the JSON formatted schema from comments for `from_json()` in Scala/Python APIs.
Closes#30201
### Why are the changes needed?
Schemas in JSON format are internal (not documented). They shouldn't be recommended for usage.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By linters.
Closes#30226 from MaxGekk/from_json-common-schema-parsing-2.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
In the PR, I propose to group `DateType` and `TimestampType` together in checking time zone needs in the `Cast.needsTimeZone()` method.
### Why are the changes needed?
To improve code maintainability.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
By the existing test `"SPARK-33306: Timezone is needed when cast Date to String"`.
Closes#30223 from MaxGekk/WangGuangxin-SPARK-33306-followup.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR adds all built-in SerDes to `HiveSerDeReadWriteSuite`.
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL#LanguageManualDDL-RowFormats&SerDe
### Why are the changes needed?
We will upgrade Parquet, ORC and Avro, and need to ensure compatibility.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
N/A
Closes#30228 from wangyum/SPARK-33319.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
As a followup to the comment at https://github.com/apache/spark/pull/29804#issuecomment-700650620, here we add the physical plan rule DisableUnnecessaryBucketedScan into AQE AdaptiveSparkPlanExec.queryStagePreparationRules, to make auto bucketed scan work with AQE (a usage sketch follows after the list below).
The change is mostly in:
* `AdaptiveSparkPlanExec.scala`: add physical plan rule `DisableUnnecessaryBucketedScan`
* `DisableUnnecessaryBucketedScan.scala`: propagate logical plan link for the file source scan exec operator, otherwise we lose the logical plan link information when AQE is enabled, and will get exception [here](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/adaptive/AdaptiveSparkPlanExec.scala#L176). (for example, for query `SELECT * FROM bucketed_table` with AQE is enabled)
* `DisableUnnecessaryBucketedScanSuite.scala`: add new test suite for AQE enabled - `DisableUnnecessaryBucketedScanWithoutHiveSupportSuiteAE`, and changed some of tests to use `AdaptiveSparkPlanHelper.find/collect`, to make the plan verification work when AQE enabled.
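As a hedged usage sketch (the table name is a placeholder, and the second configuration key is quoted from memory, so verify it against the SQL configuration docs):
```scala
// With both flags on, the preparation rule added here can disable an
// unnecessary bucketed scan even when AQE re-plans the query.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.sources.bucketing.autoBucketedScan.enabled", "true")
spark.sql("SELECT * FROM bucketed_table").collect()
```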
### Why are the changes needed?
It's reasonable to support disabling unnecessary bucketed scan when AQE is enabled; this helps optimize the query in that case.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Added unit test in `DisableUnnecessaryBucketedScanSuite`.
Closes#30200 from c21/auto-bucket-aqe.
Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Support column rename for the MySQL dialect.
### Why are the changes needed?
At the moment, it does not work for MySQL version 5.x, so we should throw a proper exception for that case.
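Illustrative only (table and column names are placeholders): the statement the dialect can emit differs by server version, which is why 5.x needs an explicit error.
```scala
// MySQL 8.0+ supports renaming a column directly:
val rename80 = "ALTER TABLE employees RENAME COLUMN surname TO last_name"
// MySQL 5.x only offers CHANGE, which requires restating the full column
// definition that the JDBC dialect does not know, hence the proper exception.
val rename5x = "ALTER TABLE employees CHANGE surname last_name VARCHAR(100)"
```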
### Does this PR introduce _any_ user-facing change?
Yes, `column rename` with the MySQL dialect should work correctly.
### How was this patch tested?
Added tests for rename column.
Ran the tests against both versions of MySQL:
* `export MYSQL_DOCKER_IMAGE_NAME=mysql:5.7.31`
* `export MYSQL_DOCKER_IMAGE_NAME=mysql:8.0`
Closes#30142 from ScrapCodes/mysql-dialect-rename.
Authored-by: Prashant Sharma <prashsh1@in.ibm.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR modifies `R/pkg/tests/run-all.R` by:
- Removing `testthat` 1.x support, as Jenkins has been upgraded to 2.x with SPARK-30637 and this code is no longer relevant.
- Adding `testthat` 3.x support to avoid AppVeyor failures.
### Why are the changes needed?
Currently used internal API has been removed in the latest `testthat` release.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Tests executed against `testthat == 2.3.2` and `testthat == 3.0.0`
Closes#30219 from zero323/SPARK-33313.
Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Allow to run the Spark web UI behind a reverse proxy with URLs prefixed by a context root, like www.mydomain.com/spark. In particular, this allows to access multiple Spark clusters through the same virtual host, only distinguishing them by context root, like www.mydomain.com/cluster1, www.mydomain.com/cluster2, and it allows to run the Spark UI in a common cookie domain (for SSO) with other services.
### Why are the changes needed?
This PR is to take over https://github.com/apache/spark/pull/17455.
After changes, Spark allows showing customized prefix URL in all the `href` links of the HTML pages.
### Does this PR introduce _any_ user-facing change?
Yes, all the links on UI pages will contain the value of `spark.ui.reverseProxyUrl` if it is configured.
### How was this patch tested?
New HTML Unit tests in MasterSuite
Manual UI testing for master, worker and app UI with an nginx proxy
Spark config:
```
spark.ui.port 8080
spark.ui.reverseProxy=true
spark.ui.reverseProxyUrl=/path/to/spark/
```
nginx config:
```
server {
    listen 9000;
    set $SPARK_MASTER http://127.0.0.1:8080;
    # split spark UI path into prefix and local path within master UI
    location ~ ^(/path/to/spark/) {
        # strip prefix when forwarding request
        rewrite /path/to/spark(/.*) $1 break;
        #rewrite /path/to/spark/ "/" ;
        # forward to spark master UI
        proxy_pass $SPARK_MASTER;
        proxy_intercept_errors on;
        error_page 301 302 307 = handle_redirects;
    }
    location handle_redirects {
        set $saved_redirect_location '$upstream_http_location';
        proxy_pass $saved_redirect_location;
    }
}
```
Closes#29820 from gengliangwang/revertProxyURL.
Lead-authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Co-authored-by: Oliver Köth <okoeth@de.ibm.com>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
### What changes were proposed in this pull request?
As the Python evaluation consumes the parent iterator in a separate thread, it could consume more data from the parent even after the task ends and the parent is closed. Thus, we should use `ContextAwareIterator` to stop consuming after the task ends.
### Why are the changes needed?
Python/Pandas UDF right after off-heap vectorized reader could cause executor crash.
E.g.,:
```py
spark.range(0, 100000, 1, 1).write.parquet(path)
spark.conf.set("spark.sql.columnVector.offheap.enabled", True)
def f(x):
    return 0
fUdf = udf(f, LongType())
spark.read.parquet(path).select(fUdf('id')).head()
```
This is because, the Python evaluation consumes the parent iterator in a separate thread and it consumes more data from the parent even after the task ends and the parent is closed. If an off-heap column vector exists in the parent iterator, it could cause segmentation fault which crashes the executor.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Added tests, and manually.
Closes#30177 from ueshin/issues/SPARK-33277/python_pandas_udf.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Relax pyspark typing for sql str functions. These functions all pass the first argument through `_to_java_column`, such that a string or Column object is acceptable.
### Why are the changes needed?
Convenience & ensuring the typing reflects the functionality
### Does this PR introduce _any_ user-facing change?
Yes, a backwards-compatible increase in functionality. But I think typing support is unreleased, so possibly no change to released versions.
### How was this patch tested?
Not tested. I am newish to Python typing with stubs, so someone should confirm this is the correct way to fix this.
Closes#30209 from dhimmel/patch-1.
Authored-by: Daniel Himmelstein <daniel.himmelstein@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
When `spark.sql.legacy.typeCoercion.datetimeToString.enabled` is enabled, Spark will cast a date to a string when comparing a date with a string. In Spark 3, a timezone is needed when casting a date to a string, as in 72ad9dcd5d/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/Cast.scala (L309).
However, the timezone may not be set, because `CastBase.needsTimeZone` returns false for this kind of casting.
A simple way to reproduce this is
```
spark-shell --conf spark.sql.legacy.typeCoercion.datetimeToString.enabled=true
```
When we execute the following SQL,
```
select a.d1 from
(select to_date(concat('2000-01-0', id)) as d1 from range(1, 2)) a
join
(select concat('2000-01-0', id) as d2 from range(1, 2)) b
on a.d1 = b.d2
```
it will throw
```
java.util.NoSuchElementException: None.get
at scala.None$.get(Option.scala:529)
at scala.None$.get(Option.scala:527)
at org.apache.spark.sql.catalyst.expressions.TimeZoneAwareExpression.zoneId(datetimeExpressions.scala:56)
at org.apache.spark.sql.catalyst.expressions.TimeZoneAwareExpression.zoneId$(datetimeExpressions.scala:56)
at org.apache.spark.sql.catalyst.expressions.CastBase.zoneId$lzycompute(Cast.scala:253)
at org.apache.spark.sql.catalyst.expressions.CastBase.zoneId(Cast.scala:253)
at org.apache.spark.sql.catalyst.expressions.CastBase.dateFormatter$lzycompute(Cast.scala:287)
at org.apache.spark.sql.catalyst.expressions.CastBase.dateFormatter(Cast.scala:287)
```
### Why are the changes needed?
As described above, it's a bug here.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Add more UT
Closes#30213 from WangGuangxin/SPARK-33306.
Authored-by: wangguangxin.cn <wangguangxin.cn@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Refactor `WriteToDataSourceV2Exec` via removing code duplication around write to table logic:
- renamed `AtomicTableWriteExec` to `TableWriteExec` so that the table write logic in this trait can be modified and shared with `CreateTableAsSelectExec`, `ReplaceTableAsSelectExec`, `AtomicCreateTableAsSelectExec ` and `AtomicReplaceTableAsSelectExec`.
- similar to the above, renamed `writeToStagedTable` to `writeToTable` in `TableWriteExec`.
- extended `writeToTable` so that it can handle both staged table as well as non-staged table.
### Why are the changes needed?
Simplify the logic and remove duplication, to make this piece of code easier to maintain.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Pass CIs with the existing test coverage.
Closes#30193 from sunchao/SPARK-33293.
Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>