ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
chenjuanni	2036a8cca7	[SPARK-29488][WEBUI] In Web UI, stage page has js error when sort table ### What changes were proposed in this pull request? In the Web UI, the stage page throws a JS error when a table is sorted. https://issues.apache.org/jira/browse/SPARK-29488 ### Why are the changes needed? In the Web UI, the steps below produce the JS error "Uncaught TypeError: Failed to execute 'removeChild' on 'Node': parameter 1 is not of type 'Node'.". 1) Click the "Min" table header of "Summary Metrics..." 2) Click the "Task Time" table header of "Aggregated Metrics by Executor" 3) Click the "Min" table header of "Summary Metrics..." (the same as step 1) ### Does this PR introduce any user-facing change? No. ### How was this patch tested? In the Web UI, the steps below no longer produce an error. 1) Click the "Min" table header of "Summary Metrics..." 2) Click the "Task Time" table header of "Aggregated Metrics by Executor" 3) Click the "Min" table header of "Summary Metrics..." (the same as step 1) ![image](https://user-images.githubusercontent.com/7802338/66899878-464b1b80-f02e-11e9-9660-6cdaab283491.png) Closes #26136 from cjn082030/SPARK-1. Authored-by: chenjuanni <chenjuanni@inspur.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-10-22 08:58:12 -05:00
Dilip Biswal	c1c64851ed	[SPARK-28793][DOC][SQL] Document CREATE FUNCTION in SQL Reference ### What changes were proposed in this pull request? Document the CREATE FUNCTION statement in the SQL Reference Guide. ### Why are the changes needed? Spark currently lacks documentation on the supported SQL constructs, which confuses users, who sometimes have to read the code to understand the usage. This PR is aimed at addressing this issue. ### Does this PR introduce any user-facing change? Yes. Before: There was no documentation for this. After: <img width="1260" alt="Screen Shot 2019-09-22 at 3 01 52 PM" src="https://user-images.githubusercontent.com/14225158/65395036-5bdc6680-dd4a-11e9-9873-0a1da88706a8.png"> <img width="1260" alt="Screen Shot 2019-09-22 at 3 02 11 PM" src="https://user-images.githubusercontent.com/14225158/65395037-5bdc6680-dd4a-11e9-964f-c02d23803b68.png"> <img width="1260" alt="Screen Shot 2019-09-22 at 3 02 39 PM" src="https://user-images.githubusercontent.com/14225158/65395038-5bdc6680-dd4a-11e9-831b-6ba1d041893d.png"> <img width="1260" alt="Screen Shot 2019-09-22 at 3 04 04 PM" src="https://user-images.githubusercontent.com/14225158/65395040-5bdc6680-dd4a-11e9-8226-250f77dfeaf3.png"> ### How was this patch tested? Tested using jekyll build --serve Closes #25894 from dilipbiswal/sql-ref-create-function. Authored-by: Dilip Biswal <dkbiswal@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-10-22 08:56:44 -05:00
Huaxin Gao	877993847c	[SPARK-28787][DOC][SQL] Document LOAD DATA statement in SQL Reference ### What changes were proposed in this pull request? Document the LOAD DATA statement in the SQL Reference. ### Why are the changes needed? To complete the SQL Reference. ### Does this PR introduce any user-facing change? Yes ### How was this patch tested? Tested using jekyll build --serve Here are the screenshots: ![image](https://user-images.githubusercontent.com/13592258/64073167-e7cd0800-cc4e-11e9-9fcc-92fe4cb5a942.png) ![image](https://user-images.githubusercontent.com/13592258/64073169-ee5b7f80-cc4e-11e9-9a36-cc023bcd32b1.png) ![image](https://user-images.githubusercontent.com/13592258/64073170-f4516080-cc4e-11e9-9101-2609a01fe6fe.png) Closes #25522 from huaxingao/spark-28787. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-10-22 08:55:37 -05:00
Liang-Chi Hsieh	b4844eea1f	[SPARK-29517][SQL] TRUNCATE TABLE should look up catalog/table like v2 commands ### What changes were proposed in this pull request? Add TruncateTableStatement and make TRUNCATE TABLE go through the same catalog/table resolution framework as v2 commands. ### Why are the changes needed? It's important that all commands have the same table resolution behavior, to avoid confusing end users. e.g. ``` USE my_catalog DESC t // success and describe the table t from my_catalog TRUNCATE TABLE t // report table not found as there is no table t in the session catalog ``` ### Does this PR introduce any user-facing change? Yes. When running TRUNCATE TABLE, Spark fails the command if the current catalog is set to a v2 catalog, or if the table name specifies a v2 catalog. ### How was this patch tested? Unit tests. Closes #26174 from viirya/SPARK-29517. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-10-22 19:17:28 +08:00
Yuanjian Li	bb49c80c89	[SPARK-21492][SQL] Fix memory leak in SortMergeJoin ### What changes were proposed in this pull request? We need a new mechanism by which downstream operators can notify their parents that they may release the output data stream. In this PR, we implement the mechanism as follows: - Add a function named `cleanupResources` to SparkPlan, which by default calls its children's `cleanupResources` function. An operator that needs resource cleanup should override it with its own cleanup and also call `super.cleanupResources`, like SortExec in this PR. - Add support on the trigger side, which in this PR is SortMergeJoinExec: it makes sure to call `cleanupResources` to do the cleanup for all of its upstream (children) operators. ### Why are the changes needed? Bug fix for the SortMergeJoin memory leak, and a general framework for SparkPlan resource cleanup. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? UT: Added a new test suite, JoinWithResourceCleanSuite, to check both the standard and code generation scenarios. Integration test: Tested with the driver/executor default memory set to 1g, in local mode with 10 threads. The test below (thanks taosaildrone for providing this test [here](https://github.com/apache/spark/pull/23762#issuecomment-463303175)) passes with this PR. ``` from pyspark.sql.functions import rand, col spark.conf.set("spark.sql.join.preferSortMergeJoin", "true") spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1) # spark.conf.set("spark.sql.sortMergeJoinExec.eagerCleanupResources", "true") r1 = spark.range(1, 1001).select(col("id").alias("timestamp1")) r1 = r1.withColumn('value', rand()) r2 = spark.range(1000, 1001).select(col("id").alias("timestamp2")) r2 = r2.withColumn('value2', rand()) joined = r1.join(r2, r1.timestamp1 == r2.timestamp2, "inner") joined = joined.coalesce(1) joined.explain() joined.show() ``` Closes #26164 from xuanyuanking/SPARK-21492. Authored-by: Yuanjian Li <xyliyuanjian@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-10-22 19:08:09 +08:00
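The cleanup mechanism described in the SPARK-21492 message can be pictured with a minimal sketch. This is not Spark's code: the classes `PlanNode`, `Scan`, `Sort` and `SortMergeJoin` and the method `finish` are hypothetical Python stand-ins, chosen only to show the pattern of a default method that forwards to children, an override that releases its own state and then calls `super`, and a trigger on the join side.

```python
class PlanNode:
    """Hypothetical stand-in for a physical plan operator."""

    def __init__(self, *children):
        self.children = list(children)

    def cleanup_resources(self):
        # Default behaviour: only forward the request to the children.
        for child in self.children:
            child.cleanup_resources()


class Scan(PlanNode):
    """Leaf operator with nothing to release."""


class Sort(PlanNode):
    def __init__(self, child):
        super().__init__(child)
        self.buffer = [3, 1, 2]  # pretend this is a large sort buffer

    def cleanup_resources(self):
        # Release this operator's own state, then let the default
        # implementation propagate the call further upstream.
        self.buffer = None
        super().cleanup_resources()


class SortMergeJoin(PlanNode):
    def finish(self):
        # Trigger side: once the join knows it needs no more input,
        # it asks every upstream operator to release what it holds.
        self.cleanup_resources()


left, right = Sort(Scan()), Sort(Scan())
join = SortMergeJoin(left, right)
join.finish()
print(left.buffer, right.buffer)
```

Keeping the forwarding logic in the base class means only operators that actually hold resources need to override anything.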
Yuming Wang	3163b6b43b	[SPARK-29516][SQL][TEST] Test ThriftServerQueryTestSuite asynchronously ### What changes were proposed in this pull request? This PR tests `ThriftServerQueryTestSuite` asynchronously. ### Why are the changes needed? The default value of `spark.sql.hive.thriftServer.async` is `true`. ### Does this PR introduce any user-facing change? No ### How was this patch tested? ``` build/sbt "hive-thriftserver/test-only *.ThriftServerQueryTestSuite" -Phive-thriftserver build/mvn -Dtest=none -DwildcardSuites=org.apache.spark.sql.hive.thriftserver.ThriftServerQueryTestSuite test -Phive-thriftserver ``` Closes #26172 from wangyum/SPARK-29516. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Yuming Wang <wgyumg@gmail.com>	2019-10-22 03:20:49 -07:00
Huaxin Gao	868d851dac	[SPARK-29232][ML] Update the parameter maps of the DecisionTreeRegression/Classification Models ### What changes were proposed in this pull request? The trees (Array[```DecisionTreeRegressionModel```]) in ```RandomForestRegressionModel``` contain only the default parameter values. The parameter maps of these trees need to be updated. The same issue exists in ```RandomForestClassifier```, ```GBTClassifier``` and ```GBTRegressor```. ### Why are the changes needed? Users want to access each individual tree and build the trees back up for the random forest estimator. This doesn't work because the trees don't have the correct parameter values. ### Does this PR introduce any user-facing change? Yes. Now the trees in ```RandomForestRegressionModel```, ```RandomForestClassifier```, ```GBTClassifier``` and ```GBTRegressor``` have the correct parameter values. ### How was this patch tested? Added tests. Closes #26154 from huaxingao/spark-29232. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: zhengruifeng <ruifengz@foxmail.com>	2019-10-22 17:49:44 +08:00
HyukjinKwon	811d563fbf	[SPARK-29536][PYTHON] Upgrade cloudpickle to 1.1.1 to support Python 3.8 ### What changes were proposed in this pull request? Upgrade the cloudpickle copy inlined in PySpark to cloudpickle 1.1.1. See https://github.com/cloudpipe/cloudpickle/blob/v1.1.1/cloudpickle/cloudpickle.py https://github.com/cloudpipe/cloudpickle/pull/269 was added for Python 3.8 support (fixed from 1.1.0). Using 1.2.2 seems to break PyPy 2 due to cloudpipe/cloudpickle#278, so this PR currently uses 1.1.1. Once we drop Python 2, we can switch to the highest version. ### Why are the changes needed? Positional-only arguments were newly introduced in Python 3.8 (see https://docs.python.org/3/whatsnew/3.8.html#positional-only-parameters). In particular, the argument newly added to `types.CodeType` was the problem (https://docs.python.org/3/whatsnew/3.8.html#changes-in-the-python-api): > `types.CodeType` has a new parameter in the second position of the constructor (posonlyargcount) to support positional-only arguments defined in PEP 570. The first argument (argcount) now represents the total number of positional arguments (including positional-only arguments). The new `replace()` method of `types.CodeType` can be used to make the code future-proof. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Manually tested. Note that the optional dependency PyArrow does not yet appear to support Python 3.8; therefore, it was not tested. See "Details" below. <details> <p> ```bash cd python ./run-tests --python-executables=python3.8 ``` ``` Running PySpark tests. 
Output is in /Users/hyukjin.kwon/workspace/forked/spark/python/unit-tests.log Will test against the following Python executables: ['python3.8'] Will test the following Python modules: ['pyspark-core', 'pyspark-ml', 'pyspark-mllib', 'pyspark-sql', 'pyspark-streaming'] Starting test(python3.8): pyspark.ml.tests.test_algorithms Starting test(python3.8): pyspark.ml.tests.test_feature Starting test(python3.8): pyspark.ml.tests.test_base Starting test(python3.8): pyspark.ml.tests.test_evaluation Finished test(python3.8): pyspark.ml.tests.test_base (12s) Starting test(python3.8): pyspark.ml.tests.test_image Finished test(python3.8): pyspark.ml.tests.test_evaluation (14s) Starting test(python3.8): pyspark.ml.tests.test_linalg Finished test(python3.8): pyspark.ml.tests.test_feature (23s) Starting test(python3.8): pyspark.ml.tests.test_param Finished test(python3.8): pyspark.ml.tests.test_image (22s) Starting test(python3.8): pyspark.ml.tests.test_persistence Finished test(python3.8): pyspark.ml.tests.test_param (25s) Starting test(python3.8): pyspark.ml.tests.test_pipeline Finished test(python3.8): pyspark.ml.tests.test_linalg (37s) Starting test(python3.8): pyspark.ml.tests.test_stat Finished test(python3.8): pyspark.ml.tests.test_pipeline (7s) Starting test(python3.8): pyspark.ml.tests.test_training_summary Finished test(python3.8): pyspark.ml.tests.test_stat (21s) Starting test(python3.8): pyspark.ml.tests.test_tuning Finished test(python3.8): pyspark.ml.tests.test_persistence (45s) Starting test(python3.8): pyspark.ml.tests.test_wrapper Finished test(python3.8): pyspark.ml.tests.test_algorithms (83s) Starting test(python3.8): pyspark.mllib.tests.test_algorithms Finished test(python3.8): pyspark.ml.tests.test_training_summary (32s) Starting test(python3.8): pyspark.mllib.tests.test_feature Finished test(python3.8): pyspark.ml.tests.test_wrapper (20s) Starting test(python3.8): pyspark.mllib.tests.test_linalg Finished test(python3.8): pyspark.mllib.tests.test_feature (32s) 
Starting test(python3.8): pyspark.mllib.tests.test_stat Finished test(python3.8): pyspark.mllib.tests.test_algorithms (70s) Starting test(python3.8): pyspark.mllib.tests.test_streaming_algorithms Finished test(python3.8): pyspark.mllib.tests.test_stat (37s) Starting test(python3.8): pyspark.mllib.tests.test_util Finished test(python3.8): pyspark.mllib.tests.test_linalg (70s) Starting test(python3.8): pyspark.sql.tests.test_arrow Finished test(python3.8): pyspark.sql.tests.test_arrow (1s) ... 53 tests were skipped Starting test(python3.8): pyspark.sql.tests.test_catalog Finished test(python3.8): pyspark.mllib.tests.test_util (15s) Starting test(python3.8): pyspark.sql.tests.test_column Finished test(python3.8): pyspark.sql.tests.test_catalog (24s) Starting test(python3.8): pyspark.sql.tests.test_conf Finished test(python3.8): pyspark.sql.tests.test_column (21s) Starting test(python3.8): pyspark.sql.tests.test_context Finished test(python3.8): pyspark.ml.tests.test_tuning (125s) Starting test(python3.8): pyspark.sql.tests.test_dataframe Finished test(python3.8): pyspark.sql.tests.test_conf (9s) Starting test(python3.8): pyspark.sql.tests.test_datasources Finished test(python3.8): pyspark.sql.tests.test_context (29s) Starting test(python3.8): pyspark.sql.tests.test_functions Finished test(python3.8): pyspark.sql.tests.test_datasources (32s) Starting test(python3.8): pyspark.sql.tests.test_group Finished test(python3.8): pyspark.sql.tests.test_dataframe (39s) ... 3 tests were skipped Starting test(python3.8): pyspark.sql.tests.test_pandas_udf Finished test(python3.8): pyspark.sql.tests.test_pandas_udf (1s) ... 6 tests were skipped Starting test(python3.8): pyspark.sql.tests.test_pandas_udf_cogrouped_map Finished test(python3.8): pyspark.sql.tests.test_pandas_udf_cogrouped_map (0s) ... 14 tests were skipped Starting test(python3.8): pyspark.sql.tests.test_pandas_udf_grouped_agg Finished test(python3.8): pyspark.sql.tests.test_pandas_udf_grouped_agg (1s) ... 
15 tests were skipped Starting test(python3.8): pyspark.sql.tests.test_pandas_udf_grouped_map Finished test(python3.8): pyspark.sql.tests.test_pandas_udf_grouped_map (1s) ... 20 tests were skipped Starting test(python3.8): pyspark.sql.tests.test_pandas_udf_scalar Finished test(python3.8): pyspark.sql.tests.test_pandas_udf_scalar (1s) ... 49 tests were skipped Starting test(python3.8): pyspark.sql.tests.test_pandas_udf_window Finished test(python3.8): pyspark.sql.tests.test_pandas_udf_window (1s) ... 14 tests were skipped Starting test(python3.8): pyspark.sql.tests.test_readwriter Finished test(python3.8): pyspark.sql.tests.test_functions (29s) Starting test(python3.8): pyspark.sql.tests.test_serde Finished test(python3.8): pyspark.sql.tests.test_group (20s) Starting test(python3.8): pyspark.sql.tests.test_session Finished test(python3.8): pyspark.mllib.tests.test_streaming_algorithms (126s) Starting test(python3.8): pyspark.sql.tests.test_streaming Finished test(python3.8): pyspark.sql.tests.test_serde (25s) Starting test(python3.8): pyspark.sql.tests.test_types Finished test(python3.8): pyspark.sql.tests.test_readwriter (38s) Starting test(python3.8): pyspark.sql.tests.test_udf Finished test(python3.8): pyspark.sql.tests.test_session (32s) Starting test(python3.8): pyspark.sql.tests.test_utils Finished test(python3.8): pyspark.sql.tests.test_utils (17s) Starting test(python3.8): pyspark.streaming.tests.test_context Finished test(python3.8): pyspark.sql.tests.test_types (45s) Starting test(python3.8): pyspark.streaming.tests.test_dstream Finished test(python3.8): pyspark.sql.tests.test_udf (44s) Starting test(python3.8): pyspark.streaming.tests.test_kinesis Finished test(python3.8): pyspark.streaming.tests.test_kinesis (0s) ... 
2 tests were skipped Starting test(python3.8): pyspark.streaming.tests.test_listener Finished test(python3.8): pyspark.streaming.tests.test_context (28s) Starting test(python3.8): pyspark.tests.test_appsubmit Finished test(python3.8): pyspark.sql.tests.test_streaming (60s) Starting test(python3.8): pyspark.tests.test_broadcast Finished test(python3.8): pyspark.streaming.tests.test_listener (11s) Starting test(python3.8): pyspark.tests.test_conf Finished test(python3.8): pyspark.tests.test_conf (17s) Starting test(python3.8): pyspark.tests.test_context Finished test(python3.8): pyspark.tests.test_broadcast (39s) Starting test(python3.8): pyspark.tests.test_daemon Finished test(python3.8): pyspark.tests.test_daemon (5s) Starting test(python3.8): pyspark.tests.test_join Finished test(python3.8): pyspark.tests.test_context (31s) Starting test(python3.8): pyspark.tests.test_profiler Finished test(python3.8): pyspark.tests.test_join (9s) Starting test(python3.8): pyspark.tests.test_rdd Finished test(python3.8): pyspark.tests.test_profiler (12s) Starting test(python3.8): pyspark.tests.test_readwrite Finished test(python3.8): pyspark.tests.test_readwrite (23s) ... 
3 tests were skipped Starting test(python3.8): pyspark.tests.test_serializers Finished test(python3.8): pyspark.tests.test_appsubmit (94s) Starting test(python3.8): pyspark.tests.test_shuffle Finished test(python3.8): pyspark.streaming.tests.test_dstream (110s) Starting test(python3.8): pyspark.tests.test_taskcontext Finished test(python3.8): pyspark.tests.test_rdd (42s) Starting test(python3.8): pyspark.tests.test_util Finished test(python3.8): pyspark.tests.test_serializers (11s) Starting test(python3.8): pyspark.tests.test_worker Finished test(python3.8): pyspark.tests.test_shuffle (12s) Starting test(python3.8): pyspark.accumulators Finished test(python3.8): pyspark.tests.test_util (7s) Starting test(python3.8): pyspark.broadcast Finished test(python3.8): pyspark.accumulators (8s) Starting test(python3.8): pyspark.conf Finished test(python3.8): pyspark.broadcast (8s) Starting test(python3.8): pyspark.context Finished test(python3.8): pyspark.tests.test_worker (19s) Starting test(python3.8): pyspark.ml.classification Finished test(python3.8): pyspark.conf (4s) Starting test(python3.8): pyspark.ml.clustering Finished test(python3.8): pyspark.context (22s) Starting test(python3.8): pyspark.ml.evaluation Finished test(python3.8): pyspark.tests.test_taskcontext (49s) Starting test(python3.8): pyspark.ml.feature Finished test(python3.8): pyspark.ml.clustering (43s) Starting test(python3.8): pyspark.ml.fpm Finished test(python3.8): pyspark.ml.evaluation (27s) Starting test(python3.8): pyspark.ml.image Finished test(python3.8): pyspark.ml.image (8s) Starting test(python3.8): pyspark.ml.linalg.__init__ Finished test(python3.8): pyspark.ml.linalg.__init__ (0s) Starting test(python3.8): pyspark.ml.recommendation Finished test(python3.8): pyspark.ml.classification (63s) Starting test(python3.8): pyspark.ml.regression Finished test(python3.8): pyspark.ml.fpm (23s) Starting test(python3.8): pyspark.ml.stat Finished test(python3.8): pyspark.ml.stat (30s) Starting 
test(python3.8): pyspark.ml.tuning Finished test(python3.8): pyspark.ml.regression (51s) Starting test(python3.8): pyspark.mllib.classification Finished test(python3.8): pyspark.ml.feature (93s) Starting test(python3.8): pyspark.mllib.clustering Finished test(python3.8): pyspark.ml.tuning (39s) Starting test(python3.8): pyspark.mllib.evaluation Finished test(python3.8): pyspark.mllib.classification (38s) Starting test(python3.8): pyspark.mllib.feature Finished test(python3.8): pyspark.mllib.evaluation (25s) Starting test(python3.8): pyspark.mllib.fpm Finished test(python3.8): pyspark.mllib.clustering (64s) Starting test(python3.8): pyspark.mllib.linalg.__init__ Finished test(python3.8): pyspark.ml.recommendation (131s) Starting test(python3.8): pyspark.mllib.linalg.distributed Finished test(python3.8): pyspark.mllib.linalg.__init__ (0s) Starting test(python3.8): pyspark.mllib.random Finished test(python3.8): pyspark.mllib.feature (36s) Starting test(python3.8): pyspark.mllib.recommendation Finished test(python3.8): pyspark.mllib.fpm (31s) Starting test(python3.8): pyspark.mllib.regression Finished test(python3.8): pyspark.mllib.random (16s) Starting test(python3.8): pyspark.mllib.stat.KernelDensity Finished test(python3.8): pyspark.mllib.stat.KernelDensity (1s) Starting test(python3.8): pyspark.mllib.stat._statistics Finished test(python3.8): pyspark.mllib.stat._statistics (25s) Starting test(python3.8): pyspark.mllib.tree Finished test(python3.8): pyspark.mllib.regression (44s) Starting test(python3.8): pyspark.mllib.util Finished test(python3.8): pyspark.mllib.recommendation (49s) Starting test(python3.8): pyspark.profiler Finished test(python3.8): pyspark.mllib.linalg.distributed (53s) Starting test(python3.8): pyspark.rdd Finished test(python3.8): pyspark.profiler (14s) Starting test(python3.8): pyspark.serializers Finished test(python3.8): pyspark.mllib.tree (30s) Starting test(python3.8): pyspark.shuffle Finished test(python3.8): pyspark.shuffle (2s) Starting 
test(python3.8): pyspark.sql.avro.functions Finished test(python3.8): pyspark.mllib.util (30s) Starting test(python3.8): pyspark.sql.catalog Finished test(python3.8): pyspark.serializers (17s) Starting test(python3.8): pyspark.sql.column Finished test(python3.8): pyspark.rdd (31s) Starting test(python3.8): pyspark.sql.conf Finished test(python3.8): pyspark.sql.conf (7s) Starting test(python3.8): pyspark.sql.context Finished test(python3.8): pyspark.sql.avro.functions (19s) Starting test(python3.8): pyspark.sql.dataframe Finished test(python3.8): pyspark.sql.catalog (16s) Starting test(python3.8): pyspark.sql.functions Finished test(python3.8): pyspark.sql.column (27s) Starting test(python3.8): pyspark.sql.group Finished test(python3.8): pyspark.sql.context (26s) Starting test(python3.8): pyspark.sql.readwriter Finished test(python3.8): pyspark.sql.group (52s) Starting test(python3.8): pyspark.sql.session Finished test(python3.8): pyspark.sql.dataframe (73s) Starting test(python3.8): pyspark.sql.streaming Finished test(python3.8): pyspark.sql.functions (75s) Starting test(python3.8): pyspark.sql.types Finished test(python3.8): pyspark.sql.readwriter (57s) Starting test(python3.8): pyspark.sql.udf Finished test(python3.8): pyspark.sql.types (13s) Starting test(python3.8): pyspark.sql.window Finished test(python3.8): pyspark.sql.session (32s) Starting test(python3.8): pyspark.streaming.util Finished test(python3.8): pyspark.streaming.util (1s) Starting test(python3.8): pyspark.util Finished test(python3.8): pyspark.util (0s) Finished test(python3.8): pyspark.sql.streaming (30s) Finished test(python3.8): pyspark.sql.udf (27s) Finished test(python3.8): pyspark.sql.window (22s) Tests passed in 855 seconds ``` </p> </details> Closes #26194 from HyukjinKwon/SPARK-29536. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-10-22 16:18:34 +09:00
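The `replace()` method of `types.CodeType` quoted in the SPARK-29536 message can be illustrated with a small sketch. It assumes Python 3.8 or later. The function `add` and the new name `plus` are made up for illustration, and this is not the cloudpickle change itself. The point is that `replace()` copies a code object while overriding named fields, so the caller never has to spell out the positional constructor whose signature changed in 3.8.

```python
import types


def add(a, b, /):
    """Both parameters are positional-only (PEP 570 syntax)."""
    return a + b


code = add.__code__

# co_argcount counts all positional parameters, including the
# positional-only ones reported by co_posonlyargcount.
print(code.co_argcount, code.co_posonlyargcount)

# Copy the code object, changing only its name; every other field,
# including the positional-only count, is carried over unchanged.
renamed = code.replace(co_name="plus")

# Wrap the copied code object in a new function and call it.
plus = types.FunctionType(renamed, globals(), "plus")
print(renamed.co_name, plus(1, 2))
```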
denglingang	467c3f610f	[SPARK-29529][DOCS] Remove unnecessary orc version and hive version in doc ### What changes were proposed in this pull request? This PR removes the unnecessary ORC version and Hive version from the docs. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? N/A. Closes #26146 from denglingang/SPARK-24576. Lead-authored-by: denglingang <chitin1027@gmail.com> Co-authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-10-22 14:49:23 +09:00
angerszhu	484f93e255	[SPARK-29530][SQL] Make SQLConf in SQL parse process thread safe ### What changes were proposed in this pull request? As I commented in [SPARK-29516](https://github.com/apache/spark/pull/26172#issuecomment-544364977), the parsing done by the SparkSession.sql() method does not run under the current SparkSession's conf, so some parser-related configuration does not take effect in multi-threaded use. In this PR, we add a SQLConf parameter to AbstractSqlParser and initialize it with the SessionState's conf. Each SparkSession's parser then uses its own SessionState's SQLConf, which makes parsing thread safe. ### Why are the changes needed? Bug fix. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? No. Closes #26187 from AngersZhuuuu/SPARK-29530. Authored-by: angerszhu <angers.zhu@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-10-22 10:38:06 +08:00
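The idea in the SPARK-29530 message, giving each parser the conf of its own session instead of looking one up at parse time, can be sketched in a few lines. Everything here is hypothetical: `Parser`, `Session` and the `case_sensitive` setting are illustration-only names, not Spark's API.

```python
import threading


class Parser:
    """Hypothetical parser that receives its configuration explicitly."""

    def __init__(self, conf):
        self.conf = conf

    def parse(self, text):
        # The behaviour depends only on the conf handed to this parser,
        # never on state shared between threads or sessions.
        return text if self.conf.get("case_sensitive") else text.lower()


class Session:
    """Hypothetical session that owns a conf and a parser bound to it."""

    def __init__(self, **settings):
        self.conf = dict(settings)
        self.parser = Parser(self.conf)

    def sql(self, text):
        return self.parser.parse(text)


sessions = {
    "strict": Session(case_sensitive=True),
    "lenient": Session(case_sensitive=False),
}
results = {}


def run(name, session):
    results[name] = session.sql("SELECT A")


threads = [threading.Thread(target=run, args=item) for item in sessions.items()]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(results)
```

Because each parser holds a reference to its own session's conf, two sessions used from different threads cannot observe each other's settings.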
wuyi	3d567a357c	[MINOR][SQL] Avoid unnecessary invocation on checkAndGlobPathIfNecessary ### What changes were proposed in this pull request? Only invoke `checkAndGlobPathIfNecessary()` when we have to use `InMemoryFileIndex`. ### Why are the changes needed? Avoid unnecessary function invocation. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Pass Jenkins. Closes #26196 from Ngone51/dev-avoid-unnecessary-invocation-on-globpath. Authored-by: wuyi <ngone_5451@163.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-10-21 21:10:21 -05:00
DylanGuedes	bb4400c23a	[SPARK-29108][SQL][TESTS] Port window.sql (Part 2) ### What changes were proposed in this pull request? This PR ports lines 320~562 of window.sql from the PostgreSQL regression tests: https://github.com/postgres/postgres/blob/REL_12_STABLE/src/test/regress/sql/window.sql The expected results can be found at: https://github.com/postgres/postgres/blob/REL_12_STABLE/src/test/regress/expected/window.out ## How was this patch tested? Pass the Jenkins. ### Why are the changes needed? To ensure compatibility with PGSQL ### Does this PR introduce any user-facing change? No ### How was this patch tested? Comparison with PgSQL results. Closes #26121 from DylanGuedes/spark-29108. Authored-by: DylanGuedes <djmgguedes@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-10-22 10:49:40 +09:00
Maxim Gekk	eef11ba9ef	[SPARK-29518][SQL][TEST] Benchmark `date_part` for `INTERVAL` ### What changes were proposed in this pull request? I extended `ExtractBenchmark` to support the `INTERVAL` type of the `source` parameter of the `date_part` function. ### Why are the changes needed? - To detect performance issues when the implementation of the `date_part` function changes in the future. - To find current performance bottlenecks in `date_part` for the `INTERVAL` type ### Does this PR introduce any user-facing change? No ### How was this patch tested? By running the benchmark and printing the values produced for each `field` value. Closes #26175 from MaxGekk/extract-interval-benchmark. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-10-22 10:47:54 +09:00
Maxim Gekk	6ffec5e6a6	[SPARK-29533][SQL][TEST] Benchmark casting strings to intervals ### What changes were proposed in this pull request? Added a new benchmark, `IntervalBenchmark`, to measure the performance of interval-related functions. In this PR, I added benchmarks for casting strings to intervals, in particular interval strings with and without the `interval` prefix, because there is special code for this: `da576a737c/common/unsafe/src/main/java/org/apache/spark/unsafe/types/CalendarInterval.java (L100-L103)` . I also added benchmarks for different numbers of units in interval strings; for example, 1 unit is `interval 10 years`, 2 units w/o interval is `10 years 5 months`, and so on. ### Why are the changes needed? - To find current performance issues in casting to intervals - The benchmark can be used while refactoring/re-implementing `CalendarInterval.fromString()` or `CalendarInterval.fromCaseInsensitiveString()`. ### Does this PR introduce any user-facing change? No ### How was this patch tested? By running the benchmark via the command: ```shell SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.execution.benchmark.IntervalBenchmark" ``` Closes #26189 from MaxGekk/interval-from-string-benchmark. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-10-22 10:47:04 +09:00
fuwhu	31a5dea48f	[SPARK-29531][SQL][TEST] refine ThriftServerQueryTestSuite.blackList to reuse black list in SQLQueryTestSuite ### What changes were proposed in this pull request? This PR refines the code in ThriftServerQueryTestSuite.blackList to reuse the black list of SQLQueryTestSuite instead of duplicating all test cases from SQLQueryTestSuite.blackList. ### Why are the changes needed? To reduce code duplication. ### Does this PR introduce any user-facing change? No ### How was this patch tested? N/A Closes #26188 from fuwhu/SPARK-TBD. Authored-by: fuwhu <bestwwg@163.com> Signed-off-by: Yuming Wang <wgyumg@gmail.com>	2019-10-21 05:19:27 -07:00
Dongjoon Hyun	5fc363b307	[SPARK-29528][BUILD][TEST-MAVEN] Upgrade scala-maven-plugin to 4.2.4 for Scala 2.13.1 ### What changes were proposed in this pull request? This PR upgrades `scala-maven-plugin` to `4.2.4` for Scala `2.13.1`. ### Why are the changes needed? Scala 2.13.1 seems to break binary compatibility. We need to upgrade `scala-maven-plugin` to bring in the following fixes for the latest Scala 2.13.1. - https://github.com/davidB/scala-maven-plugin/issues/363 - https://github.com/sbt/zinc/issues/698 ### Does this PR introduce any user-facing change? No. ### How was this patch tested? For now, we don't support Scala-2.13. This PR at least needs to pass the existing Jenkins with Maven to get prepared for Scala-2.13. Closes #26185 from dongjoon-hyun/SPARK-29528. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-10-21 19:05:27 +09:00
Yuming Wang	e99a9f78ea	[SPARK-29498][SQL] CatalogTable to HiveTable should not change the table's ownership ### What changes were proposed in this pull request? Converting `CatalogTable` to `HiveTable` changes the table's ownership. How to reproduce: ```scala import org.apache.spark.sql.catalyst.TableIdentifier import org.apache.spark.sql.catalyst.catalog.{CatalogStorageFormat, CatalogTable, CatalogTableType} import org.apache.spark.sql.types.{LongType, StructType} val identifier = TableIdentifier("spark_29498", None) val owner = "SPARK-29498" val newTable = CatalogTable( identifier, tableType = CatalogTableType.EXTERNAL, storage = CatalogStorageFormat( locationUri = None, inputFormat = None, outputFormat = None, serde = None, compressed = false, properties = Map.empty), owner = owner, schema = new StructType().add("i", LongType, false), provider = Some("hive")) spark.sessionState.catalog.createTable(newTable, false) // The owner is not SPARK-29498 println(spark.sessionState.catalog.getTableMetadata(identifier).owner) ``` This PR sets the `HiveTable`'s owner to the `CatalogTable`'s owner, if that owner is not empty, when converting `CatalogTable` to `HiveTable`. ### Why are the changes needed? We should not change the ownership of the table when converting `CatalogTable` to `HiveTable`. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Unit test. Closes #26160 from wangyum/SPARK-29498. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-10-21 15:53:36 +08:00
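The rule stated in the SPARK-29498 message (keep the catalog table's owner when it is not empty) can be shown in a tiny sketch. The dataclasses and the `to_hive_table` function below are hypothetical Python stand-ins, not Spark's Scala classes. Falling back to the current user when the owner is empty is an assumption made for illustration.

```python
from dataclasses import dataclass


@dataclass
class CatalogTable:
    """Hypothetical, heavily reduced stand-in for the catalog-side table."""
    name: str
    owner: str = ""


@dataclass
class HiveTable:
    """Hypothetical, heavily reduced stand-in for the Hive-side table."""
    name: str
    owner: str


def to_hive_table(table: CatalogTable, current_user: str) -> HiveTable:
    # Keep the owner recorded on the catalog table when there is one;
    # fall back to the current user only if the owner is empty (assumed).
    owner = table.owner if table.owner else current_user
    return HiveTable(name=table.name, owner=owner)


print(to_hive_table(CatalogTable("spark_29498", "SPARK-29498"), "alice").owner)
print(to_hive_table(CatalogTable("no_owner"), "alice").owner)
```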
Kent Yao	5b4d9170ed	[SPARK-27879][SQL] Add support for bit_and and bit_or aggregates ### What changes were proposed in this pull request? ``` bit_and(expression) -- The bitwise AND of all non-null input values, or null if none bit_or(expression) -- The bitwise OR of all non-null input values, or null if none ``` More details: https://www.postgresql.org/docs/9.3/functions-aggregate.html ### Why are the changes needed? PostgreSQL, MySQL and many other popular databases support them. ### Does this PR introduce any user-facing change? Yes, it adds two bitwise aggregate functions. ### How was this patch tested? Added unit tests. Closes #26155 from yaooqinn/SPARK-27879. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-10-21 14:32:31 +08:00
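The semantics given for `bit_and` and `bit_or` in the SPARK-27879 message (combine all non-null inputs, return null if there are none) can be written out as a plain-Python sketch. These helper functions only mirror the described behaviour over a list, with `None` playing the role of SQL null. They are not the Spark implementation.

```python
from functools import reduce
from operator import and_, or_
from typing import Iterable, Optional


def bit_and(values: Iterable[Optional[int]]) -> Optional[int]:
    """Bitwise AND of all non-null inputs, or None if there are none."""
    non_null = [v for v in values if v is not None]
    return reduce(and_, non_null) if non_null else None


def bit_or(values: Iterable[Optional[int]]) -> Optional[int]:
    """Bitwise OR of all non-null inputs, or None if there are none."""
    non_null = [v for v in values if v is not None]
    return reduce(or_, non_null) if non_null else None


print(bit_and([3, 5, None]))  # 0b011 & 0b101 = 1
print(bit_or([3, 5, None]))   # 0b011 | 0b101 = 7
print(bit_and([None, None]))  # None
```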
DB Tsai	f4d5aa4213	[SPARK-29434][CORE] Improve the MapStatuses Serialization Performance ### What changes were proposed in this pull request? Compared with GZIP for compressing the serialized `MapStatuses`, ZStd provides a better compression ratio and faster compression time. The original approach serializes and writes data directly into a `GZIPOutputStream` in one step; however, compression is faster if a bigger chunk of the data is processed by the codec at once. As a result, in this PR, the serialized data is written into an uncompressed byte array first, and then the data is compressed. For smaller `MapStatuses`, we find it's 2x faster. Here is the benchmark result. #### 20k map outputs, and each has 500 blocks 1. ZStd two steps in this PR: 0.402 ops/ms, 89,066 bytes 2. ZStd one step as the original approach: 0.370 ops/ms, 89,069 bytes 3. GZip: 0.092 ops/ms, 217,345 bytes #### 20k map outputs, and each has 5 blocks 1. ZStd two steps in this PR: 0.9 ops/ms, 75,449 bytes 2. ZStd one step as the original approach: 0.38 ops/ms, 75,452 bytes 3. GZip: 0.21 ops/ms, 160,094 bytes ### Why are the changes needed? To decrease the time spent serializing the `MapStatuses` in large-scale jobs. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Existing tests. Closes #26085 from dbtsai/mapStatus. Lead-authored-by: DB Tsai <d_tsai@apple.com> Co-authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-10-20 13:56:23 -07:00
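The difference between the one-step and two-step approaches in the SPARK-29434 message can be sketched as follows. This is an illustration under assumptions, not the Spark code. Python's standard `zlib` stands in for ZStd, `pickle` stands in for Spark's serializer, and the record layout is made up. The sketch shows only the structure of the two approaches and makes no claim about their relative speed.

```python
import io
import pickle
import zlib


def serialize_one_step(records):
    """Feed each serialized record to the codec as soon as it is produced."""
    compressor = zlib.compressobj()
    out = bytearray()
    for record in records:
        out += compressor.compress(pickle.dumps(record))
    out += compressor.flush()
    return bytes(out)


def serialize_two_steps(records):
    """Serialize everything into a plain buffer, then compress it at once."""
    buffer = io.BytesIO()
    for record in records:
        buffer.write(pickle.dumps(record))
    return zlib.compress(buffer.getvalue())


# Made-up records: (map id, per-block sizes).
records = [(i, [i % 7] * 5) for i in range(1000)]

one_step = serialize_one_step(records)
two_steps = serialize_two_steps(records)

# Both approaches encode the same serialized bytes.
print(zlib.decompress(one_step) == zlib.decompress(two_steps))
```

The two-step variant hands the codec a single large input, which is the property the commit message credits for the speed-up.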
Yuming Wang	0f65b49f55	[SPARK-29525][SQL][TEST] Fix the associated location already exists in SQLQueryTestSuite ### What changes were proposed in this pull request? This PR fixes the "associated location already exists" error in `SQLQueryTestSuite`: ``` build/sbt "~sql/test-only SQLQueryTestSuite -- -z postgreSQL/join.sql" ... [info] - postgreSQL/join.sql FAILED * (35 seconds, 420 milliseconds) [info] postgreSQL/join.sql [info] Expected "[]", but got "[org.apache.spark.sql.AnalysisException [info] Can not create the managed table('`default`.`tt3`'). The associated location('file:/root/spark/sql/core/spark-warehouse/org.apache.spark.sql.SQLQueryTestSuite/tt3') already exists.;]" Result did not match for query #108 ``` ### Why are the changes needed? Fix bug. ### Does this PR introduce any user-facing change? No ### How was this patch tested? N/A Closes #26181 from wangyum/TestError. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-10-20 13:31:59 -07:00
shahid	4a6005c795	[SPARK-29235][ML][PYSPARK] Support avgMetrics in read/write of CrossValidatorModel ### What changes were proposed in this pull request? Currently pyspark doesn't write/read `avgMetrics` in `CrossValidatorModel`, whereas scala supports it. ### Why are the changes needed? Test step to reproduce it: ``` dataset = spark.createDataFrame([(Vectors.dense([0.0]), 0.0), (Vectors.dense([0.4]), 1.0), (Vectors.dense([0.5]), 0.0), (Vectors.dense([0.6]), 1.0), (Vectors.dense([1.0]), 1.0)] * 10, ["features", "label"]) lr = LogisticRegression() grid = ParamGridBuilder().addGrid(lr.maxIter, [0, 1]).build() evaluator = BinaryClassificationEvaluator() cv = CrossValidator(estimator=lr, estimatorParamMaps=grid, evaluator=evaluator,parallelism=2) cvModel = cv.fit(dataset) cvModel.write().save("/tmp/model") cvModel2 = CrossValidatorModel.read().load("/tmp/model") print(cvModel.avgMetrics) # prints non empty result as expected print(cvModel2.avgMetrics) # Bug: prints an empty result. ``` ### Does this PR introduce any user-facing change? No ### How was this patch tested? Manually tested Before patch: ``` >>> cvModel.write().save("/tmp/model_0") >>> cvModel2 = CrossValidatorModel.read().load("/tmp/model_0") >>> print(cvModel2.avgMetrics) [] ``` After patch: ``` >>> cvModel2 = CrossValidatorModel.read().load("/tmp/model_2") >>> print(cvModel2.avgMetrics[0]) 0.5 ``` Closes #26038 from shahidki31/avgMetrics. Authored-by: shahid <shahidki31@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-10-19 15:23:57 -05:00
Terry Kim	ab92e1715e	[SPARK-29512][SQL] REPAIR TABLE should look up catalog/table like v2 commands ### What changes were proposed in this pull request? Add RepairTableStatement and make REPAIR TABLE go through the same catalog/table resolution framework of v2 commands. ### Why are the changes needed? It's important to make all the commands have the same table resolution behavior, to avoid confusing end-users. e.g. ``` USE my_catalog DESC t // success and describe the table t from my_catalog MSCK REPAIR TABLE t // report table not found as there is no table t in the session catalog ``` ### Does this PR introduce any user-facing change? yes. When running MSCK REPAIR TABLE, Spark fails the command if the current catalog is set to a v2 catalog, or the table name specified a v2 catalog. ### How was this patch tested? New unit tests Closes #26168 from imback82/repair_table. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Liang-Chi Hsieh <liangchi@uber.com>	2019-10-18 22:43:58 -07:00
Wenchen Fan	2437878299	[SPARK-29502][SQL] typed interval expression should fail for invalid format ### What changes were proposed in this pull request? This is a followup of https://github.com/apache/spark/pull/25241 . The typed interval expression should fail for invalid format. ### Why are the changes needed? To be consistent with the typed timestamp/date expression ### Does this PR introduce any user-facing change? Yes. But this feature is not released yet. ### How was this patch tested? updated test Closes #26151 from cloud-fan/bug. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Yuming Wang <wgyumg@gmail.com>	2019-10-18 16:12:03 -07:00
Dongjoon Hyun	e4b4a35de2	[SPARK-29466][WEBUI] Show `Duration` for running drivers in Standalone master web UI ### What changes were proposed in this pull request? This PR aims to add a new column `Duration` for running drivers in Apache Spark `Standalone` master web UI in order to improve UX. This helps users in the same way as the other `Duration` columns in the `Running` and `Completed` application tables. ### Why are the changes needed? When we use `--supervise`, the drivers can survive longer. Technically, the `Duration` column is not the same. (Please see the image below.) ### Does this PR introduce any user-facing change? Yes. The column in the red box is newly added. <img width="1312" alt="Screen Shot 2019-10-14 at 12 53 43 PM" src="https://user-images.githubusercontent.com/9700541/66779127-50301b80-ee82-11e9-853f-72222cd011ac.png"> ### How was this patch tested? Manual since this is a UI column. After starting standalone cluster and jobs, kill the `DriverWrapper` and see the UI. ``` $ sbin/start-master.sh $ sbin/start-slave.sh spark://$(hostname):7077 $ bin/spark-submit --master spark://$(hostname):7077 --deploy-mode cluster --supervise --class org.apache.spark.examples.JavaSparkPi examples/target/scala-2.12/jars/spark-examples_2.12-3.0.0-SNAPSHOT.jar 1000 $ jps 41521 DriverWrapper ... $ kill -9 41521 // kill the `DriverWrapper`. ``` Closes #26113 from dongjoon-hyun/SPARK-29466. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-10-18 15:39:44 -07:00
Rahul Mahadev	4cfce3e5d0	[SPARK-29494][SQL] Fix for ArrayOutofBoundsException while converting string to timestamp ### What changes were proposed in this pull request? * Adding an additional check in `stringToTimestamp` to handle cases where the input has trailing ':' * Added a test to make sure this works. ### Why are the changes needed? In a couple of scenarios while converting from String to Timestamp `DateTimeUtils.stringToTimestamp` throws an array out of bounds exception if there is trailing ':'. The behavior of this method requires it to return `None` in case the format of the string is incorrect. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Added a test in the `DateTimeTestUtils` suite to test if my fix works. Closes #26143 from rahulsmahadev/SPARK-29494. Lead-authored-by: Rahul Mahadev <rahul.mahadev@databricks.com> Co-authored-by: Rahul Shivu Mahadev <51690557+rahulsmahadev@users.noreply.github.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-10-18 16:45:25 -05:00
DB Tsai	23f45f1822	[SPARK-29515][CORE] MapStatuses SerDeser Benchmark ### What changes were proposed in this pull request? Add benchmark code for MapStatuses serialization & deserialization performance. ### Why are the changes needed? For comparing the performance differences against optimization. ### Does this PR introduce any user-facing change? No ### How was this patch tested? No test is required. Closes #26169 from dbtsai/benchmark. Lead-authored-by: DB Tsai <d_tsai@apple.com> Co-authored-by: Dongjoon Hyun <dhyun@apple.com> Co-authored-by: DB Tsai <dbtsai@dbtsai.com> Signed-off-by: DB Tsai <d_tsai@apple.com>	2019-10-18 21:30:36 +00:00
angerszhu	9a3dccae72	[SPARK-29379][SQL] SHOW FUNCTIONS show '!=', '<>' , 'between', 'case' ### What changes were proposed in this pull request? Currently, Spark SQL `SHOW FUNCTIONS` doesn't show `!=`, `<>`, `between`, `case`, but these expressions are truly functions. We should show them in SQL `SHOW FUNCTIONS`. ### Why are the changes needed? SHOW FUNCTIONS show '!=', '<>' , 'between', 'case' ### Does this PR introduce any user-facing change? SHOW FUNCTIONS show '!=', '<>' , 'between', 'case' ### How was this patch tested? UT Closes #26053 from AngersZhuuuu/SPARK-29379. Authored-by: angerszhu <angers.zhu@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-10-19 00:19:56 +08:00
Yuming Wang	9e42c52c77	[MINOR][DOCS] Fix incorrect EqualNullSafe symbol in sql-migration-guide.md ### What changes were proposed in this pull request? This PR fixes the incorrect `EqualNullSafe` symbol in `sql-migration-guide.md`. ### Why are the changes needed? Fix documentation error. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? N/A Closes #26163 from wangyum/EqualNullSafe-symbol. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-10-18 10:58:17 -05:00
Maxim Gekk	77fe8a8e7c	[SPARK-28420][SQL] Support the `INTERVAL` type in `date_part()` ### What changes were proposed in this pull request? The `date_part()` function can accept the `source` parameter of the `INTERVAL` type (`CalendarIntervalType`). The following values of the `field` parameter are supported: - `"MILLENNIUM"` (`"MILLENNIA"`, `"MIL"`, `"MILS"`) - number of millenniums in the given interval. It is `YEAR / 1000`. - `"CENTURY"` (`"CENTURIES"`, `"C"`, `"CENT"`) - number of centuries in the interval calculated as `YEAR / 100`. - `"DECADE"` (`"DECADES"`, `"DEC"`, `"DECS"`) - decades in the `YEAR` part of the interval calculated as `YEAR / 10`. - `"YEAR"` (`"Y"`, `"YEARS"`, `"YR"`, `"YRS"`) - years in a value of `CalendarIntervalType`. It is `MONTHS / 12`. - `"QUARTER"` (`"QTR"`) - a quarter of year calculated as `MONTHS / 3 + 1` - `"MONTH"` (`"MON"`, `"MONS"`, `"MONTHS"`) - the months part of the interval calculated as `CalendarInterval.months % 12` - `"DAY"` (`"D"`, `"DAYS"`) - total number of days in `CalendarInterval.microseconds` - `"HOUR"` (`"H"`, `"HOURS"`, `"HR"`, `"HRS"`) - the hour part of the interval. - `"MINUTE"` (`"M"`, `"MIN"`, `"MINS"`, `"MINUTES"`) - the minute part of the interval. - `"SECOND"` (`"S"`, `"SEC"`, `"SECONDS"`, `"SECS"`) - the seconds part with fractional microsecond part. - `"MILLISECONDS"` (`"MSEC"`, `"MSECS"`, `"MILLISECON"`, `"MSECONDS"`, `"MS"`) - the millisecond part of the interval with fractional microsecond part. - `"MICROSECONDS"` (`"USEC"`, `"USECS"`, `"USECONDS"`, `"MICROSECON"`, `"US"`) - the total number of microseconds in the `second`, `millisecond` and `microsecond` parts of the given interval. - `"EPOCH"` - the total number of seconds in the interval including the fractional part with microsecond precision. Here we assume 365.25 days per year (leap year every four years). 
For example: ```sql > SELECT date_part('days', interval 1 year 10 months 5 days); 5 > SELECT date_part('seconds', interval 30 seconds 1 milliseconds 1 microseconds); 30.001001 ``` ### Why are the changes needed? To maintain feature parity with PostgreSQL (https://www.postgresql.org/docs/11/functions-datetime.html#FUNCTIONS-DATETIME-EXTRACT) ### Does this PR introduce any user-facing change? No ### How was this patch tested? - Added new test suite `IntervalExpressionsSuite` - Add new test cases to `date_part.sql` Closes #25981 from MaxGekk/extract-from-intervals. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-10-18 23:54:59 +08:00
jiake	c3a0d02a40	[SPARK-28560][SQL][FOLLOWUP] resolve the remaining comments for PR#25295 ### What changes were proposed in this pull request? A followup of [#25295](https://github.com/apache/spark/pull/25295). 1) change the logWarning to logDebug in `OptimizeLocalShuffleReader`. 2) update the test to check whether query stage reuse can work well with local shuffle reader. ### Why are the changes needed? make code robust ### Does this PR introduce any user-facing change? No ### How was this patch tested? existing tests Closes #26157 from JkSelf/followup-25295. Authored-by: jiake <ke.a.jia@intel.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-10-18 23:16:58 +08:00
Terry Kim	39af51dbc6	[SPARK-29014][SQL] DataSourceV2: Fix current/default catalog usage ### What changes were proposed in this pull request? The handling of the catalog across plans should be as follows ([SPARK-29014](https://issues.apache.org/jira/browse/SPARK-29014)): * The current catalog should be used when no catalog is specified * The default catalog is the catalog that the current catalog is initialized to * If the default catalog is not set, then the current catalog is the built-in Spark session catalog. This PR addresses the issue where current catalog usage is not followed as described above. ### Why are the changes needed? It is a bug as described in the previous section. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Unit tests added. Closes #26120 from imback82/cleanup_catalog. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-10-18 22:45:42 +08:00
Wenchen Fan	74351468de	[SPARK-29482][SQL] ANALYZE TABLE should look up catalog/table like v2 commands ### What changes were proposed in this pull request? Add `AnalyzeTableStatement` and `AnalyzeColumnStatement`, and make ANALYZE TABLE go through the same catalog/table resolution framework of v2 commands. ### Why are the changes needed? It's important to make all the commands have the same table resolution behavior, to avoid confusing end-users. e.g. ``` USE my_catalog DESC t // success and describe the table t from my_catalog ANALYZE TABLE t // report table not found as there is no table t in the session catalog ``` ### Does this PR introduce any user-facing change? yes. When running ANALYZE TABLE, Spark fails the command if the current catalog is set to a v2 catalog, or the table name specified a v2 catalog. ### How was this patch tested? new tests Closes #26129 from cloud-fan/analyze-table. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>	2019-10-18 12:55:49 +02:00
zhengruifeng	dba673f0e3	[SPARK-29489][ML][PYSPARK] ml.evaluation support log-loss ### What changes were proposed in this pull request? `ml.MulticlassClassificationEvaluator` & `mllib.MulticlassMetrics` support log-loss ### Why are the changes needed? log-loss is an important classification metric and is widely used in practice ### Does this PR introduce any user-facing change? Yes, add new option ("logloss") and a related param `eps` ### How was this patch tested? added testsuites & local tests referring to sklearn Closes #26135 from zhengruifeng/logloss. Authored-by: zhengruifeng <ruifengz@foxmail.com> Signed-off-by: zhengruifeng <ruifengz@foxmail.com>	2019-10-18 17:57:13 +08:00
Huaxin Gao	6f8c001c8d	[SPARK-29381][FOLLOWUP][PYTHON][ML] Add 'private' _XXXParams classes for classification & regression ### What changes were proposed in this pull request? Add private _XXXParams classes for classification & regression ### Why are the changes needed? To keep parity between scala and python ### Does this PR introduce any user-facing change? Yes. Add getters/setters for the following Model classes ``` LinearSVCModel: get/setRegParam get/setMaxIter get/setFitIntercept get/setTol get/setStandardization get/setWeightCol get/setAggregationDepth get/setThreshold LogisticRegressionModel: get/setRegParam get/setElasticNetParam get/setMaxIter get/setFitIntercept get/setTol get/setStandardization get/setWeightCol get/setAggregationDepth get/setThreshold NaiveBayesModel: get/setWeightCol LinearRegressionModel: get/setRegParam get/setElasticNetParam get/setMaxIter get/setTol get/setFitIntercept get/setStandardization get/setWeight get/setSolver get/setAggregationDepth get/setLoss GeneralizedLinearRegressionModel: get/setFitIntercept get/setMaxIter get/setTol get/setRegParam get/setWeightCol get/setSolver ``` ### How was this patch tested? Add a few doctest Closes #26142 from huaxingao/spark-29381. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: zhengruifeng <ruifengz@foxmail.com>	2019-10-18 17:26:54 +08:00
Liang-Chi Hsieh	5692680e37	[SPARK-29295][SQL] Insert overwrite to Hive external table partition should delete old data ### What changes were proposed in this pull request? This patch proposes to delete the old Hive external partition directory even if the partition does not exist in Hive, when doing insert overwrite to a Hive external table partition. ### Why are the changes needed? When insert overwrite to a Hive external table partition, if the partition does not exist, Hive will not check if the external partition directory exists or not before copying files. So if users drop the partition, and then do insert overwrite to the same partition, the partition will have both old and new data. For example: ```scala withSQLConf(HiveUtils.CONVERT_METASTORE_PARQUET.key -> "false") { // test is an external Hive table. sql("INSERT OVERWRITE TABLE test PARTITION(name='n1') SELECT 1") sql("ALTER TABLE test DROP PARTITION(name='n1')") sql("INSERT OVERWRITE TABLE test PARTITION(name='n1') SELECT 2") sql("SELECT id FROM test WHERE name = 'n1' ORDER BY id") // Got both 1 and 2. } ``` ### Does this PR introduce any user-facing change? Yes. This fixes a correctness issue when users drop a partition on a Hive external table and then insert overwrite it. ### How was this patch tested? Added test. Closes #25979 from viirya/SPARK-29295. Lead-authored-by: Liang-Chi Hsieh <viirya@gmail.com> Co-authored-by: Liang-Chi Hsieh <liangchi@uber.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-10-18 16:35:44 +08:00
Kent Yao	ef4c298cc9	[SPARK-29405][SQL] Alter table / Insert statements should not change a table's ownership ### What changes were proposed in this pull request? In this change, we give preference to the original table's owner if it is not empty. ### Why are the changes needed? When executing 'insert into/overwrite ...' DML, or 'alter table set tblproperties ...' DDL, Spark would change the ownership of the table to the one who runs the Spark application. ### Does this PR introduce any user-facing change? NO ### How was this patch tested? Compare with the behavior of Apache Hive Closes #26068 from yaooqinn/SPARK-29405. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-10-18 16:21:31 +08:00
stczwd	78b0cbe265	[SPARK-29444] Add configuration to support JacksonGenerator to keep fields with null values ### Why are the changes needed? As mentioned in jira, sometimes we need to be able to support the retention of null columns when writing JSON. For example, sparkmagic (used widely in jupyter with livy) will generate sql query results based on DataSet.toJSON and parse JSON to pandas DataFrame to display. If there is a null column, it is easy to have some columns missing or even an empty query result. The loss of the null column in the first row may cause parsing exceptions or loss of entire column data. ### Does this PR introduce any user-facing change? Example in spark-shell. scala> spark.sql("select null as a, 1 as b").toJSON.collect.foreach(println) {"b":1} scala> spark.sql("set spark.sql.jsonGenerator.struct.ignore.null=false") res2: org.apache.spark.sql.DataFrame = [key: string, value: string] scala> spark.sql("select null as a, 1 as b").toJSON.collect.foreach(println) {"a":null,"b":1} ### How was this patch tested? Add new test to JacksonGeneratorSuite Closes #26098 from stczwd/json. Lead-authored-by: stczwd <qcsd2011@163.com> Co-authored-by: Jackey Lee <qcsd2011@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-10-18 16:06:54 +08:00
Dilip Biswal	ec5d698d99	[SPARK-29092][SQL] Report additional information about DataSourceScanExec in EXPLAIN FORMATTED # What changes were proposed in this pull request? Currently we report only output attributes of a scan while doing EXPLAIN FORMATTED. This PR implements the ```verboseStringWithOperatorId``` in DataSourceScanExec to report additional information about a scan such as pushed down filters, partition filters, location etc. SQL ``` EXPLAIN FORMATTED SELECT key, max(val) FROM explain_temp1 WHERE key > 0 GROUP BY key ORDER BY key ``` Before ``` == Physical Plan == * Sort (9) +- Exchange (8) +- * HashAggregate (7) +- Exchange (6) +- * HashAggregate (5) +- * Project (4) +- * Filter (3) +- * ColumnarToRow (2) +- Scan parquet default.explain_temp1 (1) (1) Scan parquet default.explain_temp1 Output: [key#x, val#x] .... .... .... ``` After ``` == Physical Plan == * Sort (9) +- Exchange (8) +- * HashAggregate (7) +- Exchange (6) +- * HashAggregate (5) +- * Project (4) +- * Filter (3) +- * ColumnarToRow (2) +- Scan parquet default.explain_temp1 (1) (1) Scan parquet default.explain_temp1 Output: [key#x, val#x] Batched: true DataFilters: [isnotnull(key#x), (key#x > 0)] Format: Parquet Location: InMemoryFileIndex[file:/tmp/apache/spark/spark-warehouse/explain_temp1] PushedFilters: [IsNotNull(key), GreaterThan(key,0)] ReadSchema: struct<key:int,val:int> ... ... ... ``` ### Why are the changes needed? ### Does this PR introduce any user-facing change? ### How was this patch tested? Closes #26042 from dilipbiswal/verbose_string_datasrc_scanexec. Authored-by: Dilip Biswal <dkbiswal@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-10-18 15:53:13 +08:00
Yuanjian Li	8616109061	[SPARK-9853][CORE][FOLLOW-UP] Regularize all the shuffle configurations related to adaptive execution ### What changes were proposed in this pull request? 1. Regularize all the shuffle configurations related to adaptive execution. 2. Add default value for `BlockStoreShuffleReader.shouldBatchFetch`. ### Why are the changes needed? It's a follow-up PR for #26040. Regularize the existing `spark.sql.adaptive.shuffle` namespace in SQLConf. ### Does this PR introduce any user-facing change? Rename one released user config `spark.sql.adaptive.minNumPostShufflePartitions` to `spark.sql.adaptive.shuffle.minNumPostShufflePartitions`; the other changed configs are not released yet. ### How was this patch tested? Existing UT. Closes #26147 from xuanyuanking/SPARK-9853. Authored-by: Yuanjian Li <xyliyuanjian@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-10-18 15:39:35 +08:00
Huaxin Gao	901ff92969	[SPARK-29464][PYTHON][ML] PySpark ML should expose Params.clear() to unset a user supplied Param ### What changes were proposed in this pull request? change PySpark ml ```Params._clear``` to ```Params.clear``` ### Why are the changes needed? PySpark ML currently has a private _clear() method that will unset a param. This should be made public to match the Scala API and give users a way to unset a user supplied param. ### Does this PR introduce any user-facing change? Yes. PySpark ml ```Params._clear``` ---> ```Params.clear``` ### How was this patch tested? Add test. Closes #26130 from huaxingao/spark-29464. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Bryan Cutler <cutlerb@gmail.com>	2019-10-17 17:02:31 -07:00
Ivan Gozali	00347a3c78	[SPARK-28762][CORE] Read JAR main class if JAR is not located in local file system ### What changes were proposed in this pull request? JIRA: https://issues.apache.org/jira/browse/SPARK-28762 TL;DR: Automatically read the `Main-Class` from a JAR's manifest even if the JAR isn't in the local file system (i.e. in S3 or HDFS). ### Why are the changes needed? When deploying a fat JAR (e.g. using `sbt-assembly`) to S3/HDFS, users might choose to include the main class for the JAR in its manifest. This change allows the user to `spark-submit` the JAR without having to specify the main class again via the `--class` argument. ### Does this PR introduce any user-facing change? Yes. Previously, if the primary resource is a JAR and isn't located in the local file system, it will fail with the error: ``` $ spark-submit s3a://nonexistent.jar Exception in thread "main" org.apache.spark.SparkException: Cannot load main class from JAR s3a://nonexistent.jar with URI s3a. Please specify a class through --class. ... ``` With this PR, the main class will be read from the manifest, assuming the classpath contains the appropriate JAR to read the file system. ### How was this patch tested? Added some tests in `core/src/test/scala/org/apache/spark/deploy/SparkSubmitSuite.scala`. Closes #25910 from igozali/SPARK-28762. Authored-by: Ivan Gozali <gozaliivan@gmail.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>	2019-10-17 14:36:01 -07:00
igor.calabria	78bdcfade1	[SPARK-27812][K8S] Bump K8S client version to 4.6.1 ### What changes were proposed in this pull request? Updated kubernetes client. ### Why are the changes needed? https://issues.apache.org/jira/browse/SPARK-27812 https://issues.apache.org/jira/browse/SPARK-27927 We need this fix https://github.com/fabric8io/kubernetes-client/pull/1768 that was released on version 4.6 of the client. The root cause of the problem is better explained in https://github.com/apache/spark/pull/25785 ### Does this PR introduce any user-facing change? Nope, it should be transparent to users ### How was this patch tested? This patch was tested manually using a simple pyspark job ```python from pyspark.sql import SparkSession if __name__ == '__main__': spark = SparkSession.builder.getOrCreate() ``` The expected behaviour of this "job" is that both python's and jvm's process exit automatically after the main runs. This is the case for spark versions <= 2.4. On version 2.4.3, the jvm process hangs because there's a non daemon thread running ``` "OkHttp WebSocket https://10.96.0.1/..." #121 prio=5 os_prio=0 tid=0x00007fb27c005800 nid=0x24b waiting on condition [0x00007fb300847000] "OkHttp WebSocket https://10.96.0.1/..." #117 prio=5 os_prio=0 tid=0x00007fb28c004000 nid=0x247 waiting on condition [0x00007fb300e4b000] ``` This is caused by a bug on `kubernetes-client` library, which is fixed on the version that we are upgrading to. When the mentioned job is run with this patch applied, the behaviour from spark <= 2.4.3 is restored and both processes terminate successfully Closes #26093 from igorcalabria/k8s-client-update. Authored-by: igor.calabria <igor.calabria@ubee.in> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-10-17 12:23:24 -07:00
Jungtaek Lim (HeartSaVioR)	100fc58da5	[SPARK-28869][CORE] Roll over event log files ### What changes were proposed in this pull request? This patch is a part of [SPARK-28594](https://issues.apache.org/jira/browse/SPARK-28594) and design doc for SPARK-28594 is linked here: https://docs.google.com/document/d/12bdCC4nA58uveRxpeo8k7kGOI2NRTXmXyBOweSi4YcY/edit?usp=sharing This patch proposes adding new feature to event logging, rolling event log files via configured file size. Previously event logging is done with single file and related codebase (`EventLoggingListener`/`FsHistoryProvider`) is tightly coupled with it. This patch adds layer on both reader (`EventLogFileReader`) and writer (`EventLogFileWriter`) to decouple implementation details between "handling events" and "how to read/write events from/to file". This patch adds two properties, `spark.eventLog.rollLog` and `spark.eventLog.rollLog.maxFileSize` which provides configurable behavior of rolling log. The feature is disabled by default, as we only expect huge event log for huge/long-running application. For other cases single event log file would be sufficient and still simpler. ### Why are the changes needed? This is a part of SPARK-28594 which addresses event log growing infinitely for long-running application. This patch itself also provides some option for the situation where event log file gets huge and consume their storage. End users may give up replaying their events and want to delete the event log file, but given application is still running and writing the file, it's not safe to delete the file. End users will be able to delete some of old files after applying rolling over event log. ### Does this PR introduce any user-facing change? No, as the new feature is turned off by default. ### How was this patch tested? Added unit tests, as well as basic manual tests. 
Basic manual tests - ran SHS, ran structured streaming query with roll event log enabled, verified split files are generated as well as SHS can load these files, with handling app status as incomplete/complete. Closes #25670 from HeartSaVioR/SPARK-28869. Lead-authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com> Co-authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com> Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>	2019-10-17 11:15:25 -07:00
Marcelo Vanzin	2f0a38cb50	[SPARK-29398][CORE] Support dedicated thread pools for RPC endpoints The current RPC backend in Spark supports single- and multi-threaded message delivery to endpoints, but they all share the same underlying thread pool. So an RPC endpoint that blocks a dispatcher thread can negatively affect other endpoints. This can be more pronounced with configurations that limit the number of RPC dispatch threads based on configuration and / or running environment. And exposing the RPC layer to other code (for example with something like SPARK-29396) could make it easy to affect normal Spark operation with a badly written RPC handler. This change adds a new RPC endpoint type that tells the RPC env to create dedicated dispatch threads, so that those effects are minimised. Other endpoints will still need CPU to process their messages, but they won't be able to actively block the dispatch thread of these isolated endpoints. As part of the change, I've changed the most important Spark endpoints (the driver, executor and block manager endpoints) to be isolated from others. This means a couple of extra threads are created on the driver and executor for these endpoints. Tested with existing unit tests, which hammer the RPC system extensively, and also by running applications on a cluster (with a prototype of SPARK-29396). Closes #26059 from vanzin/SPARK-29398. Authored-by: Marcelo Vanzin <vanzin@cloudera.com> Signed-off-by: Imran Rashid <irashid@cloudera.com>	2019-10-17 13:14:32 -05:00
maruilei	f800fa3831	[SPARK-29436][K8S] Support executor for selecting scheduler through scheduler name in the case of k8s multi-scheduler scenario ### What changes were proposed in this pull request? Support executor for selecting scheduler through scheduler name in the case of k8s multi-scheduler scenario. ### Why are the changes needed? If there is no such function, spark can not support the case of k8s multi-scheduler scenario. ### Does this PR introduce any user-facing change? Yes, users can add scheduler name through configuration. ### How was this patch tested? Manually tested with spark + k8s cluster Closes #26088 from merrily01/SPARK-29436. Authored-by: maruilei <maruilei@jd.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-10-17 07:24:13 -07:00
Jiajia Li	dc0bc7a6eb	[MINOR][DOCS] Fix some typos ### What changes were proposed in this pull request? This PR fixes a few typos: 1. Sparks => Spark's 2. parallize => parallelize 3. doesnt => doesn't Closes #26140 from plusplusjiajia/fix-typos. Authored-by: Jiajia Li <jiajia.li@intel.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-10-17 07:22:01 -07:00
Kent Yao	4b902d3b45	[SPARK-29491][SQL] Add bit_count function support ### What changes were proposed in this pull request? BIT_COUNT(N) - Returns the number of bits that are set in the argument N as an unsigned 64-bit integer, or NULL if the argument is NULL ### Why are the changes needed? Supported by MySQL, Microsoft SQL Server, etc. ### Does this PR introduce any user-facing change? add a built-in function ### How was this patch tested? add uts Closes #26139 from yaooqinn/SPARK-29491. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-10-17 20:22:38 +08:00
Yuanjian Li	239ee3f561	[SPARK-9853][CORE] Optimize shuffle fetch of continuous partition IDs This PR takes over #19788. After we split the shuffle fetch protocol from `OpenBlock` in #24565, this optimization can be extended in the new shuffle protocol. Credit to yucai, closes #19788. ### What changes were proposed in this pull request? This PR adds the support for continuous shuffle block fetching in batch: - Shuffle client changes: - Add new feature tag `spark.shuffle.fetchContinuousBlocksInBatch`, implement the decision logic in `BlockStoreShuffleReader`. - Merge the continuous shuffle block ids in batch if needed in ShuffleBlockFetcherIterator. - Shuffle server changes: - Add support in `ExternalBlockHandler` for the external shuffle service side. - Make `ShuffleBlockResolver.getBlockData` accept getting block data by range. - Protocol changes: - Add new block id type `ShuffleBlockBatchId` to represent continuous shuffle block ids. - Extend `FetchShuffleBlocks` and `OneForOneBlockFetcher`. - After the new shuffle fetch protocol was completed in #24565, the backward compatibility for external shuffle service can be controlled by `spark.shuffle.useOldFetchProtocol`. ### Why are the changes needed? In adaptive execution, one reducer may fetch multiple continuous shuffle blocks from one map output file. However, with the original approach, the reducer has to fetch those blocks one by one, which takes many IOs and hurts performance. This PR supports fetching those continuous shuffle blocks in one IO (in batch). See the example below. The shuffle block is stored like below: ![image](https://user-images.githubusercontent.com/2989575/51654634-c37fbd80-1fd3-11e9-935e-5652863676c3.png) The ShuffleId format is s"shuffle_$shuffleId_$mapId_$reduceId", referring to BlockId.scala. In adaptive execution, one reducer may want to read output for reducer 5 to 14, whose block Ids are from shuffle_0_x_5 to shuffle_0_x_14. 
Before this PR, Spark needs 10 disk IOs + 10 network IOs for each output file. After this PR, Spark only needs 1 disk IO and 1 network IO. This reduces IO dramatically. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Add new UT. Integration test with the setting `spark.sql.adaptive.enabled=true`. Closes #26040 from xuanyuanking/SPARK-9853. Lead-authored-by: Yuanjian Li <xyliyuanjian@gmail.com> Co-authored-by: yucai <yyu1@ebay.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-10-17 14:47:56 +08:00
lajin	fda4070ea9	[SPARK-29283][SQL] Error message is hidden when query from JDBC, especially enabled adaptive execution ### What changes were proposed in this pull request? When adaptive execution is enabled, Spark users connected over JDBC always get the generic adaptive execution error, whatever the underlying root cause is. This is very confusing: we have to check the driver log to find out why. ```shell 0: jdbc:hive2://localhost:10000> SELECT * FROM testData join testData2 ON key = v; SELECT * FROM testData join testData2 ON key = v; Error: Error running query: org.apache.spark.SparkException: Adaptive execution failed due to stage materialization failures. (state=,code=0) 0: jdbc:hive2://localhost:10000> ``` For example, when a job queried from JDBC fails due to an HDFS missing block, the user still gets the error message `Adaptive execution failed due to stage materialization failures`. The easiest way to reproduce is to change the code of `AdaptiveSparkPlanExec` so that it throws an exception when it encounters `StageSuccess`. ```scala case class AdaptiveSparkPlanExec( events.drainTo(rem) (Seq(nextMsg) ++ rem.asScala).foreach { case StageSuccess(stage, res) => // stage.resultOption = Some(res) val ex = new SparkException("Wrapper Exception", new IllegalArgumentException("Root cause is IllegalArgumentException for Test")) errors.append( new SparkException(s"Failed to materialize query stage: ${stage.treeString}", ex)) case StageFailure(stage, ex) => errors.append( new SparkException(s"Failed to materialize query stage: ${stage.treeString}", ex)) ``` ### Why are the changes needed? To make the error message more user-friendly and more useful for queries from JDBC. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? 
Manually test query: ```shell 0: jdbc:hive2://localhost:10000> CREATE TEMPORARY VIEW testData (key, value) AS SELECT explode(array(1, 2, 3, 4)), cast(substring(rand(), 3, 4) as string); CREATE TEMPORARY VIEW testData (key, value) AS SELECT explode(array(1, 2, 3, 4)), cast(substring(rand(), 3, 4) as string); +---------+--+ \| Result \| +---------+--+ +---------+--+ No rows selected (0.225 seconds) 0: jdbc:hive2://localhost:10000> CREATE TEMPORARY VIEW testData2 (k, v) AS SELECT explode(array(1, 1, 2, 2)), cast(substring(rand(), 3, 4) as int); CREATE TEMPORARY VIEW testData2 (k, v) AS SELECT explode(array(1, 1, 2, 2)), cast(substring(rand(), 3, 4) as int); +---------+--+ \| Result \| +---------+--+ +---------+--+ No rows selected (0.043 seconds) ``` Before: ```shell 0: jdbc:hive2://localhost:10000> SELECT * FROM testData join testData2 ON key = v; SELECT * FROM testData join testData2 ON key = v; Error: Error running query: org.apache.spark.SparkException: Adaptive execution failed due to stage materialization failures. (state=,code=0) 0: jdbc:hive2://localhost:10000> ``` After: ```shell 0: jdbc:hive2://localhost:10000> SELECT * FROM testData join testData2 ON key = v; SELECT * FROM testData join testData2 ON key = v; Error: Error running query: java.lang.IllegalArgumentException: Root cause is IllegalArgumentException for Test (state=,code=0) 0: jdbc:hive2://localhost:10000> ``` Closes #25960 from LantaoJin/SPARK-29283. Authored-by: lajin <lajin@ebay.com> Signed-off-by: Yuming Wang <wgyumg@gmail.com>	2019-10-16 19:51:56 -07:00
Fokko Driesprong	8eb8f7478c	[SPARK-29483][BUILD] Bump Jackson to 2.10.0 ### What changes were proposed in this pull request? Release blog: https://medium.com/cowtowncoder/jackson-2-10-features-cd880674d8a2 Fixes the following CVE's: https://www.cvedetails.com/cve/CVE-2019-16942/ https://www.cvedetails.com/cve/CVE-2019-16943/ Looking back, there were 3 major goals for this minor release: - Resolve the growing problem of “endless CVE patches”, a stream of fixes for reported CVEs related to “Polymorphic Deserialization” problem (described in “On Jackson CVEs… ”) that resulted in security tools forcing Jackson upgrades. 2.10 now includes “Safe Default Typing” that is hoped to resolve this problem. - Evolve 2.x API towards 3.0, based on changes that were done in master, within limits of 2.x API backwards-compatibility requirements. - Add JDK support for versions beyond Java 8: specifically add “module-info.class” for JDK9+, defining proper module definitions for Jackson components Full changelog: https://github.com/FasterXML/jackson/wiki/Jackson-Release-2.10 Improved Scala 2.13 support: https://github.com/FasterXML/jackson/wiki/Jackson-Release-2.10#scala ### Why are the changes needed? Patches CVE's reported by the vulnerability scanner. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Ran `mvn clean install -DskipTests` locally. Closes #26131 from Fokko/SPARK-29483. Authored-by: Fokko Driesprong <fokko@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-10-16 15:38:54 -07:00

25642 commits