Commit graph

27729 commits

Author SHA1 Message Date
Sean Owen 40ef01283d [SPARK-29802][BUILD] Use python3 in build scripts
### What changes were proposed in this pull request?

Use `/usr/bin/env python3` consistently instead of `/usr/bin/env python` in build scripts, to reliably select Python 3.

### Why are the changes needed?

Scripts no longer work with Python 2.

### Does this PR introduce _any_ user-facing change?

No, should be all build system changes.

### How was this patch tested?

Existing tests / NA

Closes #29151 from srowen/SPARK-29909.2.

Authored-by: Sean Owen <srowen@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-07-19 11:02:37 +09:00
Sean Owen ee624821a9 [SPARK-29292][YARN][K8S][MESOS] Fix Scala 2.13 compilation for remaining modules
### What changes were proposed in this pull request?

See again the related PRs like https://github.com/apache/spark/pull/28971
This completes fixing compilation for 2.13 for all but `repl`, which is a separate task.

### Why are the changes needed?

Eventually, we need to support a Scala 2.13 build, perhaps in Spark 3.1.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing tests. (2.13 was not tested; this is about getting it to compile without breaking 2.12)

Closes #29147 from srowen/SPARK-29292.4.

Authored-by: Sean Owen <srowen@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-07-18 15:08:00 -07:00
Sudharshann D f9f9309bec [SPARK-31579][SQL] Replace floorDiv with /
### What changes were proposed in this pull request?

Replaced `floorDiv` with just `/` in `localRebaseGregorianToJulianDays()` in `spark/sql/catalyst/util/RebaseDateTime.scala`.

### Why are the changes needed?

Makes the logic/code easier to understand, and is slightly more efficient.

### Does this PR introduce _any_ user-facing change?

NO

### How was this patch tested?

Proof of concept [here](https://github.com/apache/spark/pull/28573/files). The operation `utcCal.getTimeInMillis / MILLIS_PER_DAY` results in an integer value already.
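
A minimal sketch of why the replacement is safe, assuming (as the proof of concept argues) that the dividend is already a whole number of days; the values below are hypothetical:

```scala
// Sketch only, not the actual RebaseDateTime code.
val MILLIS_PER_DAY: Long = 24L * 60 * 60 * 1000

val dayAlignedMillis = 18000L * MILLIS_PER_DAY  // hypothetical, already day-aligned
// For an exact (non-negative) division, floorDiv and plain `/` agree.
assert(Math.floorDiv(dayAlignedMillis, MILLIS_PER_DAY) == dayAlignedMillis / MILLIS_PER_DAY)

// They only differ for negative, non-exact divisions.
assert(Math.floorDiv(-1L, MILLIS_PER_DAY) == -1L)
assert(-1L / MILLIS_PER_DAY == 0L)
```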

Closes #29008 from Sudhar287/SPARK-31579.

Authored-by: Sudharshann D <sudhar287@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-07-18 13:04:58 -05:00
Prakhar Jain 0678afe393 [SPARK-21040][CORE] Speculate tasks which are running on decommission executors
### What changes were proposed in this pull request?
This PR adds functionality to consider the running tasks on decommission executors based on some config.
In spark-on-cloud, we sometimes already know that an executor won't be alive for more than a fixed amount of time. For example, with AWS Spot nodes, once we get the notification, we know that a node will be gone in 120 seconds.
So if the running tasks on the decommissioning executors may run beyond currentTime+120 seconds, they are candidates for speculation.

### Why are the changes needed?
Currently, when an executor is decommissioned, we stop scheduling new tasks on it, but the already running tasks keep running. Based on the cloud, we might know beforehand that an executor won't be alive for more than a preconfigured time. Different cloud providers give different timeouts before they take away the nodes. For example, in the case of AWS Spot nodes, an executor won't be alive for more than 120 seconds. We can utilize this information in cloud environments and make better decisions about speculating the already running tasks on decommissioned executors.

### Does this PR introduce _any_ user-facing change?
Yes. This PR adds a new config `spark.executor.decommission.killInterval`, which users can explicitly set based on the cloud environment where they are running.
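
A hedged usage sketch; the application name is hypothetical and the time-string value format is an assumption, not taken from this PR:

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: speculate tasks on decommissioning executors when the cloud
// provider reclaims nodes ~120 seconds after the decommission notification.
val spark = SparkSession.builder()
  .appName("decommission-speculation-example")                  // hypothetical
  .config("spark.executor.decommission.killInterval", "120s")   // value format assumed
  .getOrCreate()
```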

### How was this patch tested?
Added UT.

Closes #28619 from prakharjain09/SPARK-21040-speculate-decommission-exec-tasks.

Authored-by: Prakhar Jain <prakharjain09@gmail.com>
Signed-off-by: Holden Karau <hkarau@apple.com>
2020-07-17 16:11:02 -07:00
William Hyun 7dc1d8917d [SPARK-32353][TEST] Update docker/spark-test and clean up unused stuff
### What changes were proposed in this pull request?
This PR aims to update the docker/spark-test and clean up unused stuff.

### Why are the changes needed?
Since Spark 3.0.0, Java 11 is supported. We had better use the latest Java and OS.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?
Manually do the following as described in https://github.com/apache/spark/blob/master/external/docker/spark-test/README.md .

```
docker run -v $SPARK_HOME:/opt/spark spark-test-master
docker run -v $SPARK_HOME:/opt/spark spark-test-worker spark://<master_ip>:7077
```

Closes #29150 from williamhyun/docker.

Authored-by: William Hyun <williamhyun3@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-07-17 12:05:45 -07:00
zhengruifeng 3a60b41949 [SPARK-32298][ML] tree models prediction optimization
### What changes were proposed in this pull request?
Use a while-loop instead of the recursive approach.
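
A hedged sketch of the idea, using simplified node types rather than the actual `org.apache.spark.ml.tree` classes:

```scala
sealed trait Node
case class LeafNode(prediction: Double) extends Node
case class InternalNode(left: Node, right: Node, goLeft: Double => Boolean) extends Node

// Iterative descent: walk from the root to a leaf without recursion.
def predict(root: Node, feature: Double): Double = {
  var node = root
  while (node.isInstanceOf[InternalNode]) {
    val n = node.asInstanceOf[InternalNode]
    node = if (n.goLeft(feature)) n.left else n.right
  }
  node.asInstanceOf[LeafNode].prediction
}

val tree = InternalNode(LeafNode(0.0), LeafNode(1.0), _ < 0.5)
assert(predict(tree, 0.1) == 0.0 && predict(tree, 0.9) == 1.0)
```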

### Why are the changes needed?
3% ~ 10% faster

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing test suites.

Closes #29095 from zhengruifeng/tree_pred_opt.

Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-07-17 12:00:49 -05:00
williamhyun 5daf244d0f [SPARK-32329][TESTS] Rename HADOOP2_MODULE_PROFILES to HADOOP_MODULE_PROFILES
### What changes were proposed in this pull request?

This PR aims to rename `HADOOP2_MODULE_PROFILES` to `HADOOP_MODULE_PROFILES` because Hadoop 3 is now the default.

### Why are the changes needed?

Hadoop 3 is now the default.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass GitHub Action dependency test.

Closes #29128 from williamhyun/williamhyun-patch-3.

Authored-by: williamhyun <62487364+williamhyun@users.noreply.github.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-07-17 11:59:19 -05:00
Yaroslav Tkachenko 34baed8139 [SPARK-30616][SQL] Introduce TTL config option for SQL Metadata Cache
### What changes were proposed in this pull request?
New `spark.sql.metadataCacheTTLSeconds` option that adds time-to-live cache behaviour to the existing caches in `FileStatusCache` and `SessionCatalog`.

### Why are the changes needed?
Currently Spark [caches file listing for tables](https://spark.apache.org/docs/2.4.4/sql-data-sources-parquet.html#metadata-refreshing) and requires issuing `REFRESH TABLE` any time the file listing has changed outside of Spark. Unfortunately, simply submitting `REFRESH TABLE` commands could be very cumbersome. Assuming frequently added files, hundreds of tables and dozens of users querying the data (and expecting up-to-date results), manually refreshing metadata for each table is not a solution.

This is a pretty common use-case for streaming ingestion of data, which can be done outside of Spark (with tools like Kafka Connect, etc.).

A similar feature exists in Presto: `hive.file-status-cache-expire-time` can be found [here](https://prestosql.io/docs/current/connector/hive.html#hive-configuration-properties).

### Does this PR introduce _any_ user-facing change?
Yes, it's controlled with the new `spark.sql.metadataCacheTTLSeconds` option.

When it's set to `-1` (by default), the behaviour of caches doesn't change, so it stays _backwards-compatible_.

Otherwise, you can specify a value in seconds, for example `spark.sql.metadataCacheTTLSeconds: 60` means 1-minute cache TTL.
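
A hedged usage sketch; the config name is taken from this PR, the application name is hypothetical:

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: expire cached file listings and catalog metadata after 60 seconds.
val spark = SparkSession.builder()
  .appName("metadata-cache-ttl-example")        // hypothetical
  .config("spark.sql.metadataCacheTTLSeconds", "60")
  .getOrCreate()
```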

### How was this patch tested?

Added new tests in:

- FileIndexSuite
- SessionCatalogSuite

Closes #28852 from sap1ens/SPARK-30616-metadata-cache-ttl.

Authored-by: Yaroslav Tkachenko <sapiensy@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-07-17 13:40:54 +00:00
Devesh Agrawal ffdbbae1d4 [SPARK-32215] Expose a (protected) /workers/kill endpoint on the MasterWebUI
### What changes were proposed in this pull request?

This PR allows an external agent to inform the Master that certain hosts
are being decommissioned.

### Why are the changes needed?

The current decommissioning is triggered by the Worker getting a SIGPWR
(out of band, possibly by some cleanup hook), which then informs the Master
about it. This approach may not be feasible in some environments that cannot
trigger a clean-up hook on the Worker. In addition, when a large number of
worker nodes are being decommissioned, the Master will get a flood of
messages.

So we add a new post endpoint `/workers/kill` on the MasterWebUI that allows an
external agent to inform the master about all the nodes being decommissioned in
bulk. The list of nodes is specified by providing a list of hostnames. All workers on those
hosts will be decommissioned.

This API is merely a new entry point into the existing decommissioning
logic. It does not change how the decommissioning request is handled in
its core.

### Does this PR introduce _any_ user-facing change?

Yes, a new endpoint `/workers/kill` is added to the MasterWebUI. By default only
requests originating from an IP address local to the MasterWebUI are allowed.
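
A hedged sketch of calling the endpoint; the master URL and the form field name (`host`) are assumptions for illustration, not taken from this PR:

```scala
import java.net.{HttpURLConnection, URL}
import java.nio.charset.StandardCharsets

// Sketch only: ask the Master to decommission all workers on two hosts.
val conn = new URL("http://spark-master:8080/workers/kill")       // hypothetical URL
  .openConnection().asInstanceOf[HttpURLConnection]
conn.setRequestMethod("POST")
conn.setDoOutput(true)
val body = "host=worker-1.example.com&host=worker-2.example.com"  // field name assumed
conn.getOutputStream.write(body.getBytes(StandardCharsets.UTF_8))
println(s"Response code: ${conn.getResponseCode}")
```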

### How was this patch tested?

Added unit tests

Closes #29015 from agrawaldevesh/master_decom_endpoint.

Authored-by: Devesh Agrawal <devesh.agrawal@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-07-17 06:04:34 +00:00
Kent Yao efa70b8755 [SPARK-32145][SQL][FOLLOWUP] Fix typo in the error log of SparkOperation
### What changes were proposed in this pull request?

Fix a typo in the error log of the SparkOperation trait, reported in https://github.com/apache/spark/pull/28963#discussion_r454954542

### Why are the changes needed?

Fix the error message in the Thrift server driver log.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Passing GitHub actions

Closes #29140 from yaooqinn/SPARK-32145-F.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-07-17 04:50:26 +00:00
HyukjinKwon ea9e8f365a [SPARK-32094][PYTHON] Update cloudpickle to v1.5.0
### What changes were proposed in this pull request?

This PR aims to upgrade PySpark's embedded cloudpickle to the latest cloudpickle v1.5.0 (See https://github.com/cloudpipe/cloudpickle/blob/v1.5.0/cloudpickle/cloudpickle.py)

### Why are the changes needed?

There are many bug fixes. For example, the bug described in the JIRA:

dill unpickling fails because cloudpickle defines `types.ClassType`, which is undefined in dill. This results in the following error:

```
Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/apache_beam/internal/pickler.py", line 279, in loads
    return dill.loads(s)
  File "/usr/local/lib/python3.6/site-packages/dill/_dill.py", line 317, in loads
    return load(file, ignore)
  File "/usr/local/lib/python3.6/site-packages/dill/_dill.py", line 305, in load
    obj = pik.load()
  File "/usr/local/lib/python3.6/site-packages/dill/_dill.py", line 577, in _load_type
    return _reverse_typemap[name]
KeyError: 'ClassType'
```

See also https://github.com/cloudpipe/cloudpickle/issues/82. This was fixed for cloudpickle 1.3.0+ (https://github.com/cloudpipe/cloudpickle/pull/337), but PySpark's cloudpickle.py doesn't have this change yet.

More notably, it now supports the C pickle implementation with Python 3.8, which hugely improves performance. This is already adopted in other projects such as Ray.

### Does this PR introduce _any_ user-facing change?

Yes, the bug fixes described above. Internally, users can also leverage the fast cloudpickle backed by the C pickle implementation.

### How was this patch tested?

Jenkins will test it out.

Closes #29114 from HyukjinKwon/SPARK-32094.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-07-17 11:49:18 +09:00
Frank Yin 9747e8fc9d [SPARK-31831][SQL][TESTS][FOLLOWUP] Put mocks for HiveSessionImplSuite in hive version related subdirectories
### What changes were proposed in this pull request?

This patch fixes the build issue on Hive 1.2 profile brought by #29069, via putting mocks for HiveSessionImplSuite in hive version related subdirectories, so that maven build will pick up the proper source code according to the profile.

### Why are the changes needed?

#29069 fixed the flakiness of HiveSessionImplSuite, but given the patch relied on the default profile (Hive 2.3) it broke the build with Hive 1.2 profile. This patch addresses both Hive versions.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Manually confirmed the test suite via below command:

> Hive 1.2
```
build/mvn -Dtest=none -DwildcardSuites=org.apache.spark.sql.hive.thriftserver.HiveSessionImplSuite test -Phive-1.2 -Phadoop-2.7 -Phive-thriftserver
```

> Hive 2.3

```
build/mvn -Dtest=none -DwildcardSuites=org.apache.spark.sql.hive.thriftserver.HiveSessionImplSuite test -Phive-2.3 -Phadoop-3.2 -Phive-thriftserver
```

Closes #29129 from frankyin-factual/hive-tests.

Authored-by: Frank Yin <frank@factual.com>
Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
2020-07-17 11:14:25 +09:00
Dongjoon Hyun fb51925123 [SPARK-32335][K8S][TESTS] Remove Python2 test from K8s IT
### What changes were proposed in this pull request?

This PR aims to remove Python 2 test case from K8s IT.

### Why are the changes needed?

Since Apache Spark 3.1.0 dropped Python 2.7, 3.4 and 3.5 support officially via SPARK-32138, K8s IT fails.

```
KubernetesSuite:
- Run SparkPi with no resources
- Run SparkPi with a very long application name.
- Use SparkLauncher.NO_RESOURCE
- Run SparkPi with a master URL without a scheme.
- Run SparkPi with an argument.
- Run SparkPi with custom labels, annotations, and environment variables.
- All pods have the same service account by default
- Run extraJVMOptions check on driver
- Run SparkRemoteFileTest using a remote data file
- Run SparkPi with env and mount secrets.
- Run PySpark on simple pi.py example
- Run PySpark with Python2 to test a pyfiles example *** FAILED ***
  The code passed to eventually never returned normally. Attempted 113 times over 2.0014854648999996 minutes. Last failure message: false was not true. (KubernetesSuite.scala:370)
- Run PySpark with Python3 to test a pyfiles example
- Run PySpark with memory customization
- Run in client mode.
- Start pod creation from template
- PVs with local storage
- Launcher client dependencies
- Test basic decommissioning
- Run SparkR on simple dataframe.R example
Run completed in 11 minutes, 15 seconds.
Total number of tests run: 20
Suites: completed 2, aborted 0
Tests: succeeded 19, failed 1, canceled 0, ignored 0, pending 0
*** 1 TEST FAILED ***
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass Jenkins K8s IT.

Closes #29136 from dongjoon-hyun/SPARK-32335.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-07-16 11:21:14 -07:00
Huaxin Gao 383f5e9cbe [SPARK-32310][ML][PYSPARK] ML params default value parity in classification, regression, clustering and fpm
### What changes were proposed in this pull request?
Set default param values in the shared `...Params` traits in both Scala and Python.
I will do this in two PRs: classification, regression, clustering and fpm are changed in this PR, and the rest will be changed in another PR.

### Why are the changes needed?
Make ML have the same default param values between an estimator and its corresponding transformer, and also between Scala and Python.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing tests

Closes #29112 from huaxingao/set_default.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Huaxin Gao <huaxing@us.ibm.com>
2020-07-16 11:12:29 -07:00
dzlab d5c672af58 [SPARK-32315][ML] Provide an explanation error message when calling require
### What changes were proposed in this pull request?
Small improvement in the error message shown to user https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/util/MLUtils.scala#L537-L538

### Why are the changes needed?
Currently, an exception is thrown but without any specific details on the cause:
```
Caused by: java.lang.IllegalArgumentException: requirement failed
    at scala.Predef$.require(Predef.scala:212)
    at org.apache.spark.mllib.util.MLUtils$.fastSquaredDistance(MLUtils.scala:508)
    at org.apache.spark.mllib.clustering.EuclideanDistanceMeasure$.fastSquaredDistance(DistanceMeasure.scala:232)
    at org.apache.spark.mllib.clustering.EuclideanDistanceMeasure.isCenterConverged(DistanceMeasure.scala:190)
    at org.apache.spark.mllib.clustering.KMeans$$anonfun$runAlgorithm$4.apply(KMeans.scala:336)
    at org.apache.spark.mllib.clustering.KMeans$$anonfun$runAlgorithm$4.apply(KMeans.scala:334)
    at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
    at scala.collection.MapLike$MappedValues$$anonfun$foreach$3.apply(MapLike.scala:245)
    at scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:733)
    at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:130)
    at scala.collection.mutable.HashMap$$anonfun$foreach$1.apply(HashMap.scala:130)
    at scala.collection.mutable.HashTable$class.foreachEntry(HashTable.scala:236)
    at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
    at scala.collection.mutable.HashMap.foreach(HashMap.scala:130)
    at scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:732)
    at scala.collection.MapLike$MappedValues.foreach(MapLike.scala:245)
    at org.apache.spark.mllib.clustering.KMeans.runAlgorithm(KMeans.scala:334)
    at org.apache.spark.mllib.clustering.KMeans.run(KMeans.scala:251)
    at org.apache.spark.mllib.clustering.KMeans.run(KMeans.scala:233)
```
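
A hedged sketch of the kind of change, not the exact `MLUtils.fastSquaredDistance` condition:

```scala
// Before: require(cond) fails with only the bare "requirement failed" message above.
// After: passing a message makes the IllegalArgumentException self-explanatory.
def checkDims(v1Size: Int, v2Size: Int): Unit =
  require(v1Size == v2Size,
    s"Vector dimensions do not match: Dim(v1)=$v1Size, Dim(v2)=$v2Size")

checkDims(3, 3)     // passes silently
// checkDims(3, 4)  // would now fail with a descriptive message
```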

### Does this PR introduce _any_ user-facing change?
Yes, this PR adds an explanatory message to be shown to the user when the requirement check is not met.

### How was this patch tested?
manually

Closes #29115 from dzlab/patch/SPARK-32315.

Authored-by: dzlab <dzlabs@outlook.com>
Signed-off-by: Huaxin Gao <huaxing@us.ibm.com>
2020-07-16 09:25:52 -07:00
Maxim Gekk c1f160e097 [SPARK-30648][SQL] Support filters pushdown in JSON datasource
### What changes were proposed in this pull request?
In the PR, I propose to support pushed down filters in the JSON datasource. The reason for pushing a filter down to `JacksonParser` is to apply the filter as soon as all of its attributes become available, i.e. are converted from JSON field values to the desired values according to the schema. This allows skipping the parsing of the rest of the JSON record and the conversions of other values if the filter returns `false`. This can improve performance when the pushed filters are highly selective and the conversion of JSON string fields to desired values is comparably expensive (for example, the conversion to `TIMESTAMP` values).

The main idea behind `JsonFilters` is to group pushdown filters by their references, convert the grouped filters to expressions, and then compile them to predicates. The predicates are indexed by schema field positions. Each predicate keeps a reference counter of row fields that have not been set yet. As soon as the counter reaches `0`, the predicate can be applied to the row because all its dependencies have been set. Before processing a new row, each predicate's reference counter is reset to the total number of its references (dependencies in a row).

The common code shared between `CSVFilters` and `JsonFilters` is moved to the `StructFilters` class and its companion object.
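
A hedged usage sketch; the config name below is assumed by analogy with the existing CSV option, and the schema/path are hypothetical:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("json-pushdown-example").getOrCreate()
spark.conf.set("spark.sql.json.filterPushdown.enabled", "true")   // assumed config name

// A highly selective filter lets JacksonParser stop converting the remaining
// fields of a record as soon as the predicate evaluates to false.
val df = spark.read
  .schema("id INT, ts TIMESTAMP, payload STRING")                 // hypothetical schema
  .json("/tmp/events.json")                                       // hypothetical path
  .where("id = 42")
df.explain(true)
```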

### Why are the changes needed?
The changes improve performance on synthetic benchmarks up to **27 times** on JDK 8 and **25** times on JDK 11:
```
OpenJDK 64-Bit Server VM 1.8.0_242-8u242-b08-0ubuntu3~18.04-b08 on Linux 4.15.0-1044-aws
Intel(R) Xeon(R) CPU E5-2670 v2  2.50GHz
Filters pushdown:                         Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
w/o filters                                       25230          25255          22          0.0      252299.6       1.0X
pushdown disabled                                 25248          25282          33          0.0      252475.6       1.0X
w/ filters                                          905            911           8          0.1        9047.9      27.9X
```

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
- Added new test suites `JsonFiltersSuite` and `JacksonParserSuite`.
- By new end-to-end and case sensitivity tests in `JsonSuite`.
- By `CSVFiltersSuite`, `UnivocityParserSuite` and `CSVSuite`.
- Re-running `CSVBenchmark` and `JsonBenchmark` using Amazon EC2:

| Item | Description |
| ---- | ----|
| Region | us-west-2 (Oregon) |
| Instance | r3.xlarge (spot instance) |
| AMI | ami-06f2f779464715dc5 (ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1) |
| Java | OpenJDK8/11 installed by`sudo add-apt-repository ppa:openjdk-r/ppa` & `sudo apt install openjdk-11-jdk`|

and `./dev/run-benchmarks`:
```python
#!/usr/bin/env python3

import os
from sparktestsupport.shellutils import run_cmd

benchmarks = [
    ['sql/test', 'org.apache.spark.sql.execution.datasources.csv.CSVBenchmark'],
    ['sql/test', 'org.apache.spark.sql.execution.datasources.json.JsonBenchmark']
]

print('Set SPARK_GENERATE_BENCHMARK_FILES=1')
os.environ['SPARK_GENERATE_BENCHMARK_FILES'] = '1'

for b in benchmarks:
    print("Run benchmark: %s" % b[1])
    run_cmd(['build/sbt', '%s:runMain %s' % (b[0], b[1])])
```

Closes #27366 from MaxGekk/json-filters-pushdown.

Lead-authored-by: Maxim Gekk <max.gekk@gmail.com>
Co-authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-07-17 00:01:13 +09:00
SaurabhChawla 6be8b935a4 [SPARK-32234][SQL] Spark sql commands are failing on selecting the orc tables
### What changes were proposed in this pull request?
Spark SQL commands fail when selecting from ORC tables.
Steps to reproduce:
Example 1 -
Prerequisite: `/Users/test/tpcds_scale5data/date_dim` is the location of ORC data generated by Hive.
```
val table = """CREATE TABLE `date_dim` (
  `d_date_sk` INT,
  `d_date_id` STRING,
  `d_date` TIMESTAMP,
  `d_month_seq` INT,
  `d_week_seq` INT,
  `d_quarter_seq` INT,
  `d_year` INT,
  `d_dow` INT,
  `d_moy` INT,
  `d_dom` INT,
  `d_qoy` INT,
  `d_fy_year` INT,
  `d_fy_quarter_seq` INT,
  `d_fy_week_seq` INT,
  `d_day_name` STRING,
  `d_quarter_name` STRING,
  `d_holiday` STRING,
  `d_weekend` STRING,
  `d_following_holiday` STRING,
  `d_first_dom` INT,
  `d_last_dom` INT,
  `d_same_day_ly` INT,
  `d_same_day_lq` INT,
  `d_current_day` STRING,
  `d_current_week` STRING,
  `d_current_month` STRING,
  `d_current_quarter` STRING,
  `d_current_year` STRING)
USING orc
LOCATION '/Users/test/tpcds_scale5data/date_dim'"""

spark.sql(table).collect

val u = """select date_dim.d_date_id from date_dim limit 5"""

spark.sql(u).collect
```
Example 2

```
  val table = """CREATE TABLE `test_orc_data` (
  `_col1` INT,
  `_col2` STRING,
  `_col3` INT)
  USING orc"""

spark.sql(table).collect

spark.sql("insert into test_orc_data values(13, '155', 2020)").collect

val df = """select _col2 from test_orc_data limit 5"""
spark.sql(df).collect

```

It fails with the error below:
```
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 2.0 failed 1 times, most recent failure: Lost task 0.0 in stage 2.0 (TID 2, 192.168.0.103, executor driver): java.lang.ArrayIndexOutOfBoundsException: 1
    at org.apache.spark.sql.execution.datasources.orc.OrcColumnarBatchReader.initBatch(OrcColumnarBatchReader.java:156)
    at org.apache.spark.sql.execution.datasources.orc.OrcFileFormat.$anonfun$buildReaderWithPartitionValues$7(OrcFileFormat.scala:258)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.org$apache$spark$sql$execution$datasources$FileScanRDD$$anon$$readCurrentFile(FileScanRDD.scala:141)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:203)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:116)
    at org.apache.spark.sql.execution.FileSourceScanExec$$anon$1.hasNext(DataSourceScanExec.scala:620)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.columnartorow_nextBatch_0$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:729)
    at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:343)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2(RDD.scala:895)
    at org.apache.spark.rdd.RDD.$anonfun$mapPartitionsInternal$2$adapted(RDD.scala:895)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:52)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:372)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:336)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:133)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:445)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1489)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:448)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)`
```

The reason is that `initBatch` is not given the schema that is needed to find the column values in `OrcFileFormat.scala`:
```
batchReader.initBatch(
 TypeDescription.fromString(resultSchemaString)
```

### Why are the changes needed?
Spark SQL queries on ORC tables are failing.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
A unit test is added for this. The failing queries were also tested through spark-shell and spark-submit.

Closes #29045 from SaurabhChawla100/SPARK-32234.

Lead-authored-by: SaurabhChawla <saurabhc@qubole.com>
Co-authored-by: SaurabhChawla <s.saurabhtim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-07-16 13:11:47 +00:00
Kent Yao bdeb626c5a [SPARK-32272][SQL] Add SQL standard command SET TIME ZONE
### What changes were proposed in this pull request?

This PR adds the SQL standard command `SET TIME ZONE` for setting the current default time zone displacement of the current SQL session, which is the same as the existing `SET spark.sql.session.timeZone=xxx`.

All in all, this PR adds syntax as following,

```
SET TIME ZONE LOCAL;
SET TIME ZONE 'valid time zone';  -- zone offset or region
SET TIME ZONE INTERVAL XXXX; -- XXXX must be within [-18, +18] hours; this range is wider than the ANSI range of [-14, +14]
```
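
A hedged sketch (in spark-shell) of the equivalence with the existing session conf:

```scala
// Sketch only: the new command mirrors spark.sql.session.timeZone.
spark.sql("SET TIME ZONE 'America/Los_Angeles'")
println(spark.conf.get("spark.sql.session.timeZone"))  // expected: America/Los_Angeles
spark.sql("SET TIME ZONE LOCAL")                       // back to the JVM default zone
```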

### Why are the changes needed?

ANSI compliance, and to give pure SQL users a way to set the session time zone.

### Does this PR introduce _any_ user-facing change?

Yes, new syntax is added.

### How was this patch tested?

Added unit tests.

Also locally verified the reference doc:

![image](https://user-images.githubusercontent.com/8326978/87510244-c8dc3680-c6a5-11ea-954c-b098be84afee.png)

Closes #29064 from yaooqinn/SPARK-32272.

Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-07-16 13:01:53 +00:00
Warren Zhu db47c6e340 [SPARK-32125][UI] Support get taskList by status in Web UI and SHS Rest API
### What changes were proposed in this pull request?
Support fetching taskList by status as below:
```
/applications/[app-id]/stages/[stage-id]/[stage-attempt-id]/taskList?status=failed
```
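
A hedged sketch of querying the endpoint from Scala; the history server host/port and the application/stage ids are hypothetical:

```scala
import scala.io.Source

// Sketch only: fetch the failed tasks of stage 0 (attempt 0) for one application.
val url = "http://localhost:18080/api/v1/applications/app-20200716000000-0001/" +
  "stages/0/0/taskList?status=failed"
println(Source.fromURL(url).mkString)
```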

### Why are the changes needed?

When there are a large number of tasks in one stage, it is hard to get the task list by status with the current API.

### Does this PR introduce _any_ user-facing change?
Yes. Updated monitoring doc.

### How was this patch tested?
Added tests in `HistoryServerSuite`

Closes #28942 from warrenzhu25/SPARK-32125.

Authored-by: Warren Zhu <zhonzh@microsoft.com>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
2020-07-16 11:31:24 +08:00
Sean Owen c28a6fa511 [SPARK-29292][SQL][ML] Update rest of default modules (Hive, ML, etc) for Scala 2.13 compilation
### What changes were proposed in this pull request?

Same as https://github.com/apache/spark/pull/29078 and https://github.com/apache/spark/pull/28971. This makes the rest of the default modules (i.e. those you get without specifying `-Pyarn` etc.) compile under Scala 2.13. As a result, it does not close the JIRA; this also, of course, does not yet demonstrate that tests pass in 2.13.

Note, this does not fix the `repl` module; that's separate.

### Why are the changes needed?

Eventually, we need to support a Scala 2.13 build, perhaps in Spark 3.1.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing tests. (2.13 was not tested; this is about getting it to compile without breaking 2.12)

Closes #29111 from srowen/SPARK-29292.3.

Authored-by: Sean Owen <srowen@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-07-15 13:26:28 -07:00
Huaxin Gao b05f309bc9 [SPARK-32140][ML][PYSPARK] Add training summary to FMClassificationModel
### What changes were proposed in this pull request?
Add training summary for FMClassificationModel...
### Why are the changes needed?
So that users can get the training process status, such as the loss value of each iteration and the total number of iterations.

### Does this PR introduce _any_ user-facing change?
Yes
FMClassificationModel.summary
FMClassificationModel.evaluate

### How was this patch tested?
new tests

Closes #28960 from huaxingao/fm_summary.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Huaxin Gao <huaxing@us.ibm.com>
2020-07-15 10:13:03 -07:00
Erik Krogen cf22d947fb [SPARK-32036] Replace references to blacklist/whitelist language with more appropriate terminology, excluding the blacklisting feature
### What changes were proposed in this pull request?

This PR will remove references to these "blacklist" and "whitelist" terms besides the blacklisting feature as a whole, which can be handled in a separate JIRA/PR.

This touches quite a few files, but the changes are straightforward (variable/method/etc. name changes) and most quite self-contained.

### Why are the changes needed?

As per discussion on the Spark dev list, it will be beneficial to remove references to problematic language that can alienate potential community members. One such reference is "blacklist" and "whitelist". While it seems to me that there is some valid debate as to whether these terms have racist origins, the cultural connotations are inescapable in today's world.

### Does this PR introduce _any_ user-facing change?

In the test file `HiveQueryFileTest`, a developer has the ability to specify the system property `spark.hive.whitelist` to specify a list of Hive query files that should be tested. This system property has been renamed to `spark.hive.includelist`. The old property has been kept for compatibility, but will log a warning if used. I am open to feedback from others on whether keeping a deprecated property here is unnecessary given that this is just for developers running tests.

### How was this patch tested?

Existing tests should be suitable since no behavior changes are expected as a result of this PR.

Closes #28874 from xkrogen/xkrogen-SPARK-32036-rename-blacklists.

Authored-by: Erik Krogen <ekrogen@linkedin.com>
Signed-off-by: Thomas Graves <tgraves@apache.org>
2020-07-15 11:40:55 -05:00
Dongjoon Hyun 8950dcbb1c [SPARK-32318][SQL][TESTS] Add a test case to EliminateSortsSuite for ORDER BY in DISTRIBUTE BY
### What changes were proposed in this pull request?

This PR aims to add a test case to EliminateSortsSuite to protect a valid use case which is using ORDER BY in DISTRIBUTE BY statement.

### Why are the changes needed?

```scala
scala> scala.util.Random.shuffle((1 to 100000).map(x => (x % 2, x))).toDF("a", "b").repartition(2).createOrReplaceTempView("t")

scala> sql("select * from (select * from t order by b) distribute by a").write.orc("/tmp/master")

$ ls -al /tmp/master/
total 56
drwxr-xr-x  10 dongjoon  wheel  320 Jul 14 22:12 ./
drwxrwxrwt  15 root      wheel  480 Jul 14 22:12 ../
-rw-r--r--   1 dongjoon  wheel    8 Jul 14 22:12 ._SUCCESS.crc
-rw-r--r--   1 dongjoon  wheel   12 Jul 14 22:12 .part-00000-2cd3a50e-eded-49a4-b7cf-94e3f090b8c1-c000.snappy.orc.crc
-rw-r--r--   1 dongjoon  wheel   16 Jul 14 22:12 .part-00043-2cd3a50e-eded-49a4-b7cf-94e3f090b8c1-c000.snappy.orc.crc
-rw-r--r--   1 dongjoon  wheel   16 Jul 14 22:12 .part-00191-2cd3a50e-eded-49a4-b7cf-94e3f090b8c1-c000.snappy.orc.crc
-rw-r--r--   1 dongjoon  wheel    0 Jul 14 22:12 _SUCCESS
-rw-r--r--   1 dongjoon  wheel  119 Jul 14 22:12 part-00000-2cd3a50e-eded-49a4-b7cf-94e3f090b8c1-c000.snappy.orc
-rw-r--r--   1 dongjoon  wheel  932 Jul 14 22:12 part-00043-2cd3a50e-eded-49a4-b7cf-94e3f090b8c1-c000.snappy.orc
-rw-r--r--   1 dongjoon  wheel  939 Jul 14 22:12 part-00191-2cd3a50e-eded-49a4-b7cf-94e3f090b8c1-c000.snappy.orc
```

The following was found during SPARK-32276. If the Spark optimizer removes the inner `ORDER BY`, the file size increases.
```scala
scala> scala.util.Random.shuffle((1 to 100000).map(x => (x % 2, x))).toDF("a", "b").repartition(2).createOrReplaceTempView("t")

scala> sql("select * from (select * from t order by b) distribute by a").write.orc("/tmp/SPARK-32276")

$ ls -al /tmp/SPARK-32276/
total 632
drwxr-xr-x  10 dongjoon  wheel     320 Jul 14 22:08 ./
drwxrwxrwt  14 root      wheel     448 Jul 14 22:08 ../
-rw-r--r--   1 dongjoon  wheel       8 Jul 14 22:08 ._SUCCESS.crc
-rw-r--r--   1 dongjoon  wheel      12 Jul 14 22:08 .part-00000-ba5049f9-b835-49b7-9fdb-bdd11b9891cb-c000.snappy.orc.crc
-rw-r--r--   1 dongjoon  wheel    1188 Jul 14 22:08 .part-00043-ba5049f9-b835-49b7-9fdb-bdd11b9891cb-c000.snappy.orc.crc
-rw-r--r--   1 dongjoon  wheel    1188 Jul 14 22:08 .part-00191-ba5049f9-b835-49b7-9fdb-bdd11b9891cb-c000.snappy.orc.crc
-rw-r--r--   1 dongjoon  wheel       0 Jul 14 22:08 _SUCCESS
-rw-r--r--   1 dongjoon  wheel     119 Jul 14 22:08 part-00000-ba5049f9-b835-49b7-9fdb-bdd11b9891cb-c000.snappy.orc
-rw-r--r--   1 dongjoon  wheel  150735 Jul 14 22:08 part-00043-ba5049f9-b835-49b7-9fdb-bdd11b9891cb-c000.snappy.orc
-rw-r--r--   1 dongjoon  wheel  150741 Jul 14 22:08 part-00191-ba5049f9-b835-49b7-9fdb-bdd11b9891cb-c000.snappy.orc
```

### Does this PR introduce _any_ user-facing change?

No. This only improves the test coverage.

### How was this patch tested?

Pass the GitHub Action or Jenkins.

Closes #29118 from dongjoon-hyun/SPARK-32318.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-07-15 07:43:56 -07:00
Dilip Biswal e4499932da [SPARK-31480][SQL] Improve the EXPLAIN FORMATTED's output for DSV2's Scan Node
### What changes were proposed in this pull request?
Improve the EXPLAIN FORMATTED output of DSV2 Scan nodes (file based ones).

**Before**
```
== Physical Plan ==
* Project (4)
+- * Filter (3)
   +- * ColumnarToRow (2)
      +- BatchScan (1)

(1) BatchScan
Output [2]: [value#7, id#8]
Arguments: [value#7, id#8], ParquetScan(org.apache.spark.sql.test.TestSparkSession17477bbb,Configuration: core-default.xml, core-site.xml, mapred-default.xml, mapred-site.xml, yarn-default.xml, yarn-site.xml, hdfs-default.xml, hdfs-site.xml, __spark_hadoop_conf__.xml,org.apache.spark.sql.execution.datasources.InMemoryFileIndexa6c363ce,StructType(StructField(value,IntegerType,true)),StructType(StructField(value,IntegerType,true)),StructType(StructField(id,IntegerType,true)),[Lorg.apache.spark.sql.sources.Filter;40fee459,org.apache.spark.sql.util.CaseInsensitiveStringMapfeca1ec6,Vector(isnotnull(id#8), (id#8 > 1)),List(isnotnull(value#7), (value#7 > 2)))
(2) ...
(3) ...
(4) ...
```
**After**
```
== Physical Plan ==
* Project (4)
+- * Filter (3)
   +- * ColumnarToRow (2)
      +- BatchScan (1)

(1) BatchScan
Output [2]: [value#7, id#8]
DataFilters: [isnotnull(value#7), (value#7 > 2)]
Format: parquet
Location: InMemoryFileIndex[....]
PartitionFilters: [isnotnull(id#8), (id#8 > 1)]
PushedFilers: [IsNotNull(id), IsNotNull(value), GreaterThan(id,1), GreaterThan(value,2)]
ReadSchema: struct<value:int>
(2) ...
(3) ...
(4) ...
```
### Why are the changes needed?
The old format is not very readable. This improves the readability of the plan.

### Does this PR introduce any user-facing change?
Yes. the explain output will be different.

### How was this patch tested?
Added a test case in ExplainSuite.

Closes #28425 from dilipbiswal/dkb_dsv2_explain.

Lead-authored-by: Dilip Biswal <dkbiswal@gmail.com>
Co-authored-by: Dilip Biswal <dkbiswal@apache.org>
Signed-off-by: Dilip Biswal <dkbiswal@apache.org>
2020-07-15 01:28:39 -07:00
Dongjoon Hyun 2527fbc896 Revert "[SPARK-32276][SQL] Remove redundant sorts before repartition nodes"
This reverts commit af8e65fca9.
2020-07-14 22:14:31 -07:00
Jungtaek Lim (HeartSaVioR) 542aefb4c4 [SPARK-31985][SS] Remove incomplete/undocumented stateful aggregation in continuous mode
### What changes were proposed in this pull request?

This removes the undocumented and incomplete feature of "stateful aggregation" in continuous mode, which would reduce 1100+ lines of code.

### Why are the changes needed?

The work on the feature has been stopped for over a year, and no one has asked/requested the availability of such a feature in the community. The current state of the feature is that it only works with `coalesce(1)`, which forces the query to read, process, and write in a single task, which doesn't make sense in production.

The remaining code increases the work on DSv2 changes as well - that's why I don't simply propose reverting relevant commits - the code path has been changed due to DSv2 evolution.

### Does this PR introduce _any_ user-facing change?

Technically no, because it was never documented and can't be used in production in its current shape.

### How was this patch tested?

Existing tests.

Closes #29077 from HeartSaVioR/SPARK-31985.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
2020-07-15 13:40:43 +09:00
Anton Okolnychyi af8e65fca9 [SPARK-32276][SQL] Remove redundant sorts before repartition nodes
### What changes were proposed in this pull request?

This PR removes redundant sorts before repartition nodes with shuffles and repartitionByExpression with deterministic expressions.

### Why are the changes needed?

It looks like our `EliminateSorts` rule can be extended further to remove sorts before repartition nodes that shuffle data as such repartition operations change the ordering and distribution of data. That's why it seems safe to perform the following rewrites:
- `Repartition -> Sort -> Scan` as `Repartition -> Scan`
- `Repartition -> Project -> Sort -> Scan` as `Repartition -> Project -> Scan`

We don't apply this optimization to coalesce as it uses `DefaultPartitionCoalescer` that may preserve the ordering of data if there is no locality info in the parent RDD. At the same time, there is no guarantee that will happen.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

More test cases.

Closes #29089 from aokolnychyi/spark-32276.

Authored-by: Anton Okolnychyi <aokolnychyi@apple.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-07-14 21:17:33 -07:00
HyukjinKwon 6bdd710c4d [SPARK-32316][TESTS][INFRA] Test PySpark with Python 3.8 in Github Actions
### What changes were proposed in this pull request?

This PR aims to test PySpark with Python 3.8 in GitHub Actions. On the script side, it is already ready:

4ad9bfd53b/python/run-tests.py (L161)

This PR includes small related fixes together:

1. Install Python 3.8
2. Only install one Python implementation instead of installing many for the SQL and Yarn test cases, because their test cases just need one Python executable that is newer than Python 2.
3. Do not install Python 2 which is not needed anymore after we dropped Python 2 at SPARK-32138
4. Remove a comment about installing PyPy3 on Jenkins - SPARK-32278. It is already installed.

### Why are the changes needed?

Currently, only PyPy3 and Python 3.6 are being tested with PySpark in Github Actions. We should test the latest version of Python as well because some optimizations can be only enabled with Python 3.8+. See also https://github.com/apache/spark/pull/29114

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

Was not tested. Github Actions build in this PR will test it out.

Closes #29116 from HyukjinKwon/test-python3.8-togehter.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-07-14 20:44:09 -07:00
HyukjinKwon 03b5707b51 [MINOR][R] Match collectAsArrowToR with non-streaming collectAsArrowToPython
### What changes were proposed in this pull request?

This PR proposes to port forward #29098 to `collectAsArrowToR`. `collectAsArrowToR` follows `collectAsArrowToPython` in branch-2.4 due to the limitation of ARROW-4512. SparkR vectorization currently cannot use streaming format.

### Why are the changes needed?

For simplicity and consistency.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

The same code is being tested in `collectAsArrowToPython` of branch-2.4.

Closes #29100 from HyukjinKwon/minor-parts.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-07-15 08:46:20 +09:00
HyukjinKwon 676d92ecce [SPARK-32301][PYTHON][TESTS] Add a test case for toPandas to work with empty partitioned Spark DataFrame
### What changes were proposed in this pull request?

This PR proposes to port the test case from https://github.com/apache/spark/pull/29098 to branch-3.0 and master. In master and branch-3.0, this was fixed together at ecaa495b1f, but the no-partition case is not being tested.

### Why are the changes needed?

To improve test coverage.

### Does this PR introduce _any_ user-facing change?

No, test-only.

### How was this patch tested?

Unit test was forward-ported.

Closes #29099 from HyukjinKwon/SPARK-32300-1.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-07-15 08:44:48 +09:00
HyukjinKwon 902e1342a3 [SPARK-32303][PYTHON][TESTS] Remove leftover from editable mode installation in PIP test
### What changes were proposed in this pull request?

Currently the Jenkins PIP packaging test fails intermittently as below:

```
Installing dist into virtual env
Processing ./python/dist/pyspark-3.1.0.dev0.tar.gz
Collecting py4j==0.10.9 (from pyspark==3.1.0.dev0)
  Downloading 6a4fb90cd2/py4j-0.10.9-py2.py3-none-any.whl (198kB)
Installing collected packages: py4j, pyspark
  Found existing installation: py4j 0.10.9
    Uninstalling py4j-0.10.9:
      Successfully uninstalled py4j-0.10.9
  Found existing installation: pyspark 3.1.0.dev0
Exception:
Traceback (most recent call last):
  File "/home/anaconda/envs/py36/lib/python3.6/site-packages/pip/_internal/cli/base_command.py", line 179, in main
    status = self.run(options, args)
  File "/home/anaconda/envs/py36/lib/python3.6/site-packages/pip/_internal/commands/install.py", line 393, in run
    use_user_site=options.use_user_site,
  File "/home/anaconda/envs/py36/lib/python3.6/site-packages/pip/_internal/req/__init__.py", line 50, in install_given_reqs
    auto_confirm=True
  File "/home/anaconda/envs/py36/lib/python3.6/site-packages/pip/_internal/req/req_install.py", line 816, in uninstall
    uninstalled_pathset = UninstallPathSet.from_dist(dist)
  File "/home/anaconda/envs/py36/lib/python3.6/site-packages/pip/_internal/req/req_uninstall.py", line 505, in from_dist
    '(at %s)' % (link_pointer, dist.project_name, dist.location)
AssertionError: Egg-link /home/jenkins/workspace/SparkPullRequestBuilder3/python does not match installed
```

- https://github.com/apache/spark/pull/29099#issuecomment-658073453 (amp-jenkins-worker-04)
- https://github.com/apache/spark/pull/29090#issuecomment-657819973 (amp-jenkins-worker-03)

It seems like the previous editable-mode installation affects other PRs.

This PR simply works around the issue by removing the symbolic link left by the previous editable installation. This is a common workaround, to my knowledge.

### Why are the changes needed?

To recover the Jenkins build.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

Jenkins build will test it out.

Closes #29102 from HyukjinKwon/SPARK-32303.

Lead-authored-by: HyukjinKwon <gurwls223@apache.org>
Co-authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-07-14 16:43:16 -07:00
Baohe Zhang 90b0c26b22 [SPARK-31608][CORE][WEBUI] Add a new type of KVStore to make loading UI faster
### What changes were proposed in this pull request?
Add a new class HybridStore to make the history server faster when loading event files. When rebuilding the application state from event logs, HybridStore first writes data to InMemoryStore and uses a background thread to dump the data to LevelDB once writing to InMemoryStore is completed. HybridStore makes content serving faster by using more memory. It's only safe to enable it when the cluster is not under heavy load.

### Why are the changes needed?
HybridStore can greatly reduce the event logs loading time, especially for large log files. In general, it has 4x - 6x UI loading speed improvement for large log files. The detailed result is shown in comments.

### Does this PR introduce any user-facing change?
This PR adds new configs `spark.history.store.hybridStore.enabled` and `spark.history.store.hybridStore.maxMemoryUsage`.

### How was this patch tested?
A test suite for HybridStore is added. I also manually tested it on 3.1.0 on mac os.

This is a follow-up for the work done by Hieu Huynh in 2019.

Closes #28412 from baohe-zhang/SPARK-31608.

Authored-by: Baohe Zhang <baohe.zhang@verizonmedia.com>
Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
2020-07-15 07:51:13 +09:00
Fokko Driesprong c602d79f89 [SPARK-32311][PYSPARK][TESTS] Remove duplicate import
### What changes were proposed in this pull request?

`datetime` is already imported a few lines below :)

ce27cc54c1/python/pyspark/sql/tests/test_pandas_udf_scalar.py (L24)

### Why are the changes needed?

This is the last instance of the duplicate import.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual.

Closes #29109 from Fokko/SPARK-32311.

Authored-by: Fokko Driesprong <fokko@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-07-14 12:46:11 -07:00
yangjie01 5e0cb3ee16 [SPARK-32305][BUILD] Make mvn clean remove metastore_db and spark-warehouse
### What changes were proposed in this pull request?

Add additional configuration to `maven-clean-plugin` to ensure cleanup `metastore_db` and `spark-warehouse` directory when execute `mvn clean` command.

### Why are the changes needed?
Spark now supports two versions of built-in Hive, and some test-generated metadata, such as the `metastore_db` directory, is not under the target dir, so it is not cleaned up automatically when we run the `mvn clean` command.

So if we run `mvn clean test -pl sql/hive -am -Phadoop-2.7 -Phive -Phive-1.2`, the `metastore_db` dir will be created and its metadata will remain after the tests complete.

Then we need to manually clean up the `metastore_db` directory to ensure that the `mvn clean test -pl sql/hive -am -Phadoop-2.7 -Phive` command, which uses the Hive 2.3 profile, can succeed, because the residual metastore data is not compatible.

`spark-warehouse` will also cause test failures in some data-residual scenarios because test cases assume that the metadata should not exist.

This PR simplifies the manual cleanup of the `metastore_db` and `spark-warehouse` directories.

### How was this patch tested?

Manual execute `mvn clean test -pl sql/hive -am -Phadoop-2.7 -Phive -Phive-1.2`, then execute `mvn clean test -pl sql/hive -am -Phadoop-2.7 -Phive`, both commands should succeed.

Closes #29103 from LuciferYang/add-clean-directory.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-07-14 12:40:47 -07:00
Fokko Driesprong 2a0faca830 [SPARK-32309][PYSPARK] Import missing sys import
### What changes were proposed in this pull request?

While seeing if we can use mypy to check the Python types, I stumbled across this missing import:
34fa913311/python/pyspark/ml/feature.py (L5773-L5774)

### Why are the changes needed?

The `import` is required because it's used.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual.

Closes #29108 from Fokko/SPARK-32309.

Authored-by: Fokko Driesprong <fokko@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-07-14 12:29:56 -07:00
yi.wu a47b69a88a [SPARK-32307][SQL] ScalaUDF's canonicalized expression should exclude inputEncoders
### What changes were proposed in this pull request?

Override `canonicalized` to empty the `inputEncoders` for the canonicalized `ScalaUDF`.

### Why are the changes needed?

The following currently fails on `branch-3.0`, but not on the Apache Spark 3.0.0 release.

```scala
spark.udf.register("key", udf((m: Map[String, String]) => m.keys.head.toInt))
Seq(Map("1" -> "one", "2" -> "two")).toDF("a").createOrReplaceTempView("t")
checkAnswer(sql("SELECT key(a) AS k FROM t GROUP BY key(a)"), Row(1) :: Nil)

[info]   org.apache.spark.sql.AnalysisException: expression 't.`a`' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;;
[info] Aggregate [UDF(a#6)], [UDF(a#6) AS k#8]
[info] +- SubqueryAlias t
[info]    +- Project [value#3 AS a#6]
[info]       +- LocalRelation [value#3]
[info]   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis(CheckAnalysis.scala:49)
[info]   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis$(CheckAnalysis.scala:48)
[info]   at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:130)
[info]   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkValidAggregateExpression$1(CheckAnalysis.scala:257)
[info]   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$10(CheckAnalysis.scala:259)
[info]   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$10$adapted(CheckAnalysis.scala:259)
[info]   at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
[info]   at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
[info]   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
[info]   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkValidAggregateExpression$1(CheckAnalysis.scala:259)
[info]   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$10(CheckAnalysis.scala:259)
[info]   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$10$adapted(CheckAnalysis.scala:259)
[info]   at scala.collection.immutable.List.foreach(List.scala:392)
[info]   at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkValidAggregateExpression$1(CheckAnalysis.scala:259)
...
```

We use the rule `ResolveEncodersInUDF` to resolve `inputEncoders`, and the original `ScalaUDF` instance is updated to a new `ScalaUDF` instance with the resolved encoders at the end. Note that during encoder resolution, types like `map` and `array` are resolved to new expressions (e.g. `MapObjects`, `CatalystToExternalMap`).

However, `ExpressionEncoder` can't be canonicalized. Thus, the canonicalized `ScalaUDF`s become different even if their original `ScalaUDF`s are the same. As a result, `checkValidAggregateExpression` fails when such a `ScalaUDF` is used as a grouping expression.

### Does this PR introduce _any_ user-facing change?

Yes, users will not hit the exception after this fix.

### How was this patch tested?

Added tests.

Closes #29106 from Ngone51/spark-32307.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-07-14 12:19:01 -07:00
Sean Owen d6a68e0b67 [SPARK-29292][STREAMING][SQL][BUILD] Get streaming, catalyst, sql compiling for Scala 2.13
### What changes were proposed in this pull request?

Continuation of https://github.com/apache/spark/pull/28971 which lets streaming, catalyst and sql compile for 2.13. Same idea.

### Why are the changes needed?

Eventually, we need to support a Scala 2.13 build, perhaps in Spark 3.1.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Existing tests. (2.13 was not tested; this is about getting it to compile without breaking 2.12)

Closes #29078 from srowen/SPARK-29292.2.

Authored-by: Sean Owen <srowen@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-07-14 02:06:50 -07:00
Liang-Chi Hsieh cc9371d885 [SPARK-32258][SQL] Not duplicate normalization on children for float/double If/CaseWhen/Coalesce
### What changes were proposed in this pull request?

This is followup to #29061. See https://github.com/apache/spark/pull/29061#discussion_r453458611. Basically this moves If/CaseWhen/Coalesce case patterns after float/double case so we don't duplicate normalization on children for float/double If/CaseWhen/Coalesce.

### Why are the changes needed?

Simplify expression tree.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Modify unit tests.

Closes #29091 from viirya/SPARK-32258-followup.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-07-14 05:51:59 +00:00
Peter Toth 24be81689c [SPARK-32241][SQL] Remove empty children of union
### What changes were proposed in this pull request?
This PR removes the empty child relations of a `Union`.

E.g. the query `SELECT c FROM t UNION ALL SELECT c FROM t WHERE false` has the following plan before this PR:
```
== Physical Plan ==
Union
:- *(1) Project [value#219 AS c#222]
:  +- *(1) LocalTableScan [value#219]
+- LocalTableScan <empty>, [c#224]
```
and after this PR:
```
== Physical Plan ==
*(1) Project [value#219 AS c#222]
+- *(1) LocalTableScan [value#219]
```
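
A hedged sketch (in spark-shell) reproducing the example query:

```scala
// Sketch only: the second branch is statically empty, so Union can drop it.
spark.range(3).selectExpr("id AS c").createOrReplaceTempView("t")
val df = spark.sql("SELECT c FROM t UNION ALL SELECT c FROM t WHERE false")
df.explain()  // after this PR, only the first branch remains in the physical plan
```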

### Why are the changes needed?
To have a simpler plan.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Added new UTs.

Closes #29053 from peter-toth/SPARK-32241-remove-empty-children-of-union.

Authored-by: Peter Toth <peter.toth@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-07-14 04:26:29 +00:00
HyukjinKwon 4ad9bfd53b [SPARK-32138] Drop Python 2.7, 3.4 and 3.5
### What changes were proposed in this pull request?

This PR aims to drop Python 2.7, 3.4 and 3.5.

Roughly speaking, it removes all the widely known Python 2 compatibility workarounds, such as `sys.version` comparisons and `__future__` imports. It also removes Python-2-dedicated code such as `ArrayConstructor` in Spark.

### Why are the changes needed?

 1. Drop support for EOL Python versions
 2. Reduce maintenance overhead and remove a bit of legacy code and hacks for Python 2.
 3. PyPy2 has a critical bug that causes a flaky test (SPARK-28358), given my testing and investigation.
 4. Users can use Python type hints with Pandas UDFs without thinking about the Python version.
 5. Users can leverage the latest cloudpickle (https://github.com/apache/spark/pull/28950). With Python 3.8+ it can also leverage the C pickle implementation.

### Does this PR introduce _any_ user-facing change?

Yes, users cannot use Python 2.7, 3.4 and 3.5 in the upcoming Spark version.

### How was this patch tested?

Manually tested and also tested in Jenkins.

Closes #28957 from HyukjinKwon/SPARK-32138.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-07-14 11:22:44 +09:00
Holden Karau 90ac9f975b [SPARK-32004][ALL] Drop references to slave
### What changes were proposed in this pull request?

This change replaces the word "slave" with alternatives matching the context.

### Why are the changes needed?

There is no need to call things "slave"; we might as well use better, clearer names.

### Does this PR introduce _any_ user-facing change?

Yes, the output JSON does change. To allow backwards compatibility this is an additive change.
The shell scripts for starting & stopping workers are renamed, and for backwards compatibility old scripts are added to call through to the new ones while printing a deprecation message to stderr.

### How was this patch tested?

Existing tests.

Closes #28864 from holdenk/SPARK-32004-drop-references-to-slave.

Lead-authored-by: Holden Karau <hkarau@apple.com>
Co-authored-by: Holden Karau <holden@pigscanfly.ca>
Signed-off-by: Holden Karau <hkarau@apple.com>
2020-07-13 14:05:33 -07:00
Hyukjin Kwon 27ef3629dd [SPARK-32292][SPARK-32252][INFRA] Run the relevant tests only in GitHub Actions
### What changes were proposed in this pull request?

This PR mainly proposes to run only the relevant tests, just as the Jenkins PR builder does. Currently, GitHub Actions always runs the full test suite, which wastes resources.

In addition, this PR also fixes three more closely related issues while I am here:

1. The main idea here is to reuse the existing logic embedded in `dev/run-tests.py`, which the Jenkins PR builder uses, in order to run only the related test cases.

2. While I am here, I also fixed SPARK-32292 so that the doctests run. They failed because other git references were not available when the repository was cloned via `checkout@v2`; with `fetch-depth: 0`, the full history is available.

3. It also fixes `dev/run-tests.py` to match `python/run-tests.py` in terms of its options. Environment variables such as `TEST_ONLY_XXX` were converted into proper options. For example,

    ```bash
    dev/run-tests.py --modules sql,core
    ```

    which is consistent with `python/run-tests.py`, for example,

    ```bash
    python/run-tests.py --modules pyspark-core,pyspark-ml
    ```

4. Lastly, it also fixes a formatting issue in the module specification in the matrix:

    ```diff
    -            network_common, network_shuffle, repl, launcher
    +            network-common, network-shuffle, repl, launcher,
    ```

    which previously caused the wrong modules to be built and tested.

### Why are the changes needed?

By running only the related tests, we can save a huge amount of resources and avoid unrelated flaky tests, etc.
Also, the doctests of `dev/run-tests.py` now run properly, the usage is consistent between `dev/run-tests.py` and `python/run-tests.py`, and the `network-common`, `network-shuffle`, `launcher` and `examples` modules are built and tested too.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

Manually tested in my own forked Spark:

https://github.com/HyukjinKwon/spark/pull/7
https://github.com/HyukjinKwon/spark/pull/8
https://github.com/HyukjinKwon/spark/pull/9
https://github.com/HyukjinKwon/spark/pull/10
https://github.com/HyukjinKwon/spark/pull/11
https://github.com/HyukjinKwon/spark/pull/12

Closes #29086 from HyukjinKwon/SPARK-32292.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-07-13 08:31:39 -07:00
angerszhu 5521afbd22 [SPARK-32220][SQL][FOLLOW-UP] SHUFFLE_REPLICATE_NL Hint should not change Non-Cartesian Product join result
### What changes were proposed in this pull request?
Follows up on the comment at https://github.com/apache/spark/pull/29035#discussion_r453468999 by adding an explanation to the PR's code.

### Why are the changes needed?
Adds an explanatory comment.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Not needed.

Closes #29084 from AngersZhuuuu/follow-spark-32220.

Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-07-13 08:23:25 -07:00
angerszhu 6d499647b3 [SPARK-32105][SQL] Refactor current ScriptTransformationExec code
### What changes were proposed in this pull request?

 * Renamed the Hive transform script class `hive/execution/ScriptTransformationExec` to `hive/execution/HiveScriptTransformationExec` (without renaming the file)
 * Extracted a class `BaseScriptTransformationExec` holding the common code shared by `SparkScriptTransformationExec` (added in the next PR) and `HiveScriptTransformationExec`
 * Extracted a class `BaseScriptTransformationWriterThread` for the data-writing thread shared by `SparkScriptTransformationWriterThread` (added next to support transform in sql/core) and `HiveScriptTransformationWriterThread`
 * `HiveScriptTransformationWriterThread` additionally supports the Hive serde format
 * Renamed the current `Script` strategies in the hive module to `HiveScript`; the next PR will add `SparkScript` strategies to support transform in sql/core. A rough sketch of the split follows this list.
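
A rough sketch of the intended class split; the member names and signatures below are invented purely for illustration and are not the actual Spark code:

```scala
// Shared process-handling logic lives in the base class; the Hive flavour layers
// serde handling on top. All members here are hypothetical placeholders.
abstract class BaseScriptTransformationExec {
  def script: String

  // Common logic: launch the external script, feed input rows, read output rows.
  def runScript(inputRows: Iterator[String]): Iterator[String] = {
    // ... start `script`, write inputRows to its stdin, read its stdout ...
    inputRows // placeholder so the sketch compiles
  }
}

// sql/core flavour (added in the follow-up PR): plain text in/out, delimiter-separated.
class SparkScriptTransformationExec(val script: String)
  extends BaseScriptTransformationExec

// Hive flavour: same process handling, plus Hive serde (de)serialization of rows.
class HiveScriptTransformationExec(val script: String, serdeClass: Option[String])
  extends BaseScriptTransformationExec
```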

Todo list:

- Support transform in sql/core based on `BaseScriptTransformationExec`, which would run the script operator in SQL mode (without Hive).
The output of the script is read as a string and column values are extracted using a delimiter (default: the tab character)
- For Hive, by default serdes must be used; without Hive we can run without a serde
- Clean up past hacks that have been observed (and that people suggest / report), such as
       - [Solve string value error about Date/Timestamp in ScriptTransform](https://issues.apache.org/jira/browse/SPARK-31947)
       - [support use transform with aggregation](https://issues.apache.org/jira/browse/SPARK-28227)
       - [support array/map as transform's input](https://issues.apache.org/jira/browse/SPARK-22435)
- Use a code-gen projection to serialize rows to the output stream

### Why are the changes needed?
Support running transform in SQL mode without Hive.

### Does this PR introduce any user-facing change?
Yes

### How was this patch tested?
Added UT

Closes #27983 from AngersZhuuuu/follow_spark_15694.

Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-07-13 08:58:25 +00:00
Liang-Chi Hsieh b6229df16c [SPARK-32258][SQL] NormalizeFloatingNumbers directly normalizes IF/CaseWhen/Coalesce child expressions
### What changes were proposed in this pull request?

This patch proposes to let the `NormalizeFloatingNumbers` rule directly normalize certain child expressions. It can simplify the expression tree.
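
As a rough illustration, using a hypothetical mini expression ADT rather than the actual Catalyst classes, the rule normalizes the value branches of a conditional directly instead of wrapping the whole conditional in a normalization node:

```scala
// Hypothetical expression ADT, for illustration only.
sealed trait Expr
case class Col(name: String) extends Expr
case class If(pred: Expr, trueValue: Expr, falseValue: Expr) extends Expr
case class Normalize(child: Expr) extends Expr // stand-in for NaN / -0.0 normalization

// Before: the conditional is treated as a black box and normalized as a whole.
def normalizeAsBlackBox(e: Expr): Expr = Normalize(e)

// After: push normalization into the value branches directly.
def normalizeChildren(e: Expr): Expr = e match {
  case If(p, t, f) => If(p, Normalize(t), Normalize(f))
  case other       => Normalize(other)
}

// normalizeAsBlackBox(If(Col("p"), Col("a"), Col("b")))
//   => Normalize(If(Col("p"), Col("a"), Col("b")))
// normalizeChildren(If(Col("p"), Col("a"), Col("b")))
//   => If(Col("p"), Normalize(Col("a")), Normalize(Col("b")))
```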

### Why are the changes needed?

Currently the NormalizeFloatingNumbers rule treats some expressions as black boxes, but we can optimize it a bit by normalizing their inner child expressions directly.

Also see https://github.com/apache/spark/pull/28962#discussion_r448526240.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit tests.

Closes #29061 from viirya/SPARK-32258.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-07-12 15:34:43 -07:00
Dongjoon Hyun bc3d4bacb5 [SPARK-32245][INFRA][FOLLOWUP] Reenable Github Actions on commit
### What changes were proposed in this pull request?

This PR re-enables GitHub Actions on every commit as a next step.

### Why are the changes needed?

We carefully enabled GitHub Actions on every PR, and it looks good so far.

As we saw at https://github.com/apache/spark/pull/29072, GitHub Actions is already triggered on every commit of every PR. Enabling GitHub Actions on `master` branch commits doesn't make a big difference, and we need to start testing every commit as a next step.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual.

Closes #29076 from dongjoon-hyun/reenable_gha_commit.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-07-12 14:50:47 -07:00
Frank Yin ad90cbff42 [SPARK-31831][SQL][TESTS] Use subclasses for mock in HiveSessionImplSuite
### What changes were proposed in this pull request?
Fix the flaky test org.apache.spark.sql.hive.thriftserver.HiveSessionImplSuite by using subclasses instead of runtime-generated mocks, avoiding a classloader issue.
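
A minimal sketch of the general technique, with a hypothetical trait and names that are not the actual suite code: a hand-written stub subclass replaces a runtime-generated mock.

```scala
// Hypothetical interface standing in for the class that was previously mocked.
trait SessionHandleSource {
  def openSession(user: String): String
  def closeSession(id: String): Unit
}

// Instead of a dynamically generated mock, the test uses an ordinary subclass
// compiled together with the test code, overriding only what the assertions need.
class StubSessionHandleSource extends SessionHandleSource {
  var closed: List[String] = Nil
  override def openSession(user: String): String = s"session-$user"
  override def closeSession(id: String): Unit = closed = id :: closed
}
```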

### Why are the changes needed?
It causes build instability.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
It is a fix for a flaky test, so it needs to be run multiple times against Jenkins.

Closes #29069 from frankyin-factual/hive-tests.

Authored-by: Frank Yin <frank@factual.com>
Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
2020-07-13 05:04:47 +09:00
HyukjinKwon c4b0639f83 [SPARK-32270][SQL] Use TextFileFormat in CSV's schema inference with a different encoding
### What changes were proposed in this pull request?

This PR proposes to use the text datasource in CSV's schema inference. This shares the same rationale as SPARK-18362, SPARK-19885 and SPARK-19918: we are currently using a Hadoop RDD when the encoding is different from UTF-8, which is unnecessary. This PR completes SPARK-18362 and addresses the comment at https://github.com/apache/spark/pull/15813#discussion_r90751405.

We had better keep the code paths consistent with the existing CSV and JSON datasources as well; currently, CSV schema inference with an encoding other than UTF-8 takes a different path.

This PR may also amount to a bug fix: Spark session configurations, such as Hadoop configurations, are not respected during CSV schema inference when a different encoding is used, because they currently have to be set on the Spark context for that code path.
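
For reference, the affected code path is schema inference over a CSV file in a non-UTF-8 encoding; the path and encoding below are illustrative only:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]")
  .appName("csv-schema-inference")
  .getOrCreate()

// Schema inference with an explicit non-UTF-8 encoding: this is the path that
// now goes through the text datasource instead of a raw Hadoop RDD.
val df = spark.read
  .option("header", "true")
  .option("inferSchema", "true")
  .option("encoding", "ISO-8859-1") // also accepted via the `charset` option
  .csv("/tmp/people-latin1.csv")    // illustrative path

df.printSchema()
```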

### Why are the changes needed?

For consistency, potentially better performance, and to fix a potential corner-case bug.

### Does this PR introduce _any_ user-facing change?

Virtually no.

### How was this patch tested?

Existing tests should cover this.

Closes #29063 from HyukjinKwon/SPARK-32270.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-07-12 09:44:27 -07:00
Chuliang Xiao c56c84af47 [MINOR][DOCS] Fix typo in PySpark example in ml-datasource.md
### What changes were proposed in this pull request?

This PR changes `true` to `True` in the Python code.

### Why are the changes needed?

The previous example was not valid Python, since the boolean literal must be written `True`.

### Does this PR introduce _any_ user-facing change?

Yes, but this is a doc-only typo fix.

### How was this patch tested?

Manually ran the example.

Closes #29073 from ChuliangXiao/patch-1.

Authored-by: Chuliang Xiao <ChuliangX@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-07-12 09:01:41 -07:00
Michael Chirico 6ae400ccbe [MINOR][SQL][DOCS] consistency in argument naming for time functions
### What changes were proposed in this pull request?

Rename the documented argument `format` to `fmt`, to match the argument name used by several other SQL date/time functions; to wit, `date_format`, `date_trunc`, `trunc`, `to_date`, and `to_timestamp` all use `fmt`. Also, `format_string` and `printf` use the same abbreviation in their `strfmt` argument.
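
For example, here is how the `fmt` argument appears for two of these functions when run in `spark-shell` (where `spark` is predefined); the literal values are illustrative only:

```scala
// The second argument of each call is the one the docs now consistently name `fmt`.
spark.sql("SELECT date_format(timestamp '2020-07-12 09:53:27', 'yyyy-MM-dd HH:mm')").show()
spark.sql("SELECT to_timestamp('12/07/2020 09:53', 'dd/MM/yyyy HH:mm')").show()
```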

### Why are the changes needed?

Consistency -- I was scouring the documentation for functions whose arguments use Java string formatting, and it would have been nice to rely on searching for `fmt` instead of my more manual approach.

### Does this PR introduce _any_ user-facing change?

In the documentation only

### How was this patch tested?

No tests

Closes #29007 from MichaelChirico/sql-doc-format-fmt.

Authored-by: Michael Chirico <michael.chirico@grabtaxi.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-07-12 09:53:27 -05:00