ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Kent Yao	44bd36ad7b	[SPARK-31234][SQL] ResetCommand should reset config to sc.conf only ### What changes were proposed in this pull request? Currently, ResetCommand clear all configurations, including sql configs, static sql configs and spark context level configs. for example: ```sql spark-sql> set xyz=abc; xyz abc spark-sql> set; spark.app.id local-1585055396930 spark.app.name SparkSQL::10.242.189.214 spark.driver.host 10.242.189.214 spark.driver.port 65094 spark.executor.id driver spark.jars spark.master local[*] spark.sql.catalogImplementation hive spark.sql.hive.version 1.2.1 spark.submit.deployMode client xyz abc spark-sql> reset; spark-sql> set; spark-sql> set spark.sql.hive.version; spark.sql.hive.version 1.2.1 spark-sql> set spark.app.id; spark.app.id <undefined> ``` In this PR, we restore spark confs to RuntimeConfig after it is cleared ### Why are the changes needed? reset command overkills configs which are static. ### Does this PR introduce any user-facing change? yes, the ResetCommand do not change static configs now ### How was this patch tested? add ut Closes #28003 from yaooqinn/SPARK-31234. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-03-26 15:03:16 +08:00
Huaxin Gao	ee6f8991a7	[SPARK-30934][ML][FOLLOW-UP] Update ml-guide to include MulticlassClassificationEvaluator weight support in highlights ### What changes were proposed in this pull request? Update ml-guide to include ```MulticlassClassificationEvaluator``` weight support in highlights ### Why are the changes needed? ```MulticlassClassificationEvaluator``` weight support is very important, so should include it in highlights ### Does this PR introduce any user-facing change? Yes after: ![image](https://user-images.githubusercontent.com/13592258/77614952-6ccd8680-6eeb-11ea-9354-fa20004132df.png) ### How was this patch tested? manually build and check Closes #28031 from huaxingao/highlights-followup. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: zhengruifeng <ruifengz@foxmail.com>	2020-03-26 14:24:53 +08:00
Huaxin Gao	d81df56f2d	[SPARK-31223][ML] Set seed in np.random to regenerate test data ### What changes were proposed in this pull request? https://issues.apache.org/jira/browse/SPARK-31223 set seed in np.random when generating test data...... ### Why are the changes needed? so the same set of test data can be regenerated later. ### Does this PR introduce any user-facing change? No ### How was this patch tested? exiting tests Closes #27994 from huaxingao/spark-31223. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: zhengruifeng <ruifengz@foxmail.com>	2020-03-26 13:53:31 +08:00
Maxim Gekk	cec9604eae	[SPARK-31237][SQL][TESTS] Replace 3-letter time zones by zone offsets ### What changes were proposed in this pull request? In the PR, I propose to add a few `ZoneId` constant values to the `DateTimeTestUtils` object, and reuse the constants in tests. Proposed the following constants: - PST = -08:00 - UTC = +00:00 - CEST = +02:00 - CET = +01:00 - JST = +09:00 - MIT = -09:30 - LA = America/Los_Angeles ### Why are the changes needed? All proposed constant values (except `LA`) are initialized by zone offsets according to their definitions. This will allow to avoid: - Using of 3-letter time zones that have been already deprecated in JDK, see _Three-letter time zone IDs_ in https://docs.oracle.com/javase/8/docs/api/java/util/TimeZone.html - Incorrect mapping of 3-letter time zones to zone offsets, see SPARK-31237. For example, `PST` is mapped to `America/Los_Angeles` instead of the `-08:00` zone offset. Also this should improve stability and maintainability of test suites. ### Does this PR introduce any user-facing change? No ### How was this patch tested? By running affected test suites. Closes #28001 from MaxGekk/replace-pst. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-03-26 13:36:00 +08:00
Yuanjian Li	0fe203e703	[SPARK-30623][CORE] Spark external shuffle allow disable of separate event loop group ### What changes were proposed in this pull request? Fix the regression caused by #22173. The original PR changes the logic of handling `ChunkFetchReqeust` from async to sync, that's causes the shuffle benchmark regression. This PR fixes the regression back to the async mode by reusing the config `spark.shuffle.server.chunkFetchHandlerThreadsPercent`. When the user sets the config, ChunkFetchReqeust will be processed in a separate event loop group, otherwise, the code path is exactly the same as before. ### Why are the changes needed? Fix the shuffle performance regression described in https://github.com/apache/spark/pull/22173#issuecomment-572459561 ### Does this PR introduce any user-facing change? Yes, this PR disable the separate event loop for FetchRequest by default. ### How was this patch tested? Existing UT. Closes #27665 from xuanyuanking/SPARK-24355-follow. Authored-by: Yuanjian Li <xyliyuanjian@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-03-26 12:37:48 +08:00
Kent Yao	b024a8a69e	[MINOR][DOCS] Fix some links for python api doc ### What changes were proposed in this pull request? the link for `partition discovery` is malformed, because for releases, there will contains` /docs/<version>/` in the full URL. ### Why are the changes needed? fix doc ### Does this PR introduce any user-facing change? no ### How was this patch tested? `SKIP_SCALADOC=1 SKIP_RDOC=1 SKIP_SQLDOC=1 jekyll serve` locally verified Closes #28017 from yaooqinn/doc. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-03-26 13:06:21 +09:00
HyukjinKwon	178d472e1d	[SPARK-31231][BUILD][FOLLOW-UP] Set the upper bound (before 46.1.0) for setuptools in pip package test ## What changes were proposed in this pull request? This PR is a followup of apache/spark#27995. Rather then pining setuptools version, it sets upper bound so Python 3.5 with branch-2.4 tests can pass too. ## Why are the changes needed? To make the CI build stable ## Does this PR introduce any user-facing change? No, dev-only change. ## How was this patch tested? Jenkins will test. Closes #28005 from HyukjinKwon/investigate-pip-packaging-followup. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-03-26 12:33:17 +09:00
Kent Yao	336621e277	[SPARK-31258][BUILD] Pin the avro version in SBT ### What changes were proposed in this pull request? add arvo dep in SparkBuild ### Why are the changes needed? fix sbt unidoc like https://github.com/apache/spark/pull/28017#issuecomment-603828597 ```scala [warn] Multiple main classes detected. Run 'show discoveredMainClasses' to see the list [warn] Multiple main classes detected. Run 'show discoveredMainClasses' to see the list [info] Main Scala API documentation to /home/jenkins/workspace/SparkPullRequestBuilder6/target/scala-2.12/unidoc... [info] Main Java API documentation to /home/jenkins/workspace/SparkPullRequestBuilder6/target/javaunidoc... [error] /home/jenkins/workspace/SparkPullRequestBuilder6/core/src/main/scala/org/apache/spark/serializer/GenericAvroSerializer.scala:123: value createDatumWriter is not a member of org.apache.avro.generic.GenericData [error] writerCache.getOrElseUpdate(schema, GenericData.get.createDatumWriter(schema)) [error] ^ [info] No documentation generated with unsuccessful compiler run [error] one error found ``` ### Does this PR introduce any user-facing change? no ### How was this patch tested? pass jenkins and verify manually with `sbt dependencyTree` ```scala kentyaohulk  ~/spark   dep  build/sbt dependencyTree \| grep avro \| grep -v Resolving [info] +-org.apache.avro:avro-mapred:1.8.2 [info] \| +-org.apache.avro:avro-ipc:1.8.2 [info] \| \| +-org.apache.avro:avro:1.8.2 [info] +-org.apache.avro:avro:1.8.2 [info] \| \| +-org.apache.avro:avro:1.8.2 [info] org.apache.spark:spark-avro_2.12:3.1.0-SNAPSHOT [S] [info] \| \| \| +-org.apache.avro:avro-mapred:1.8.2 [info] \| \| \| \| +-org.apache.avro:avro-ipc:1.8.2 [info] \| \| \| \| \| +-org.apache.avro:avro:1.8.2 [info] \| \| \| +-org.apache.avro:avro:1.8.2 [info] \| \| \| \| \| +-org.apache.avro:avro:1.8.2 ``` Closes #28020 from yaooqinn/dep. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-03-26 10:48:11 +09:00
Dongjoon Hyun	f206bbde3a	[SPARK-31244][K8S][TEST] Use Minio instead of Ceph in K8S DepsTestsSuite ### What changes were proposed in this pull request? This PR (SPARK-31244) replaces `Ceph` with `Minio` in K8S `DepsTestSuite`. ### Why are the changes needed? Currently, `DepsTestsSuite` is using `ceph` for S3 storage. However, the used version and all new releases are broken on new `minikube` releases. We had better use more robust and small one. ``` $ minikube version minikube version: v1.8.2 $ minikube -p minikube docker-env \| source $ docker run -it --rm -e NETWORK_AUTO_DETECT=4 -e RGW_FRONTEND_PORT=8000 -e SREE_PORT=5001 -e CEPH_DEMO_UID=nano -e CEPH_DAEMON=demo ceph/daemon:v4.0.3-stable-4.0-nautilus-centos-7-x86_64 /bin/sh 2020-03-25 04:26:21 /opt/ceph-container/bin/entrypoint.sh: ERROR- it looks like we have not been able to discover the network settings $ docker run -it --rm -e NETWORK_AUTO_DETECT=4 -e RGW_FRONTEND_PORT=8000 -e SREE_PORT=5001 -e CEPH_DEMO_UID=nano -e CEPH_DAEMON=demo ceph/daemon:v4.0.11-stable-4.0-nautilus-centos-7 /bin/sh 2020-03-25 04:20:30 /opt/ceph-container/bin/entrypoint.sh: ERROR- it looks like we have not been able to discover the network settings ``` Also, the image size is unnecessarily big (almost `1GB`) and growing while `minio` is `55.8MB` with the same features. ``` $ docker images \| grep ceph ceph/daemon v4.0.3-stable-4.0-nautilus-centos-7-x86_64 a6a05ccdf924 6 months ago 852MB ceph/daemon v4.0.11-stable-4.0-nautilus-centos-7 87f695550d8e 12 hours ago 901MB $ docker images \| grep minio minio/minio latest 95c226551ea6 5 days ago 55.8MB ``` ### Does this PR introduce any user-facing change? No. (This is a test case change) ### How was this patch tested? Pass the existing Jenkins K8s integration test job and test with the latest minikube. ``` $ minikube version minikube version: v1.8.2 $ kubectl version --short Client Version: v1.17.4 Server Version: v1.17.4 $ NO_MANUAL=1 ./dev/make-distribution.sh --r --pip --tgz -Pkubernetes $ resource-managers/kubernetes/integration-tests/dev/dev-run-integration-tests.sh --spark-tgz $PWD/spark-.tgz ... KubernetesSuite: - Run SparkPi with no resources - Run SparkPi with a very long application name. - Use SparkLauncher.NO_RESOURCE - Run SparkPi with a master URL without a scheme. - Run SparkPi with an argument. - Run SparkPi with custom labels, annotations, and environment variables. - All pods have the same service account by default - Run extraJVMOptions check on driver - Run SparkRemoteFileTest using a remote data file - Run SparkPi with env and mount secrets. - Run PySpark on simple pi.py example - Run PySpark with Python2 to test a pyfiles example - Run PySpark with Python3 to test a pyfiles example - Run PySpark with memory customization - Run in client mode. - Start pod creation from template - PVs with local storage FAILED * // This is irrelevant to this PR. - Launcher client dependencies // This is the fixed test case by this PR. - Test basic decommissioning - Run SparkR on simple dataframe.R example Run completed in 12 minutes, 4 seconds. ... ``` The following is the working snapshot of `DepsTestSuite` test. ``` $ kubectl get all -ncf9438dd8a65436686b1196a6b73000f NAME READY STATUS RESTARTS AGE pod/minio-0 1/1 Running 0 70s pod/spark-test-app-8494bddca3754390b9e59a2ef47584eb 1/1 Running 0 55s NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE service/minio-s3 NodePort 10.109.54.180 <none> 9000:30678/TCP 70s service/spark-test-app-fd916b711061c7b8-driver-svc ClusterIP None <none> 7078/TCP,7079/TCP,4040/TCP 55s NAME READY AGE statefulset.apps/minio 1/1 70s ``` Closes #28015 from dongjoon-hyun/SPARK-31244. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-03-25 12:38:15 -07:00
Wenchen Fan	4f274a4de9	[SPARK-31147][SQL] Forbid CHAR type in non-Hive-Serde tables ### What changes were proposed in this pull request? Spark introduced CHAR type for hive compatibility but it only works for hive tables. CHAR type is never documented and is treated as STRING type for non-Hive tables. However, this leads to confusing behaviors Apache Spark 3.0.0-preview2 ``` spark-sql> CREATE TABLE t(a CHAR(3)); spark-sql> INSERT INTO TABLE t SELECT 'a '; spark-sql> SELECT a, length(a) FROM t; a 2 ``` Apache Spark 2.4.5 ``` spark-sql> CREATE TABLE t(a CHAR(3)); spark-sql> INSERT INTO TABLE t SELECT 'a '; spark-sql> SELECT a, length(a) FROM t; a 3 ``` According to the SQL standard, `CHAR(3)` should guarantee all the values are of length 3. Since `CHAR(3)` is treated as STRING so Spark doesn't guarantee it. This PR forbids CHAR type in non-Hive tables as it's not supported correctly. ### Why are the changes needed? avoid confusing/wrong behavior ### Does this PR introduce any user-facing change? yes, now users can't create/alter non-Hive tables with CHAR type. ### How was this patch tested? new tests Closes #27902 from cloud-fan/char. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-03-25 09:25:55 -07:00
Takeshi Yamamuro	da49f50621	[SPARK-25121][SQL][FOLLOWUP] Add more unit tests for multi-part identifiers in join strategy hints ### What changes were proposed in this pull request? This pr intends to add unit tests for the other join hints (`MERGEJOIN`, `SHUFFLE_HASH`, and `SHUFFLE_REPLICATE_NL`). This is a followup PR of #27935. ### Why are the changes needed? For better test coverage. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Added unit tests. Closes #28013 from maropu/SPARK-25121-FOLLOWUP. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-03-25 08:37:28 -07:00
Maxim Gekk	27d53de10f	[SPARK-31232][SQL][DOCS] Specify formats of `spark.sql.session.timeZone` ### What changes were proposed in this pull request? In the PR, I propose to update the doc for `spark.sql.session.timeZone`, and restrict format of config's values to 2 forms: 1. Geographical regions, such as `America/Los_Angeles`. 2. Fixed offsets - a fully resolved offset from UTC. For example, `-08:00`. ### Why are the changes needed? Other formats such as three-letter time zone IDs are ambitious, and depend on the locale. For example, `CST` could be U.S. `Central Standard Time` and `China Standard Time`. Such formats have been already deprecated in JDK, see [Three-letter time zone IDs](https://docs.oracle.com/javase/8/docs/api/java/util/TimeZone.html). ### Does this PR introduce any user-facing change? No ### How was this patch tested? By running `./dev/scalastyle`, and manual testing. Closes #27999 from MaxGekk/doc-session-time-zone. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-03-25 16:32:28 +08:00
samsetegne	44431d4b1a	[SPARK-30822][SQL] Remove semicolon at the end of a sql query # What changes were proposed in this pull request? This change proposes ignoring a terminating semicolon from queries submitted by the user (if included) instead of raising a parse exception. # Why are the changes needed? When a user submits a directly executable SQL statement terminated with a semicolon, they receive an `org.apache.spark.sql.catalyst.parser.ParseException` of `extraneous input ';' expecting <EOF>`. SQL-92 describes a direct SQL statement as having the format of `<directly executable statement> <semicolon>` and the majority of SQL implementations either require the semicolon as a statement terminator, or make it optional (meaning not raising an exception when it's included, seemingly in recognition that it's a common behavior). # Does this PR introduce any user-facing change? No # How was this patch tested? Unit test added to `PlanParserSuite` ``` sbt> project catalyst sbt> testOnly *PlanParserSuite [info] - case insensitive (565 milliseconds) [info] - explain (9 milliseconds) [info] - set operations (41 milliseconds) [info] - common table expressions (31 milliseconds) [info] - simple select query (47 milliseconds) [info] - hive-style single-FROM statement (11 milliseconds) [info] - multi select query (32 milliseconds) [info] - query organization (41 milliseconds) [info] - insert into (12 milliseconds) [info] - aggregation (24 milliseconds) [info] - limit (11 milliseconds) [info] - window spec (11 milliseconds) [info] - lateral view (17 milliseconds) [info] - joins (62 milliseconds) [info] - sampled relations (11 milliseconds) [info] - sub-query (11 milliseconds) [info] - scalar sub-query (9 milliseconds) [info] - table reference (2 milliseconds) [info] - table valued function (8 milliseconds) [info] - SPARK-20311 range(N) as alias (2 milliseconds) [info] - SPARK-20841 Support table column aliases in FROM clause (3 milliseconds) [info] - SPARK-20962 Support subquery column aliases in FROM clause (4 milliseconds) [info] - SPARK-20963 Support aliases for join relations in FROM clause (3 milliseconds) [info] - inline table (23 milliseconds) [info] - simple select query with !> and !< (5 milliseconds) [info] - select hint syntax (34 milliseconds) [info] - SPARK-20854: select hint syntax with expressions (12 milliseconds) [info] - SPARK-20854: multiple hints (4 milliseconds) [info] - TRIM function (16 milliseconds) [info] - OVERLAY function (16 milliseconds) [info] - precedence of set operations (18 milliseconds) [info] - create/alter view as insert into table (4 milliseconds) [info] - Invalid insert constructs in the query (10 milliseconds) [info] - relation in v2 catalog (3 milliseconds) [info] - CTE with column alias (2 milliseconds) [info] - statement containing terminal semicolons (3 milliseconds) [info] ScalaTest [info] Run completed in 3 seconds, 129 milliseconds. [info] Total number of tests run: 36 [info] Suites: completed 1, aborted 0 [info] Tests: succeeded 36, failed 0, canceled 0, ignored 0, pending 0 [info] All tests passed. [info] Passed: Total 36, Failed 0, Errors 0, Passed 36 ``` ### Current behavior: #### scala ```scala scala> val df = sql("select 1") // df: org.apache.spark.sql.DataFrame = [1: int] scala> df.show() // +---+ // \| 1\| // +---+ // \| 1\| // +---+ scala> val df = sql("select 1;") // org.apache.spark.sql.catalyst.parser.ParseException: // extraneous input ';' expecting <EOF>(line 1, pos 8) // == SQL == // select 1; // --------^^^ // at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:263) // at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:130) // at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:52) // at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:76) // at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:605) // at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111) // at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:605) // ... 47 elided ``` #### pyspark ```python df = spark.sql('select 1') df.show() #+---+ #\| 1\| #+---+ #\| 1\| #+---+ df = spark.sql('select 1;') # Traceback (most recent call last): # File "<stdin>", line 1, in <module> # File "/Users/ssetegne/spark/python/pyspark/sql/session.py", line 646, in sql # return DataFrame(self._jsparkSession.sql(sqlQuery), self._wrapped) # File "/Users/ssetegne/spark/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py", line 1286, in # __call__ # File "/Users/ssetegne/spark/python/pyspark/sql/utils.py", line 102, in deco # raise converted # pyspark.sql.utils.ParseException: # extraneous input ';' expecting <EOF>(line 1, pos 8) # == SQL == # select 1; # --------^^^ ``` ### Behavior after proposed fix: #### scala ```scala scala> val df = sql("select 1") // df: org.apache.spark.sql.DataFrame = [1: int] scala> df.show() // +---+ // \| 1\| // +---+ // \| 1\| // +---+ scala> val df = sql("select 1;") // df: org.apache.spark.sql.DataFrame = [1: int] scala> df.show() // +---+ // \| 1\| // +---+ // \| 1\| // +---+ ``` #### pyspark ```python df = spark.sql('select 1') df.show() #+---+ #\| 1 \| #+---+ #\| 1 \| #+---+ df = spark.sql('select 1;') df.show() #+---+ #\| 1 \| #+---+ #\| 1 \| #+---+ ``` Closes #27567 from samsetegne/semicolon. Authored-by: samsetegne <samuelsetegne@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-03-25 15:00:15 +08:00
Xingbo Jiang	a03fbfbdd5	[SPARK-31207][CORE] Ensure the total number of blocks to fetch equals to the sum of local/hostLocal/remote blocks ### What changes were proposed in this pull request? Assert the number of blocks to fetch equals the number of local blocks + the number of hostLocal blocks + the number of remote blocks in ShuffleBlockFetcherIterator. Also refactor the code a bit to make it easier to follow. ### Why are the changes needed? When the numbers don't match it means something is going wrong, we should fail fast. ### Does this PR introduce any user-facing change? No. This is basically code refactoring. ### How was this patch tested? Tested with existing test suites. Closes #27972 from jiangxb1987/BlockFetcher. Authored-by: Xingbo Jiang <xingbo.jiang@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-03-25 13:19:43 +08:00
Xingbo Jiang	c2c5b2df50	[SPARK-31239][CORE][TEST] Increase await duration in `WorkerDecommissionSuite`.`verify a task with all workers decommissioned succeeds` ### What changes were proposed in this pull request? The test case has been flaky because the execution time sometimes exceeds the await duration. Increase the await duration to avoid flakiness. ### How was this patch tested? Tested locally and it didn't fail anymore. Closes #28007 from jiangxb1987/DecomTest. Authored-by: Xingbo Jiang <xingbo.jiang@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-03-25 13:43:35 +09:00
Kousuke Saruta	999c9ed10c	[SPARK-31081][UI][SQL] Make display of stageId/stageAttemptId/taskId of sql metrics toggleable ### What changes were proposed in this pull request? This is another solution for `SPARK-31081` and #27849 . I added a checkbox which can toggle display of stageId/taskid in the SQL's DAG page. Mainly, I implemented the toggleable texts in boxes with HTML label feature provided by `dagre-d3`. The additional metrics are enclosed by `<span>` and control the appearance of the text. But the exception is additional metrics in clusters. We can use HTML label for cluster but layout will be broken so I choosed normal text label for clusters. Due to that, this solution contains a little bit tricky code in`spark-sql-viz.js` to manipulate the metric texts and generate DOMs. ### Why are the changes needed? It makes metrics harder to read after #26843 and user may not interest in extra info(stageId/StageAttemptId/taskId ) when they do not need debug. #27849 control the appearance by a new configuration property but providing a checkbox is more flexible. ### Does this PR introduce any user-facing change? Yes. [Additional metrics shown] ![with-checked](https://user-images.githubusercontent.com/4736016/77244214-0f6cd780-6c56-11ea-9275-a30758dd5339.png) [Additional metrics hidden] ![without-chedked](https://user-images.githubusercontent.com/4736016/77244219-14ca2200-6c56-11ea-9874-33a466085fce.png) ### How was this patch tested? Tested manually with a simple DataFrame operation. * The appearance of additional metrics in the boxes are controlled by the newly added checkbox. * No error found with JS-debugger. * Checked/not-checked state is preserved after reloading. Closes #27927 from sarutak/SPARK-31081. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>	2020-03-24 13:37:13 -07:00
Kousuke Saruta	88864c0615	[SPARK-31161][WEBUI] Refactor the on-click timeline action in streagming-page.js ### What changes were proposed in this pull request? Refactor `streaming-page.js` by making on-click timeline action customizable. ### Why are the changes needed? In the current implementation, `streaming-page.js` is used from Streaming page and Structured Streaming page but the implementation of the on-click timeline action is strongly dependent on Streamng page. Structured Streaming page doesn't define the on-click action for now but it's better to remove the dependncy for the future. Originally, I make this change to fix `SPARK-31128` but #27883 resolved it. So, now this is just for refactoring. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Manual tests with following code and confirmed there are no regression and no error in the debug console in Firefox. For Structured Streaming: ``` spark.readStream.format("socket").options(Map("host"->"localhost", "port"->"8765")).load.writeStream.format("console").start ``` And then, visited Structured Streaming page and there were no error in the debug console when I clicked a point in the timeline. For Spark Streaming: ``` import org.apache.spark.streaming._ val ssc = new StreamingContext(sc, Seconds(1)) ssc.socketTextStream("localhost", 8765) dstream.foreachRDD(rdd => rdd.foreach(println)) ssc.start ``` And then, visited Streaming page and confirmed scrolling down and hilighting work well and there were no error in the debug console when I clicked a point in the timeline. Closes #27921 from sarutak/followup-SPARK-29543-fix-oncick. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2020-03-24 13:00:46 -05:00
yi.wu	f6ff7d0cf8	[SPARK-30127][SQL] Support case class parameter for typed Scala UDF ### What changes were proposed in this pull request? To support case class parameter for typed Scala UDF, e.g. ``` case class TestData(key: Int, value: String) val f = (d: TestData) => d.key * d.value.toInt val myUdf = udf(f) val df = Seq(("data", TestData(50, "2"))).toDF("col1", "col2") checkAnswer(df.select(myUdf(Column("col2"))), Row(100) :: Nil) ``` ### Why are the changes needed? Currently, Spark UDF can only work on data types like java.lang.String, o.a.s.sql.Row, Seq[_], etc. This is inconvenient if user want to apply an operation on one column, and the column is struct type. You must access data from a Row object, instead of domain object like Dataset operations. It will be great if UDF can work on types that are supported by Dataset, e.g. case class. And here's benchmark result of using case class comparing to row: ```scala // case class: 58ms 65ms 59ms 64ms 61ms // row: 59ms 64ms 73ms 84ms 69ms val f1 = (d: TestData) => s"${d.key}, ${d.value}" val f2 = (r: Row) => s"${r.getInt(0)}, ${r.getString(1)}" val udf1 = udf(f1) // set spark.sql.legacy.allowUntypedScalaUDF=true val udf2 = udf(f2, StringType) val df = spark.range(100000).selectExpr("cast (id as int) as id") .select(struct('id, lit("str")).as("col")) df.cache().collect() // warmup to exclude some extra influence df.select(udf1('col)).write.mode(SaveMode.Overwrite).format("noop").save() df.select(udf2('col)).write.mode(SaveMode.Overwrite).format("noop").save() start = System.currentTimeMillis() df.select(udf1('col)).write.mode(SaveMode.Overwrite).format("noop").save() println(System.currentTimeMillis() - start) start = System.currentTimeMillis() df.select(udf2('col)).write.mode(SaveMode.Overwrite).format("noop").save() println(System.currentTimeMillis() - start) ``` ### Does this PR introduce any user-facing change? Yes. User now could be able to use typed Scala UDF with case class as input parameter. ### How was this patch tested? Added unit tests. Closes #27937 from Ngone51/udf_caseclass_support. Authored-by: yi.wu <yi.wu@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-03-24 23:03:57 +08:00
Maxim Gekk	1fd4607d84	[SPARK-31221][SQL] Rebase any date-times in conversions to/from Java types ### What changes were proposed in this pull request? In the PR, I propose to apply rebasing for all dates/timestamps in conversion functions `fromJavaDate()`, `toJavaDate()`, `toJavaTimestamp()` and `fromJavaTimestamp()`. The rebasing is performed via building a local date-time in an original calendar, extracting date-time fields from the result, and creating new local date-time in the target calendar. ### Why are the changes needed? The changes are need to be compatible with previous Spark version (2.4.5 and earlier versions) not only before the Gregorian cutover date `1582-10-15` but also for dates after the date. For instance, Gregorian calendar implementation in Java 7 `java.util.GregorianCalendar` is not accurate in resolving time zone offsets as Gregorian calendar introduced since Java 8. ### Does this PR introduce any user-facing change? Yes, this PR can introduce behavior changes for dates after `1582-10-15`, in particular conversions of zone ids to zone offsets will be much more accurate. ### How was this patch tested? By existing test suites `DateTimeUtilsSuite`, `DateFunctionsSuite`, `DateExpressionsSuite`, `CollectionExpressionsSuite`, `HiveOrcHadoopFsRelationSuite`, `ParquetIOSuite`. Closes #27980 from MaxGekk/reuse-rebase-funcs-in-java-funcs. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-03-24 21:14:25 +08:00
HyukjinKwon	c181c45f86	[SPARK-31231][BUILD] Explicitly setuptools version as 46.0.0 in pip package test ### What changes were proposed in this pull request? For a bit of background, PIP packaging test started to fail (see [this logs](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/120218/testReport/)) as of setuptools 46.1.0 release. In https://github.com/pypa/setuptools/issues/1424, they decided to don't keep the modes in `package_data`. In PySpark pip installation, we keep the executable scripts in `package_data` `fc4e56a54c/python/setup.py (L199-L200)`, and expose their symbolic links as executable scripts. So, the symbolic links (or copied scripts) executes the scripts copied from `package_data`, which doesn't have the executable permission in its mode: ``` /tmp/tmp.UmkEGNFdKF/3.6/bin/spark-submit: line 27: /tmp/tmp.UmkEGNFdKF/3.6/lib/python3.6/site-packages/pyspark/bin/spark-class: Permission denied /tmp/tmp.UmkEGNFdKF/3.6/bin/spark-submit: line 27: exec: /tmp/tmp.UmkEGNFdKF/3.6/lib/python3.6/site-packages/pyspark/bin/spark-class: cannot execute: Permission denied ``` The current issue is being tracked at https://github.com/pypa/setuptools/issues/2041 </br> For what this PR proposes: It sets the upper bound in PR builder for now to unblock other PRs. _This PR does not solve the issue yet. I will make a fix after monitoring https://github.com/pypa/setuptools/issues/2041_ ### Why are the changes needed? It currently affects users who uses the latest setuptools. So, _users seem unable to use PySpark with the latest setuptools._ See also https://github.com/pypa/setuptools/issues/2041#issuecomment-602566667 ### Does this PR introduce any user-facing change? It makes CI pass for now. No user-facing change yet. ### How was this patch tested? Jenkins will test. Closes #27995 from HyukjinKwon/investigate-pip-packaging. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-03-24 17:59:43 +09:00
HyukjinKwon	bd324007d5	[SPARK-31229][SQL][TESTS] Add unit tests TypeCoercion.findTypeForComplex and Cast.canCast in null <> complex types ### What changes were proposed in this pull request? This PR (SPARK-31229) is rather a followup of https://github.com/apache/spark/pull/27926 (SPARK-31166). It adds unittests for `TypeCoercion.findTypeForComplex` and `Cast.canCast` about struct, map and array with the respect to null types. ### Why are the changes needed? To detect which scope was broken in the future easily. ### Does this PR introduce any user-facing change? No, it's a test-only. ### How was this patch tested? Unittests were added. Closes #27990 from HyukjinKwon/SPARK-31166-followup. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-03-24 14:10:59 +09:00
Wenchen Fan	1d0f54951e	[SPARK-31205][SQL] support string literal as the second argument of date_add/date_sub functions ### What changes were proposed in this pull request? https://github.com/apache/spark/pull/26412 introduced a behavior change that `date_add`/`date_sub` functions can't accept string and double values in the second parameter. This is reasonable as it's error-prone to cast string/double to int at runtime. However, using string literals as function arguments is very common in SQL databases. To avoid breaking valid use cases that the string literal is indeed an integer, this PR proposes to add ansi_cast for string literal in date_add/date_sub functions. If the string value is not a valid integer, we fail at query compiling time because of constant folding. ### Why are the changes needed? avoid breaking changes ### Does this PR introduce any user-facing change? Yes, now 3.0 can run `date_add('2011-11-11', '1')` like 2.4 ### How was this patch tested? new tests. Closes #27965 from cloud-fan/string. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-03-24 12:07:22 +08:00
Maxim Gekk	aa3a7429f4	[SPARK-31159][SQL][FOLLOWUP] Move checking of the `rebaseDateTime` flag out of the loop in `VectorizedColumnReader` ### What changes were proposed in this pull request? In the PR, I propose to refactor reading of timestamps of the `TIMESTAMP_MILLIS` logical type from Parquet files in `VectorizedColumnReader`, and move checking of the `rebaseDateTime` flag out of the internal loop. ### Why are the changes needed? To avoid any additional overhead of the checking the SQL config `spark.sql.legacy.parquet.rebaseDateTime.enabled` introduced by the PR https://github.com/apache/spark/pull/27915. ### Does this PR introduce any user-facing change? No ### How was this patch tested? By running the test suite `ParquetIOSuite`. Closes #27973 from MaxGekk/rebase-parquet-datetime-followup. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-03-23 23:02:48 +09:00
Wenchen Fan	d929c0dfe8	[SPARK-31133][SQL][DOC] fix sql ref doc for DML ### What changes were proposed in this pull request? `INSERT OVERWRITE DIRECTORY` can only use file format (class implements `org.apache.spark.sql.execution.datasources.FileFormat`). This PR fixes it and other minor improvement. ### Why are the changes needed? ### Does this PR introduce any user-facing change? ### How was this patch tested? Closes #27891 from cloud-fan/doc. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-03-23 22:00:50 +08:00
yi.wu	5c4d44bb83	[SPARK-31190][SQL] ScalaReflection should not erasure user defined AnyVal type ### What changes were proposed in this pull request? Improve `ScalaReflection` to only don't erasure non user defined `AnyVal` type, but still erasure other types, e.g. `Any`. And this brings two benefits: 1. Give better encode error message for some unsupported types, e.g. `Any` 2. Won't miss the walk path for the `AnyVal` type ### Why are the changes needed? Firstly, PR #15284 added encode(serializeFor/deserializeFor) support for value class, which extends `AnyVal`, by not erasure types. But, this also introduce a problem that when user try to encoder unsupported types, e.g. `Any`, it will fail on `java.lang.ClassNotFoundException: scala.Any` due to the reason that `scala.Any` doesn't erasure to `java.lang.Object`. Also, in current `getClassNameFromType()`, it always erasure types which could missing walked path for user defined `AnyVal` types. ### Does this PR introduce any user-facing change? Yes. For the test below: ``` case class Bar(i: Any) case class Foo(i: Bar) extends AnyVal test() { implicitly[ExpressionEncoder[Foo]] } ``` Before: ``` java.lang.ClassNotFoundException: scala.Any at java.net.URLClassLoader.findClass(URLClassLoader.java:382) at java.lang.ClassLoader.loadClass(ClassLoader.java:418) at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:355) ... ```` After: ``` java.lang.UnsupportedOperationException: No Encoder found for Any - field (class: "java.lang.Object", name: "i") - field (class: "org.apache.spark.sql.catalyst.encoders.Bar", name: "i") - root class: "org.apache.spark.sql.catalyst.encoders.Foo" at org.apache.spark.sql.catalyst.ScalaReflection$.$anonfun$serializerFor$1(ScalaReflection.scala:561) ``` ### How was this patch tested? Added unit test and test manually. Closes #27959 from Ngone51/impr_anyval. Authored-by: yi.wu <yi.wu@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-03-23 16:28:34 +08:00
Maxim Gekk	db6247faa8	[SPARK-31211][SQL] Fix rebasing of 29 February of Julian leap years ### What changes were proposed in this pull request? In the PR, I propose to fix the issue of rebasing leap years in Julian calendar to Proleptic Gregorian calendar in which the years are not leap years. In the Julian calendar, every four years is a leap year, with a leap day added to the month of February. In Proleptic Gregorian calendar, every year that is exactly divisible by four is a leap year, except for years that are exactly divisible by 100, but these centurial years are leap years, if they are exactly divisible by 400. In this ways, the date 1000-02-29 exists in the Julian calendar but not in Proleptic Gregorian calendar. I modified the `rebaseJulianToGregorianMicros()` and `rebaseJulianToGregorianDays()` in `DateTimeUtils` by passing 1 as a day number of month while forming `LocalDate` or `LocalDateTime`, and adding the number of days using the `plusDays()` method. For example, 1000-02-29 doesn't exist in Proleptic Gregorian calendar, and `LocalDate.of(1000, 2, 29)` throws an exception. To avoid the issue, I build the `LocalDate.of(1000, 2, 1)` date and add 28 days. The `plusDays(28)` method produces the next valid date after `1000-02-28` which is 1000-03-01. ### Why are the changes needed? Before the changes, the `java.time.DateTimeException` exception is raised while loading the date `1000-02-29` from parquet files saved by Spark 2.4.5: ```scala scala> spark.conf.set("spark.sql.legacy.parquet.rebaseDateTime.enabled", true) scala> spark.read.parquet("/Users/maxim/tmp/before_1582/2_4_5_date_leap").show 20/03/21 03:03:59 ERROR Executor: Exception in task 0.0 in stage 3.0 (TID 3) java.time.DateTimeException: Invalid date 'February 29' as '1000' is not a leap year ``` The parquet files were saved via the commands: ```shell $ export TZ="America/Los_Angeles" ``` ```scala scala> scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles") scala> val df = Seq(java.sql.Date.valueOf("1000-02-29")).toDF("dateS").select($"dateS".as("date")) df: org.apache.spark.sql.DataFrame = [date: date] scala> df.write.mode("overwrite").parquet("/Users/maxim/tmp/before_1582/2_4_5_date_leap") scala> spark.read.parquet("/Users/maxim/tmp/before_1582/2_4_5_date_leap").show +----------+ \| date\| +----------+ \|1000-02-29\| +----------+ ``` ### Does this PR introduce any user-facing change? Yes, after the fix: ```scala scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles") scala> spark.conf.set("spark.sql.legacy.parquet.rebaseDateTime.enabled", true) scala> spark.read.parquet("/Users/maxim/tmp/before_1582/2_4_5_date_leap").show +----------+ \| date\| +----------+ \|1000-03-01\| +----------+ ``` ### How was this patch tested? Added tests to `DateTimeUtilsSuite`. Closes #27974 from MaxGekk/julian-date-29-feb. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-03-23 14:21:24 +08:00
LantaoJin	929b794e25	[SPARK-30494][SQL] Fix cached data leakage during replacing an existing view ### What changes were proposed in this pull request? The cached RDD for plan "select 1" stays in memory forever until the session close. This cached data cannot be used since the view temp1 has been replaced by another plan. It's a memory leak. We can reproduce by below commands: ``` Welcome to ____ __ / __/__ ___ _____/ /__ _\ \/ _ \/ _ `/ __/ '_/ /___/ .__/\_,_/_/ /_/\_\ version 3.0.0-SNAPSHOT /_/ Using Scala version 2.12.10 (Java HotSpot(TM) 64-Bit Server VM, Java 1.8.0_201) Type in expressions to have them evaluated. Type :help for more information. scala> spark.sql("create or replace temporary view temp1 as select 1") scala> spark.sql("cache table temp1") scala> spark.sql("create or replace temporary view temp1 as select 1, 2") scala> spark.sql("cache table temp1") scala> assert(spark.sharedState.cacheManager.lookupCachedData(sql("select 1, 2")).isDefined) scala> assert(spark.sharedState.cacheManager.lookupCachedData(sql("select 1")).isDefined) ``` ### Why are the changes needed? Fix the memory leak, specially for long running mode. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Add an unit test. Closes #27185 from LantaoJin/SPARK-30494. Authored-by: LantaoJin <jinlantao@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-03-22 22:22:13 -07:00
beliefer	a0cf972985	[SPARK-31141][DSTREAMS][DOC] Add version information to the configuration of Dstreams ### What changes were proposed in this pull request? Add version information to the configuration of `Dstreams`. I sorted out some information show below. Item name \| Since version \| JIRA ID \| Commit ID \| Note -- \| -- \| -- \| -- \| -- spark.streaming.backpressure.enabled \| 1.5.0 \| SPARK-9967 and SPARK-10099 \| 392bd19d678567751cd3844d9d166a7491c5887e#diff-1b584c4ed88a9022abb11d594f760997 \| spark.streaming.backpressure.initialRate \| 2.0.0 \| SPARK-11627 \| 7218c0eba957e0a079a407b79c3a050cce9647b2#diff-c64d571ef32d2dbf76e965ecd04a9f52 \| spark.streaming.blockInterval \| 0.8.0 \| None \| 46eecd110a4017ea0c86cbb1010d0ccd6a5eb2ef#diff-54d85b29e4349628a0de525c119399b5 \| spark.streaming.receiver.maxRate \| 1.0.2 \| SPARK-1341 \| ca19cfbcd5cfac9ad731350dfeea14355aec87d6#diff-c64d571ef32d2dbf76e965ecd04a9f52 \| spark.streaming.receiver.writeAheadLog.enable \| 1.2.1 \| SPARK-4482 \| ce5ea0fd611ce560f6e1fac83562469bdb97091e#diff-0607b70e4e79cbbc1a128c45784cb813 \| spark.streaming.unpersist \| 0.9.0 \| None \| 08b9fec93d00ff0ebb49af4d9ac72d2806eded02#diff-bcf5f84f78d23ebde7d532bea756bc57 \| spark.streaming.stopGracefullyOnShutdown \| 1.4.0 \| SPARK-7776 \| a17a5cb302c5fa6a4d3e9e3e0fa2100c0b5436d6#diff-8a7f0e3f26c15ba484e6312c3caf033d \| spark.streaming.kafka.maxRetries \| 1.3.0 \| SPARK-4964 \| a119cae48030520da9f26ee9a1270bed7f33031e#diff-26cb4369f86050dc2e75cd16291b2844 \| spark.streaming.ui.retainedBatches \| 1.0.0 \| SPARK-1386 \| f36dc3fed0a0671b0712d664db859da28c0a98e2#diff-56b8d67d07284cfab165d5363bd3500e \| spark.streaming.driver.writeAheadLog.closeFileAfterWrite \| 1.6.0 \| SPARK-11324 \| 4f030b9e82172659d250281782ac573cbd1438fc#diff-a1b3ec72e8d7cc91433a1cc64fe6e91d \| spark.streaming.receiver.writeAheadLog.closeFileAfterWrite \| 1.6.0 \| SPARK-11324 \| 4f030b9e82172659d250281782ac573cbd1438fc#diff-a1b3ec72e8d7cc91433a1cc64fe6e91d \| spark.streaming.receiver.writeAheadLog.class \| 1.4.0 \| SPARK-7056 \| 1868bd40dcce23990b98748b0239bd00452b1ca5#diff-a1b3ec72e8d7cc91433a1cc64fe6e91d \| spark.streaming.receiver.writeAheadLog.rollingIntervalSecs \| 1.4.0 \| SPARK-7056 \| 1868bd40dcce23990b98748b0239bd00452b1ca5#diff-a1b3ec72e8d7cc91433a1cc64fe6e91d \| spark.streaming.receiver.writeAheadLog.maxFailures \| 1.2.0 \| SPARK-4028 \| 234de9232bcfa212317a8073c4a82c3863b36b14#diff-8cec1a581eebcad673dc8930b1a2801c \| spark.streaming.driver.writeAheadLog.class \| 1.4.0 \| SPARK-7056 \| 1868bd40dcce23990b98748b0239bd00452b1ca5#diff-a1b3ec72e8d7cc91433a1cc64fe6e91d \| spark.streaming.driver.writeAheadLog.rollingIntervalSecs \| 1.4.0 \| SPARK-7056 \| 1868bd40dcce23990b98748b0239bd00452b1ca5#diff-a1b3ec72e8d7cc91433a1cc64fe6e91d \| spark.streaming.driver.writeAheadLog.maxFailures \| 1.4.0 \| SPARK-7056 \| 1868bd40dcce23990b98748b0239bd00452b1ca5#diff-a1b3ec72e8d7cc91433a1cc64fe6e91d \| spark.streaming.driver.writeAheadLog.allowBatching \| 1.6.0 \| SPARK-11141 \| dccc4645df629f35c4788d50b2c0a6ab381db4b7#diff-a1b3ec72e8d7cc91433a1cc64fe6e91d \| spark.streaming.driver.writeAheadLog.batchingTimeout \| 1.6.0 \| SPARK-11141 \| dccc4645df629f35c4788d50b2c0a6ab381db4b7#diff-a1b3ec72e8d7cc91433a1cc64fe6e91d \| spark.streaming.sessionByKey.deltaChainThreshold \| 1.6.0 \| SPARK-11290 \| daa74be6f863061221bb0c2f94e70672e6fcbeaa#diff-e0a40541298f885606a2361ff9c5af6c \| spark.streaming.backpressure.rateEstimator \| 1.5.0 \| SPARK-8977 \| 819be46e5a73f2d19230354ebba30c58538590f5#diff-5dcaea3a4eca07f898fa88fe6d69e5c3 \| spark.streaming.backpressure.pid.proportional \| 1.5.0 \| SPARK-8979 \| 0a1d2ca42c8b31d6b0e70163795f0185d4622f87#diff-5dcaea3a4eca07f898fa88fe6d69e5c3 \| spark.streaming.backpressure.pid.integral \| 1.5.0 \| SPARK-8979 \| 0a1d2ca42c8b31d6b0e70163795f0185d4622f87#diff-5dcaea3a4eca07f898fa88fe6d69e5c3 \| spark.streaming.backpressure.pid.derived \| 1.5.0 \| SPARK-8979 \| 0a1d2ca42c8b31d6b0e70163795f0185d4622f87#diff-5dcaea3a4eca07f898fa88fe6d69e5c3 \| spark.streaming.backpressure.pid.minRate \| 1.5.0 \| SPARK-9966 \| 612b4609bdd38763725ae07d77c2176aa6756e64#diff-5dcaea3a4eca07f898fa88fe6d69e5c3 \| spark.streaming.concurrentJobs \| 0.7.0 \| None \| c97ebf64377e853ab7c616a103869a4417f25954#diff-839f06302b2d648a85436486fc13c85d \| spark.streaming.internal.batchTime \| 1.4.0 \| SPARK-6862 \| 1b7106b867bc0aa4d64b669d79b646f862acaf47#diff-25124e4f06a1da237bf486eceb1f7967 \| It's not a configuration, it's a property spark.streaming.internal.outputOpId \| 1.4.0 \| SPARK-6862 \| 1b7106b867bc0aa4d64b669d79b646f862acaf47#diff-25124e4f06a1da237bf486eceb1f7967 \| It's not a configuration, it's a property spark.streaming.clock \| 0.7.0 \| None \| cae894ee7aefa4cf9b1952038a48be81e1d2a856#diff-839f06302b2d648a85436486fc13c85d \| spark.streaming.gracefulStopTimeout \| 1.0.0 \| SPARK-1332 \| 94cbe2329021296b660d88f3e8ef3734374020d2#diff-2f8c5c038fda47b9875e10785fdd2498 \| spark.streaming.manualClock.jump \| 0.7.0 \| None \| fc3d0b602a08fdd182c2138506d1cd9952631f95#diff-839f06302b2d648a85436486fc13c85d \| ### Why are the changes needed? Supplemental configuration version information. ### Does this PR introduce any user-facing change? 'No' ### How was this patch tested? Exists UT Closes #27898 from beliefer/add-version-to-dstream-config. Authored-by: beliefer <beliefer@163.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-03-23 13:01:44 +09:00
zhengruifeng	ded51b04d2	[SPARK-31138][ML][FOLLOWUP] ANOVA optimization ### What changes were proposed in this pull request? 1, remove unused var `numFeatures`; 2, remove the computation of `numSamples` and `numClasses`, since they can be directly infered by `counts: OpenHashMap[Double, Long]` ### Why are the changes needed? remove a unnecessary job to compute `numSamples` and `numClasses` ### Does this PR introduce any user-facing change? No ### How was this patch tested? existing testsuites Closes #27979 from zhengruifeng/anova_followup. Authored-by: zhengruifeng <ruifengz@foxmail.com> Signed-off-by: zhengruifeng <ruifengz@foxmail.com>	2020-03-23 11:16:57 +08:00
beliefer	ae0699d4b5	[SPARK-31002][CORE][DOC][FOLLOWUP] Add version information to the configuration of Core ### What changes were proposed in this pull request? This PR follows up #27847, #27852 and https://github.com/apache/spark/pull/27913. I sorted out some information show below. Item name \| Since version \| JIRA ID \| Commit ID \| Note -- \| -- \| -- \| -- \| -- spark.storage.localDiskByExecutors.cacheSize \| 3.0.0 \| SPARK-27651 \| fd2bf55abaab08798a428d4e47d4050ba2b82a95#diff-6bdad48cfc34314e89599655442ff210 \| spark.storage.memoryMapLimitForTests \| 2.3.0 \| SPARK-3151 \| b8ffb51055108fd606b86f034747006962cd2df3#diff-abd96f2ae793cd6ea6aab5b96a3c1d7a \| spark.barrier.sync.timeout \| 2.4.0 \| SPARK-24817 \| 388f5a0635a2812cd71b08352e3ddc20293ec189#diff-6bdad48cfc34314e89599655442ff210 \| spark.scheduler.blacklist.unschedulableTaskSetTimeout \| 2.4.1 \| SPARK-22148 \| 52e9711d01694158ecb3691f2ec25c0ebe4b0207#diff-6bdad48cfc34314e89599655442ff210 \| spark.scheduler.barrier.maxConcurrentTasksCheck.interval \| 2.4.0 \| SPARK-24819 \| bfb74394a5513134ea1da9fcf4a1783b77dd64e4#diff-6bdad48cfc34314e89599655442ff210 \| spark.scheduler.barrier.maxConcurrentTasksCheck.maxFailures \| 2.4.0 \| SPARK-24819 \| bfb74394a5513134ea1da9fcf4a1783b77dd64e4#diff-6bdad48cfc34314e89599655442ff210 \| spark.unsafe.exceptionOnMemoryLeak \| 1.4.0 \| SPARK-7076 and SPARK-7077 and SPARK-7080 \| f49284b5bf3a69ed91a5e3e6e0ed3be93a6ab9e4#diff-5a0de266c82b95adb47d9bca714e1f1b \| spark.unsafe.sorter.spill.read.ahead.enabled \| 2.3.0 \| SPARK-21113 \| 1e978b17d63d7ba20368057aa4e65f5ef6e87369#diff-93a086317cea72a113cf81056882c206 \| spark.unsafe.sorter.spill.reader.buffer.size \| 2.1.0 \| SPARK-16862 \| c1937dd19a23bd096a4707656c7ba19fb5c16966#diff-93a086317cea72a113cf81056882c206 \| spark.plugins \| 3.0.0 \| SPARK-29397 \| d51d228048d519a9a666f48dc532625de13e7587#diff-6bdad48cfc34314e89599655442ff210 \| spark.cleaner.periodicGC.interval \| 1.6.0 \| SPARK-8414 \| 72da2a21f0940b97757ace5975535e559d627688#diff-75141521b1d55bc32d72b70032ad96c0 \| spark.cleaner.referenceTracking \| 1.0.0 \| SPARK-1103 \| 11eabbe125b2ee572fad359c33c93f5e6fdf0b2d#diff-364713d7776956cb8b0a771e9b62f82d \| spark.cleaner.referenceTracking.blocking \| 1.0.0 \| SPARK-1103 \| 11eabbe125b2ee572fad359c33c93f5e6fdf0b2d#diff-364713d7776956cb8b0a771e9b62f82d \| spark.cleaner.referenceTracking.blocking.shuffle \| 1.1.1 \| SPARK-3139 \| 5cf1e440137006eedd6846ac8fa57ccf9fd1958d#diff-75141521b1d55bc32d72b70032ad96c0 \| spark.cleaner.referenceTracking.cleanCheckpoints \| 1.4.0 \| SPARK-2033 \| 25998e4d73bcc95ac85d9af71adfdc726ec89568#diff-440e866c5df0b8386aff57f9f8bd8db1 \| spark.executor.logs.rolling.strategy \| 1.1.0 \| SPARK-1940 \| 4823bf470ec1b47a6f404834d4453e61d3dcbec9#diff-2b4575e096e4db7165e087f9429f2a02 \| spark.executor.logs.rolling.time.interval \| 1.1.0 \| SPARK-1940 \| 4823bf470ec1b47a6f404834d4453e61d3dcbec9#diff-2b4575e096e4db7165e087f9429f2a02 \| spark.executor.logs.rolling.maxSize \| 1.4.0 \| SPARK-5932 \| 2d222fb39dd978e5a33cde6ceb59307cbdf7b171#diff-529fc5c06b9731c1fbda6f3db60b16aa \| spark.executor.logs.rolling.maxRetainedFiles \| 1.1.0 \| SPARK-1940 \| 4823bf470ec1b47a6f404834d4453e61d3dcbec9#diff-2b4575e096e4db7165e087f9429f2a02 \| spark.executor.logs.rolling.enableCompression \| 2.0.2 \| SPARK-17711 \| 26e978a93f029e1a1b5c7524d0b52c8141b70997#diff-2b4575e096e4db7165e087f9429f2a02 \| spark.master.rest.enabled \| 1.3.0 \| SPARK-5388 \| 6ec0cdc14390d4dc45acf31040f21e1efc476fc0#diff-29dffdccd5a7f4c8b496c293e87c8668 \| spark.master.rest.port \| 1.3.0 \| SPARK-5388 \| 6ec0cdc14390d4dc45acf31040f21e1efc476fc0#diff-29dffdccd5a7f4c8b496c293e87c8668 \| spark.master.ui.port \| 1.1.0 \| SPARK-2857 \| 12f99cf5f88faf94d9dbfe85cb72d0010a3a25ac#diff-366c88f47e9b5cfa4d4305febeb8b026 \| spark.io.compression.snappy.blockSize \| 1.4.0 \| SPARK-5932 \| 2d222fb39dd978e5a33cde6ceb59307cbdf7b171#diff-529fc5c06b9731c1fbda6f3db60b16aa \| spark.io.compression.lz4.blockSize \| 1.4.0 \| SPARK-5932 \| 2d222fb39dd978e5a33cde6ceb59307cbdf7b171#diff-529fc5c06b9731c1fbda6f3db60b16aa \| spark.io.compression.codec \| 0.8.0 \| None \| 46eecd110a4017ea0c86cbb1010d0ccd6a5eb2ef#diff-df9e6118c481ceb27faa399114fac0a1 \| spark.io.compression.zstd.bufferSize \| 2.3.0 \| SPARK-19112 \| 444bce1c98c45147fe63e2132e9743a0c5e49598#diff-df9e6118c481ceb27faa399114fac0a1 \| spark.io.compression.zstd.level \| 2.3.0 \| SPARK-19112 \| 444bce1c98c45147fe63e2132e9743a0c5e49598#diff-df9e6118c481ceb27faa399114fac0a1 \| spark.io.warning.largeFileThreshold \| 3.0.0 \| SPARK-28366 \| 26d03b62e20d053943d03b5c5573dd349e49654c#diff-6bdad48cfc34314e89599655442ff210 \| spark.eventLog.compression.codec \| 3.0.0 \| SPARK-28118 \| 47f54b1ec717d0d744bf3ad46bb1ed3542b667c8#diff-6bdad48cfc34314e89599655442ff210 \| spark.buffer.size \| 0.5.0 \| None \| 4b1646a25f7581cecae108553da13833e842e68a#diff-eaf125f56ce786d64dcef99cf446a751 \| spark.locality.wait.process \| 0.8.0 \| None \| 46eecd110a4017ea0c86cbb1010d0ccd6a5eb2ef#diff-264da78fe625d594eae59d1adabc8ae9 \| spark.locality.wait.node \| 0.8.0 \| None \| 46eecd110a4017ea0c86cbb1010d0ccd6a5eb2ef#diff-264da78fe625d594eae59d1adabc8ae9 \| spark.locality.wait.rack \| 0.8.0 \| None \| 46eecd110a4017ea0c86cbb1010d0ccd6a5eb2ef#diff-264da78fe625d594eae59d1adabc8ae9 \| spark.reducer.maxSizeInFlight \| 1.4.0 \| SPARK-5932 \| 2d222fb39dd978e5a33cde6ceb59307cbdf7b171#diff-529fc5c06b9731c1fbda6f3db60b16aa \| spark.reducer.maxReqsInFlight \| 2.0.0 \| SPARK-6166 \| 894921d813a259f2f266fde7d86d2ecb5a0af24b#diff-eb30a71e0d04150b8e0b64929852e38b \| spark.broadcast.compress \| 0.6.0 \| None \| efc5423210d1aadeaea78273a4a8f10425753079#diff-76170a9c8f67b542bc58240a0a12fe08 \| spark.broadcast.blockSize \| 0.5.0 \| None \| b8ab7862b8bd168bca60bd930cd97c1099fbc8a8#diff-271d7958e14cdaa46cf3737cfcf51341 \| spark.broadcast.checksum \| 2.1.1 \| SPARK-18188 \| 06a56df226aa0c03c21f23258630d8a96385c696#diff-4f43d14923008c6650a8eb7b40c07f74 \| spark.broadcast.UDFCompressionThreshold \| 3.0.0 \| SPARK-28355 \| 79e204770300dab4a669b9f8e2421ef905236e7b#diff-6bdad48cfc34314e89599655442ff210 \| spark.rdd.compress \| 0.6.0 \| None \| efc5423210d1aadeaea78273a4a8f10425753079#diff-76170a9c8f67b542bc58240a0a12fe08 \| spark.rdd.parallelListingThreshold \| 2.0.0 \| SPARK-9926 \| 80a4bfa4d1c86398b90b26c34d8dcbc2355f5a6a#diff-eaababfc87ea4949f97860e8b89b7586 \| spark.rdd.limit.scaleUpFactor \| 2.1.0 \| SPARK-16984 \| 806d8a8e980d8ba2f4261bceb393c40bafaa2f73#diff-1d55e54678eff2076263f2fe36150c17 \| spark.serializer \| 0.5.0 \| None \| fd1d255821bde844af28e897fabd59a715659038#diff-b920b65c23bf3a1b3326325b0d6a81b2 \| spark.serializer.objectStreamReset \| 1.0.0 \| SPARK-942 \| 40566e10aae4b21ffc71ea72702b8df118ac5c8e#diff-6a59dfc43d1b31dc1c3072ceafa829f5 \| spark.serializer.extraDebugInfo \| 1.3.0 \| SPARK-5307 \| 636408311deeebd77fb83d2249e0afad1a1ba149#diff-6a59dfc43d1b31dc1c3072ceafa829f5 \| spark.jars \| 0.9.0 \| None \| f1d206c6b4c0a5b2517b05af05fdda6049e2f7c2#diff-364713d7776956cb8b0a771e9b62f82d \| spark.files \| 1.0.0 \| None \| 29ee101c73bf066bf7f4f8141c475b8d1bd3cf1c#diff-364713d7776956cb8b0a771e9b62f82d \| spark.submit.deployMode \| 1.5.0 \| SPARK-6797 \| 7f487c8bde14dbdd244a3493ad11a129ef2bb327#diff-4d2ab44195558d5a9d5f15b8803ef39d \| spark.submit.pyFiles \| 1.0.1 \| SPARK-1549 \| d7ddb26e1fa02e773999cc4a97c48d2cd1723956#diff-4d2ab44195558d5a9d5f15b8803ef39d \| spark.scheduler.allocation.file \| 0.8.1 \| None \| 976fe60f7609d7b905a34f18743efabd966407f0#diff-9bc0105ee454005379abed710cd20ced \| spark.scheduler.minRegisteredResourcesRatio \| 1.1.1 \| SPARK-2635 \| 3311da2f9efc5ff2c7d01273ac08f719b067d11d#diff-7d99a7c7a051e5e851aaaefb275a44a1 \| spark.scheduler.maxRegisteredResourcesWaitingTime \| 1.1.1 \| SPARK-2635 \| 3311da2f9efc5ff2c7d01273ac08f719b067d11d#diff-7d99a7c7a051e5e851aaaefb275a44a1 \| spark.scheduler.mode \| 0.8.0 \| None \| 98fb69822cf780160bca51abeaab7c82e49fab54#diff-cb7a25b3c9a7341c6d99bcb8e9780c92 \| spark.scheduler.revive.interval \| 0.8.1 \| None \| d0c9d41a061969d409715b86a91937d8de4c29f7#diff-7d99a7c7a051e5e851aaaefb275a44a1 \| spark.speculation \| 0.6.0 \| None \| e72afdb817bcc8388aeb8b8d31628fd5fd67acf1#diff-4e188f32951dc989d97fa7577858bc7c \| spark.speculation.interval \| 0.6.0 \| None \| e72afdb817bcc8388aeb8b8d31628fd5fd67acf1#diff-4e188f32951dc989d97fa7577858bc7c \| spark.speculation.multiplier \| 0.6.0 \| None \| e72afdb817bcc8388aeb8b8d31628fd5fd67acf1#diff-fff59f72dfe6ca4ccb607ad12535da07 \| spark.speculation.quantile \| 0.6.0 \| None \| e72afdb817bcc8388aeb8b8d31628fd5fd67acf1#diff-fff59f72dfe6ca4ccb607ad12535da07 \| spark.speculation.task.duration.threshold \| 3.0.0 \| SPARK-29976 \| ad238a2238a9d0da89be4424574436cbfaee579d#diff-6bdad48cfc34314e89599655442ff210 \| spark.yarn.stagingDir \| 2.0.0 \| SPARK-13063 \| bc36df127d3b9f56b4edaeb5eca7697d4aef761a#diff-14b8ed2ef4e3da985300b8d796a38fa9 \| spark.buffer.pageSize \| 1.5.0 \| SPARK-9411 \| 1b0099fc62d02ff6216a76fbfe17a4ec5b2f3536#diff-1b22e54318c04824a6d53ed3f4d1bb35 \| ### Why are the changes needed? Supplemental configuration version information. ### Does this PR introduce any user-facing change? 'No'. ### How was this patch tested? Exists UT Closes #27931 from beliefer/add-version-to-core-config-part-four. Authored-by: beliefer <beliefer@163.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-03-23 11:07:43 +09:00
Kent Yao	f81f11822c	[SPARK-31189][R][DOCS][FOLLOWUP] Replace Datetime pattern links in R doc ### What changes were proposed in this pull request? Use our own docs for data pattern instructions to replace java doc. ### Why are the changes needed? fix doc ### Does this PR introduce any user-facing change? yes. doc changed ### How was this patch tested? pass jenkins Closes #27975 from yaooqinn/SPARK-31189-2. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-03-22 14:22:44 +09:00
Huaxin Gao	307cfe1f8e	[SPARK-31185][ML] Implement VarianceThresholdSelector ### What changes were proposed in this pull request? Implement a Feature selector that removes all low-variance features. Features with a variance lower than the threshold will be removed. The default is to keep all features with non-zero variance, i.e. remove the features that have the same value in all samples. ### Why are the changes needed? VarianceThreshold is a simple baseline approach to feature selection. It removes all features whose variance doesn’t meet some threshold. The idea is when a feature doesn’t vary much within itself, it generally has very little predictive power. scikit has implemented this selector. https://scikit-learn.org/stable/modules/feature_selection.html#variance-threshold ### Does this PR introduce any user-facing change? Yes. ### How was this patch tested? Add new test suite. Closes #27954 from huaxingao/variance-threshold. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: zhengruifeng <ruifengz@foxmail.com>	2020-03-22 12:44:18 +08:00
Jungtaek Lim (HeartSaVioR)	f55f6b569b	[SPARK-31101][BUILD] Upgrade Janino to 3.0.16 ### What changes were proposed in this pull request? This PR(SPARK-31101) proposes to upgrade Janino to 3.0.16 which is released recently. * Merged pull request janino-compiler/janino#114 "Grow the code for relocatables, and do fixup, and relocate". Please see the commit log. - https://github.com/janino-compiler/janino/commits/3.0.16 You can see the changelog from the link: http://janino-compiler.github.io/janino/changelog.html / though release note for Janino 3.0.16 is actually incorrect. ### Why are the changes needed? We got some report on failure on user's query which Janino throws error on compiling generated code. The issue is here: janino-compiler/janino#113 It contains the information of generated code, symptom (error), and analysis of the bug, so please refer the link for more details. Janino 3.0.16 contains the PR janino-compiler/janino#114 which would enable Janino to succeed to compile user's query properly. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Existing UTs. Closes #27932 from HeartSaVioR/SPARK-31101-janino-3.0.16. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-03-21 19:10:23 -07:00
Gabor Somogyi	bf342bafa8	[SPARK-30541][TESTS] Implement KafkaDelegationTokenSuite with testRetry ### What changes were proposed in this pull request? `KafkaDelegationTokenSuite` has been ignored because showed flaky behaviour. In this PR I've changed the approach how the test executed and turning it on again. This PR contains the following: * The test runs in separate JVM in order to avoid modified security context * The body of the test runs in `testRetry` which reties if failed * Additional logs to analyse possible failures * Enhanced clean-up code ### Why are the changes needed? `KafkaDelegationTokenSuite ` is ignored. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Executed the test in loop 1k+ times in jenkins (locally much harder to reproduce). Closes #27877 from gaborgsomogyi/SPARK-30541. Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-03-21 18:59:29 -07:00
Prashant Sharma	3799d2b9d8	[SPARK-30715][K8S][TESTS][FOLLOWUP] Update k8s client version in IT as well ### What changes were proposed in this pull request? This is a follow up for SPARK-30715 . Kubernetes client version in sync in integration-tests and kubernetes/core ### Why are the changes needed? More than once, the kubernetes client version has gone out of sync between integration tests and kubernetes/core. So brought them up in sync again and added a comment to save us from future need of this additional followup. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Manually. Closes #27948 from ScrapCodes/follow-up-spark-30715. Authored-by: Prashant Sharma <prashsh1@in.ibm.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-03-21 18:26:53 -07:00
Eric Wu	3a48ea1fe0	[SPARK-31184][SQL] Support getTablesByType API of Hive Client ### What changes were proposed in this pull request? Hive 2.3+ supports `getTablesByType` API, which will provide an efficient way to get HiveTable with specific type. Now, we have following mappings when using `HiveExternalCatalog`. ``` CatalogTableType.EXTERNAL => HiveTableType.EXTERNAL_TABLE CatalogTableType.MANAGED => HiveTableType.MANAGED_TABLE CatalogTableType.VIEW => HiveTableType.VIRTUAL_VIEW ``` Without this API, we need to achieve the goal by `getTables` + `getTablesByName` + `filter with type`. This PR add `getTablesByType` in `HiveShim`. For those hive versions don't support this API, `UnsupportedOperationException` will be thrown. And the upper logic should catch the exception and fallback to the filter solution mentioned above. Since the JDK11 related fix in `Hive` is not released yet, manual tests against hive 2.3.7-SNAPSHOT is done by following the instructions of SPARK-29245. ### Why are the changes needed? This API will provide better usability and performance if we want to get a list of hiveTables with specific type. For example `HiveTableType.VIRTUAL_VIEW` corresponding to `CatalogTableType.VIEW`. ### Does this PR introduce any user-facing change? No, this is a support function. ### How was this patch tested? Add tests in VersionsSuite and manually run JDK11 test with following settings: - Hive 2.3.6 Metastore on JDK8 - Hive 2.3.7-SNAPSHOT library build from source of Hive 2.3 branch - Spark build with Hive 2.3.7-SNAPSHOT on jdk-11.0.6 Closes #27952 from Eric5553/GetTableByType. Authored-by: Eric Wu <492960551@qq.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-03-21 17:41:23 -07:00
yan ma	fae981e5f3	[SPARK-30773][ML] Support NativeBlas for level-1 routines ### What changes were proposed in this pull request? Change BLAS for part of level-1 routines(axpy, dot, scal(double, denseVector)) from java implementation to NativeBLAS when vector size>256 ### Why are the changes needed? In current ML BLAS.scala, all level-1 routines are fixed to use java implementation. But NativeBLAS(intel MKL, OpenBLAS) can bring up to 11X performance improvement based on performance test which apply direct calls against these methods. We should provide a way to allow user take advantage of NativeBLAS for level-1 routines. Here we do it through switching to NativeBLAS for these methods from f2jBLAS. ### Does this PR introduce any user-facing change? Yes, methods axpy, dot, scal in level-1 routines will switch to NativeBLAS when it has more than nativeL1Threshold(fixed value 256) elements and will fallback to f2jBLAS if native BLAS is not properly configured in system. ### How was this patch tested? Perf test direct calls level-1 routines Closes #27546 from yma11/SPARK-30773. Lead-authored-by: yan ma <yan.ma@intel.com> Co-authored-by: Ma Yan <yan.ma@intel.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2020-03-20 10:32:58 -05:00
Huaxin Gao	9a990133f6	[SPARK-31138][ML] Add ANOVA Selector for continuous features and categorical labels ### What changes were proposed in this pull request? Add ANOVA Selector ### Why are the changes needed? Currently Spark only supports selection of categorical features, while there are many requirements for the selection of continuous distribution features. https://github.com/apache/spark/pull/27679 added FValueSelector for continuous features and continuous labels. This PR adds ANOVASelector for continuous features and categorical labels. ### Does this PR introduce any user-facing change? Yes, add a new Selector. ### How was this patch tested? add new test suites Closes #27895 from huaxingao/anova. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2020-03-20 10:28:00 -05:00
Kent Yao	88ae6c4481	[SPARK-31189][SQL][DOCS] Fix errors and missing parts for datetime pattern document ### What changes were proposed in this pull request? Fix errors and missing parts for datetime pattern document 1. The pattern we use is similar to DateTimeFormatter and SimpleDateFormat but not identical. So we shouldn't use any of them in the API docs but use a link to the doc of our own. 2. Some pattern letters are missing 3. Some pattern letters are explicitly banned - Set('A', 'c', 'e', 'n', 'N') 4. the second fraction pattern different logic for parsing and formatting ### Why are the changes needed? fix and improve doc ### Does this PR introduce any user-facing change? yes, new and updated doc ### How was this patch tested? pass Jenkins viewed locally with `jekyll serve` ![image](https://user-images.githubusercontent.com/8326978/77044447-6bd3bb00-69fa-11ea-8d6f-7084166c5dea.png) Closes #27956 from yaooqinn/SPARK-31189. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-03-20 21:59:26 +08:00
Maxim Gekk	b402bc900a	[SPARK-31183][SQL][FOLLOWUP] Move rebase tests to `AvroSuite` and check the rebase flag out of function bodies ### What changes were proposed in this pull request? 1. The tests added by #27953 are moved from `AvroLogicalTypeSuite` to `AvroSuite`. 2. Checking of the `rebaseDateTime` flag is moved out from functions bodies. ### Why are the changes needed? 1. The tests are moved because they are not directly related to logical types. 2. Checking the flag out of functions bodies should improve performance. ### Does this PR introduce any user-facing change? No ### How was this patch tested? By running Avro tests via the command `build/sbt avro/test` Closes #27964 from MaxGekk/rebase-avro-datetime-followup. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-03-20 19:02:54 +09:00
Maxim Gekk	6a668763b8	[SPARK-31195][SQL] Correct and reuse days rebase functions of `DateTimeUtils` in `DaysWritable` ### What changes were proposed in this pull request? In the PR, I propose to correct and re-use functions from `DateTimeUtils` for rebasing days before the cutover day `1582-10-15` in `org.apache.spark.sql.hive.DaysWritable`. ### Why are the changes needed? 0. Existing rebasing of days in `DaysWritable` is not correct. 1. To deduplicate code in `DaysWritable` 2. To use functions that are better tested and cross checked by loading dates/timestamps from Parquet/Avro files written by Spark 2.4.5 ### Does this PR introduce any user-facing change? This PR can introduce behavior change because the replaced code is different from the re-used code from `DateTimeUtils`. ### How was this patch tested? By existing test suite, for instance `HiveOrcHadoopFsRelationSuite`. Closes #27962 from MaxGekk/reuse-rebase-funcs. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-03-20 15:57:21 +09:00
Maxim Gekk	4766a36647	[SPARK-31183][SQL] Rebase date/timestamp from/to Julian calendar in Avro ### What changes were proposed in this pull request? The PR addresses the issue of compatibility with Spark 2.4 and earlier version in reading/writing dates and timestamp via Avro datasource. Previous releases are based on a hybrid calendar - Julian + Gregorian. Since Spark 3.0, Proleptic Gregorian calendar is used by default, see SPARK-26651. In particular, the issue pops up for dates/timestamps before 1582-10-15 when the hybrid calendar switches from/to Gregorian to/from Julian calendar. The same local date in different calendar is converted to different number of days since the epoch 1970-01-01. For example, the 1001-01-01 date is converted to: - -719164 in Julian calendar. Spark 2.4 saves the number as a value of DATE type into Avro files. - -719162 in Proleptic Gregorian calendar. Spark 3.0 saves the number as a date value. The PR proposes rebasing from/to Proleptic Gregorian calendar to the hybrid one under the SQL config: ``` spark.sql.legacy.avro.rebaseDateTime.enabled ``` which is set to `false` by default which means the rebasing is not performed by default. The details of the implementation: 1. Re-use 2 methods of `DateTimeUtils` added by the PR https://github.com/apache/spark/pull/27915 for rebasing microseconds. 2. Re-use 2 methods of `DateTimeUtils` added by the PR https://github.com/apache/spark/pull/27915 for rebasing days. 3. Use `rebaseGregorianToJulianMicros()` and `rebaseGregorianToJulianDays()` while saving timestamps/dates to Avro files if the SQL config is on. 4. Use `rebaseJulianToGregorianMicros()` and `rebaseJulianToGregorianDays()` while loading timestamps/dates from Avro files if the SQL config is on. 5. The SQL config `spark.sql.legacy.avro.rebaseDateTime.enabled` controls conversions from/to dates, and timestamps of the `timestamp-millis`, `timestamp-micros` logical types. ### Why are the changes needed? For the backward compatibility with Spark 2.4 and earlier versions. The changes allow users to read dates/timestamps saved by previous version, and get the same result. Also after the changes, users can enable the rebasing in write, and save dates/timestamps that can be loaded correctly by Spark 2.4 and earlier versions. ### Does this PR introduce any user-facing change? Yes, the timestamp `1001-01-01 01:02:03.123456` saved by Spark 2.4.5 as `timestamp-micros` is interpreted by Spark 3.0.0-preview2 differently: ```scala scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles") scala> spark.read.format("avro").load("/Users/maxim/tmp/before_1582/2_4_5_date_avro").show(false) +----------+ \|date \| +----------+ \|1001-01-07\| +----------+ ``` After the changes: ```scala scala> spark.conf.set("spark.sql.legacy.avro.rebaseDateTime.enabled", true) scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles") scala> spark.read.format("avro").load("/Users/maxim/tmp/before_1582/2_4_5_date_avro").show(false) +----------+ \|date \| +----------+ \|1001-01-01\| +----------+ ``` ### How was this patch tested? 1. Added tests to `AvroLogicalTypeSuite` to check rebasing in read. The test reads back avro files saved by Spark 2.4.5 via: ```shell $ export TZ="America/Los_Angeles" ``` ```scala scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles") scala> val df = Seq("1001-01-01").toDF("dateS").select($"dateS".cast("date").as("date")) df: org.apache.spark.sql.DataFrame = [date: date] scala> df.write.format("avro").save("/Users/maxim/tmp/before_1582/2_4_5_date_avro") scala> val df2 = Seq("1001-01-01 01:02:03.123456").toDF("tsS").select($"tsS".cast("timestamp").as("ts")) df2: org.apache.spark.sql.DataFrame = [ts: timestamp] scala> df2.write.format("avro").save("/Users/maxim/tmp/before_1582/2_4_5_ts_avro") scala> :paste // Entering paste mode (ctrl-D to finish) val timestampSchema = s""" \| { \| "namespace": "logical", \| "type": "record", \| "name": "test", \| "fields": [ \| {"name": "ts", "type": ["null", {"type": "long","logicalType": "timestamp-millis"}], "default": null} \| ] \| } \|""".stripMargin // Exiting paste mode, now interpreting. scala> df3.write.format("avro").option("avroSchema", timestampSchema).save("/Users/maxim/tmp/before_1582/2_4_5_ts_millis_avro") ``` 2. Added the following tests to `AvroLogicalTypeSuite` to check rebasing of dates/timestamps (in microsecond and millisecond precision). The tests write rebased a date/timestamps and read them back w/ enabled/disabled rebasing, and compare results. : - `rebasing microseconds timestamps in write` - `rebasing milliseconds timestamps in write` - `rebasing dates in write` Closes #27953 from MaxGekk/rebase-avro-datetime. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-03-20 13:57:49 +08:00
Dongjoon Hyun	f1cc86792f	[SPARK-31181][SQL][TESTS] Remove the default value assumption on CREATE TABLE test cases ### What changes were proposed in this pull request? A few `CREATE TABLE` test cases have some assumption on the default value of `LEGACY_CREATE_HIVE_TABLE_BY_DEFAULT_ENABLED`. This PR (SPARK-31181) makes the test cases more explicit from test-case side. The configuration change was tested via https://github.com/apache/spark/pull/27894 during discussing SPARK-31136. This PR has only the test case part from that PR. ### Why are the changes needed? This makes our test case more robust in terms of the default value of `LEGACY_CREATE_HIVE_TABLE_BY_DEFAULT_ENABLED`. Even in the case where we switch the conf value, that will be one-liner with no test case changes. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Pass the Jenkins with the existing tests. Closes #27946 from dongjoon-hyun/SPARK-EXPLICIT-TEST. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-03-20 12:28:57 +08:00
Takeshi Yamamuro	ca499e9409	[SPARK-25121][SQL] Supports multi-part table names for broadcast hint resolution ### What changes were proposed in this pull request? This pr fixed code to respect a database name for broadcast table hint resolution. Currently, spark ignores a database name in multi-part names; ``` scala> sql("CREATE DATABASE testDb") scala> spark.range(10).write.saveAsTable("testDb.t") // without this patch scala> spark.range(10).join(spark.table("testDb.t"), "id").hint("broadcast", "testDb.t").explain == Physical Plan == (2) Project [id#24L] +- (2) BroadcastHashJoin [id#24L], [id#26L], Inner, BuildLeft :- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false])) : +- (1) Range (0, 10, step=1, splits=4) +- (2) Project [id#26L] +- (2) Filter isnotnull(id#26L) +- (2) FileScan parquet testdb.t[id#26L] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/Users/maropu/Repositories/spark/spark-2.3.1-bin-hadoop2.7/spark-warehouse..., PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: struct<id:bigint> // with this patch scala> spark.range(10).join(spark.table("testDb.t"), "id").hint("broadcast", "testDb.t").explain == Physical Plan == (2) Project [id#3L] +- (2) BroadcastHashJoin [id#3L], [id#5L], Inner, BuildRight :- (2) Range (0, 10, step=1, splits=4) +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, true])) +- (1) Project [id#5L] +- (1) Filter isnotnull(id#5L) +- (1) FileScan parquet testdb.t[id#5L] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/Users/maropu/Repositories/spark/spark-master/spark-warehouse/testdb.db/t], PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: struct<id:bigint> ``` This PR comes from https://github.com/apache/spark/pull/22198 ### Why are the changes needed? For better usability. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Added unit tests. Closes #27935 from maropu/SPARK-25121-2. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-03-19 20:11:04 -07:00
Dongjoon Hyun	c6a6d5e006	Revert "[SPARK-31170][SQL] Spark SQL Cli should respect hive-site.xml and spark.sql.warehouse.dir" This reverts commit `5bc0d76591`. Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-03-19 16:08:51 -07:00
Wenchen Fan	ac262cb272	[SPARK-30292][SQL][FOLLOWUP] ansi cast from strings to integral numbers (byte/short/int/long) should fail with fraction ### What changes were proposed in this pull request? This is a followup of https://github.com/apache/spark/pull/26933 Fraction string like "1.23" is definitely not a valid integral format and we should fail to do the cast under the ANSI mode. ### Why are the changes needed? correct the ANSI cast behavior from string to integral ### Does this PR introduce any user-facing change? Yes under ANSI mode, but ANSI mode is off by default. ### How was this patch tested? new test Closes #27957 from cloud-fan/ansi. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2020-03-20 00:52:09 +09:00
Kris Mok	a1776288f4	[SPARK-31187][SQL] Sort the whole-stage codegen debug output by codegenStageId ### What changes were proposed in this pull request? Spark SQL's whole-stage codegen (WSCG) supports dumping the generated code to help with debugging. One way to get the generated code is through `df.queryExecution.debug.codegen`, or SQL `EXPLAIN CODEGEN` statement. The generated code is currently printed without specific ordering, which can make debugging a bit annoying. This PR makes a minor improvement to sort the codegen dump by the `codegenStageId`, ascending. After this change, the following query: ```scala spark.range(10).agg(sum('id)).queryExecution.debug.codegen ``` will always dump the generated code in a natural, stable order. A version of this example with shorter output is: ``` spark.range(10).agg(sum('id)).queryExecution.debug.codegenToSeq.map(_._1).foreach(println) (1) HashAggregate(keys=[], functions=[partial_sum(id#8L)], output=[sum#15L]) +- (1) Range (0, 10, step=1, splits=16) (2) HashAggregate(keys=[], functions=[sum(id#8L)], output=[sum(id)#12L]) +- Exchange SinglePartition, true, [id=#30] +- (1) HashAggregate(keys=[], functions=[partial_sum(id#8L)], output=[sum#15L]) +- *(1) Range (0, 10, step=1, splits=16) ``` The number of codegen stages within a single SQL query tends to be very small, most likely < 50, so the overhead of adding the sorting shouldn't be significant. ### Why are the changes needed? Minor improvement to aid WSCG debugging. ### Does this PR introduce any user-facing change? No user-facing change for end-users; minor change for developers who debug WSCG generated code. ### How was this patch tested? Manually tested the output; all other tests still pass. Closes #27955 from rednaxelafx/codegen. Authored-by: Kris Mok <kris.mok@databricks.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2020-03-19 20:53:01 +09:00
Maxim Gekk	bb295d80e3	[SPARK-31159][SQL] Rebase date/timestamp from/to Julian calendar in parquet ### What changes were proposed in this pull request? The PR addresses the issue of compatibility with Spark 2.4 and earlier version in reading/writing dates and timestamp via Parquet datasource. Previous releases are based on a hybrid calendar - Julian + Gregorian. Since Spark 3.0, Proleptic Gregorian calendar is used by default, see SPARK-26651. In particular, the issue pops up for dates/timestamps before 1582-10-15 when the hybrid calendar switches from/to Gregorian to/from Julian calendar. The same local date in different calendar is converted to different number of days since the epoch 1970-01-01. For example, the 1001-01-01 date is converted to: - -719164 in Julian calendar. Spark 2.4 saves the number as a value of DATE type into parquet. - -719162 in Proleptic Gregorian calendar. Spark 3.0 saves the number as a date value. According to the parquet spec, parquet timestamps of the `TIMESTAMP_MILLIS`, `TIMESTAMP_MICROS` output type and parquet dates should be based on Proleptic Gregorian calendar but the `INT96` timestamps should be stored as Julian days. Since the version 3.0, Spark conforms the spec but for the backward compatibility with previous version, the PR proposes rebasing from/to Proleptic Gregorian calendar to the hybrid one under the SQL config: ``` spark.sql.legacy.parquet.rebaseDateTime.enabled ``` which is set to `false` by default which means the rebasing is not performed by default. The details of the implementation: 1. Added 2 methods to `DateTimeUtils` for rebasing microseconds. `rebaseGregorianToJulianMicros()` builds a local timestamp in Proleptic Gregorian calendar, extracts date-time fields `year`, `month`, ..., `second fraction` from the local timestamp and uses them to build another local timestamp based on the hybrid calendar (using `java.util.Calendar` API). After that it calculates the number of microseconds since the epoch using the resulted local timestamp. The function performs the conversion via the system JVM time zone for compatibility with Spark 2.4 and earlier versions. The `rebaseJulianToGregorianMicros()` function does reverse conversion. 2. Added 2 methods to `DateTimeUtils` for rebasing days. `rebaseGregorianToJulianDays()` builds a local date from the passed number of days since the epoch in Proleptic Gregorian calendar, interprets the resulted date as a local date in the hybrid calendar and gets the number of days since the epoch from the resulted local date. The conversion is performed via the `UTC` time zone because the conversion is independent from time zones, and `UTC` is selected to void round issues of casting days to milliseconds and back. The `rebaseJulianToGregorianDays()` functions does revers conversion. 3. Use `rebaseGregorianToJulianMicros()` and `rebaseGregorianToJulianDays()` while saving timestamps/dates to parquet files if the SQL config is on. 4. Use `rebaseJulianToGregorianMicros()` and `rebaseJulianToGregorianDays()` while loading timestamps/dates from parquet files if the SQL config is on. 5. The SQL config `spark.sql.legacy.parquet.rebaseDateTime.enabled` controls conversions from/to dates, timestamps of `TIMESTAMP_MILLIS`, `TIMESTAMP_MICROS`, see the SQL config `spark.sql.parquet.outputTimestampType`. 6. The rebasing is always performed for `INT96` timestamps, independently from `spark.sql.legacy.parquet.rebaseDateTime.enabled`. 7. Supported the vectorized parquet reader, see the SQL config `spark.sql.parquet.enableVectorizedReader`. ### Why are the changes needed? - For the backward compatibility with Spark 2.4 and earlier versions. The changes allow users to read dates/timestamps saved by previous version, and get the same result. Also after the changes, users can enable the rebasing in write, and save dates/timestamps that can be loaded correctly by Spark 2.4 and earlier versions. - It fixes the bug of incorrect saving/loading timestamps of the `INT96` type ### Does this PR introduce any user-facing change? Yes, the timestamp `1001-01-01 01:02:03.123456` saved by Spark 2.4.5 as `TIMESTAMP_MICROS` is interpreted by Spark 3.0.0-preview2 differently: ```scala scala> spark.read.parquet("/Users/maxim/tmp/before_1582/2_4_5_ts_micros").show(false) +--------------------------+ \|ts \| +--------------------------+ \|1001-01-07 11:32:20.123456\| +--------------------------+ ``` After the changes: ```scala scala> spark.conf.set("spark.sql.legacy.parquet.rebaseDateTime.enabled", true) scala> spark.read.parquet("/Users/maxim/tmp/before_1582/2_4_5_ts_micros").show(false) +--------------------------+ \|ts \| +--------------------------+ \|1001-01-01 01:02:03.123456\| +--------------------------+ ``` ### How was this patch tested? 1. Added tests to `ParquetIOSuite` to check rebasing in read for regular reader and vectorized parquet reader. The test reads back parquet files saved by Spark 2.4.5 via: ```shell $ export TZ="America/Los_Angeles" ``` ```scala scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles") scala> val df = Seq("1001-01-01").toDF("dateS").select($"dateS".cast("date").as("date")) df: org.apache.spark.sql.DataFrame = [date: date] scala> df.write.parquet("/Users/maxim/tmp/before_1582/2_4_5_date") scala> val df = Seq("1001-01-01 01:02:03.123456").toDF("tsS").select($"tsS".cast("timestamp").as("ts")) df: org.apache.spark.sql.DataFrame = [ts: timestamp] scala> spark.conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS") scala> df.write.parquet("/Users/maxim/tmp/before_1582/2_4_5_ts_micros") scala> spark.conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MILLIS") scala> df.write.parquet("/Users/maxim/tmp/before_1582/2_4_5_ts_millis") scala> spark.conf.set("spark.sql.parquet.outputTimestampType", "INT96") scala> df.write.parquet("/Users/maxim/tmp/before_1582/2_4_5_ts_int96") ``` 2. Manually check the write code path. Save date/timestamps (TIMESTAMP_MICROS, TIMESTAMP_MILLIS, INT96) by Spark 3.1.0-SNAPSHOT (after the changes): ```bash $ export TZ="America/Los_Angeles" ``` ```scala scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles") scala> spark.conf.set("spark.sql.legacy.parquet.rebaseDateTime.enabled", true) scala> spark.conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS") scala> val df = Seq(("1001-01-01", "1001-01-01 01:02:03.123456")).toDF("dateS", "tsS").select($"dateS".cast("date").as("d"), $"tsS".cast("timestamp").as("ts")) df: org.apache.spark.sql.DataFrame = [d: date, ts: timestamp] scala> df.write.parquet("/Users/maxim/tmp/before_1582/3_0_0_micros") scala> spark.read.parquet("/Users/maxim/tmp/before_1582/3_0_0_micros").show(false) +----------+--------------------------+ \|d \|ts \| +----------+--------------------------+ \|1001-01-01\|1001-01-01 01:02:03.123456\| +----------+--------------------------+ ``` Read the saved date/timestamp by Spark 2.4.5: ```scala scala> spark.conf.set("spark.sql.session.timeZone", "America/Los_Angeles") scala> spark.read.parquet("/Users/maxim/tmp/before_1582/3_0_0_micros").show(false) +----------+--------------------------+ \|d \|ts \| +----------+--------------------------+ \|1001-01-01\|1001-01-01 01:02:03.123456\| +----------+--------------------------+ ``` Closes #27915 from MaxGekk/rebase-parquet-datetime. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-03-19 12:49:51 +08:00
sarthfrey-db	6fd3138e9c	[SPARK-30667][CORE] Change BarrierTaskContext allGather method return type This PR proposes that we change the return type of the `BarrierTaskContext.allGather` method to `Array[String]` instead of `ArrayBuffer[String]` since it is immutable. Based on discussion in #27640. cc zhengruifeng srowen Closes #27951 from sarthfrey/all-gather-api. Authored-by: sarthfrey-db <sarth.frey@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-03-19 12:12:39 +09:00
Burak Yavuz	4237251861	[SPARK-31178][SQL] Prevent V2 exec nodes from executing multiple times ### What changes were proposed in this pull request? This PR prevents the execution of V2 DataSource exec nodes multiple times when `collect()` is called on them. For V1 DataSources, commands would be executed as a RunnableCommand, which would cache the result as part of the `ExecutedCommandExec` node. We extend `V2CommandExec` for all the data writing commands so that they only get executed once as well. ### Why are the changes needed? Calling `collect()` on a SQL command that inserts data or creates a table gets executed multiple times otherwise. ### Does this PR introduce any user-facing change? Fixes a bug ### How was this patch tested? Unit tests Closes #27941 from brkyvz/doubleInsert. Authored-by: Burak Yavuz <brkyvz@gmail.com> Signed-off-by: Burak Yavuz <brkyvz@gmail.com>	2020-03-18 18:07:24 -07:00

... 5 6 7 8 9 ...

27117 commits