### What changes were proposed in this pull request?
Repeated `sapply` avoided in internal `checkSchemaInArrow`
### Why are the changes needed?
Current implementation is doubly inefficient:
1. Repeatedly doing the same (95%) `sapply` loop
2. Doing scalar `==` on a vector (`==` should be done over the whole vector for efficiency)
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
By my trusty friend the CI bots
Closes#28372 from MichaelChirico/vectorize-types.
Authored-by: Michael Chirico <michael.chirico@grabtaxi.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
All instances of `paste(..., sep = "")` in the code are replaced with `paste0` which is more performant
### Why are the changes needed?
Performance & consistency (`paste0` is already used extensively in the R package)
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
None
Closes#28374 from MichaelChirico/r-paste0.
Authored-by: Michael Chirico <michael.chirico@grabtaxi.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This is a followup of https://github.com/apache/spark/pull/28318 to make the code more readable, by adding some comments to explain the trick and simplifying the code to use a boolean flag instead of 2 string sets.
This PR also fixes various problems:
1. the name check should consider case sensitivity
2. forward name conflicts like `with t as (with t2 as ...), t2 as ...` are not real conflicts and we shouldn't fail (see the sketch below).
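A minimal sketch of such a query (run through `spark.sql`; names and values are illustrative), which should now resolve the inner `t2` and succeed instead of failing on a supposed conflict:
```scala
// The body of t references t2, which is defined inside t itself; the outer
// t2, defined after t, is not a real conflict for that inner reference.
spark.sql("""
  WITH
    t AS (WITH t2 AS (SELECT 1 AS c) SELECT * FROM t2),
    t2 AS (SELECT 2 AS c)
  SELECT * FROM t
""").show()
```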
### Why are the changes needed?
correct the behavior
### Does this PR introduce any user-facing change?
Yes, it fixes the aforementioned behaviors.
### How was this patch tested?
new tests
Closes#28371 from cloud-fan/followup.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
Credit to LiangchangZ: this PR reuses the UT as well as the integration test in #24457. Thanks Liangchang for your solid work.
### What changes were proposed in this pull request?
Make metadata propagatable between Aliases.
### Why are the changes needed?
In Structured Streaming, we added an Alias for TimeWindow by default.
590b9a0132/sql/core/src/main/scala/org/apache/spark/sql/functions.scala (L3272-L3273)
For some cases like a stream join with watermark and window, users need to add an alias for convenience (we also added one in StreamingJoinSuite). The current metadata handling logic for `as` will lose the watermark metadata
590b9a0132/sql/core/src/main/scala/org/apache/spark/sql/Column.scala (L1049-L1054)
and finally cause the AnalysisException:
```
Stream-stream outer join between two streaming DataFrame/Datasets is not supported without a watermark in the join keys, or a watermark on the nullable side and an appropriate range condition
```
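A rough sketch of the scenario (assuming `left` and `right` are streaming DataFrames with an event-time column `ts`; column names and durations are illustrative):
```scala
import org.apache.spark.sql.functions.{col, expr, window}

// Both sides are watermarked and windowed; the alias on the window column
// used to drop the watermark metadata, triggering the AnalysisException above.
val leftWindowed = left.withWatermark("ts", "10 seconds")
  .select(window(col("ts"), "10 seconds").as("leftWindow"), col("leftId"))
val rightWindowed = right.withWatermark("ts", "10 seconds")
  .select(window(col("ts"), "10 seconds").as("rightWindow"), col("rightId"))

leftWindowed.join(
  rightWindowed,
  expr("leftWindow = rightWindow AND leftId = rightId"),
  "left_outer")
```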
### Does this PR introduce any user-facing change?
Bugfix for an alias on time window with watermark.
### How was this patch tested?
New UTs added. One for the functionality and one for explaining the common scenario.
Closes#28326 from xuanyuanking/SPARK-27340.
Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
Improve documentation for `gapply` in `SparkR`
### Why are the changes needed?
Spent a long time this weekend trying to figure out just what exactly `key` is in `gapply`'s `func`. I had assumed it would be a _named_ list, but apparently not -- the examples are working because `schema` is applying the name and the names of the output `data.frame` don't matter.
As near as I can tell the description I've added is correct, namely, that `key` is an unnamed list.
### Does this PR introduce any user-facing change?
No? Not in code. Only documentation.
### How was this patch tested?
Not tested. Documentation only.
Closes#28350 from MichaelChirico/r-gapply-key-doc.
Authored-by: Michael Chirico <michael.chirico@grabtaxi.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Remove all the extra whitespaces in the formatted explain.
### Why are the changes needed?
The number of extra whitespaces in the formatted explain becomes different between master and branch-3.0. This causes a problem: whenever we backport formatted-explain related tests from master to branch-3.0, they fail on branch-3.0. Besides, extra whitespaces are always disallowed in Spark. Thus, we should remove them as much as possible.
### Does this PR introduce any user-facing change?
No, formatted explain is newly added in Spark 3.0.
### How was this patch tested?
Updated sql query tests.
Closes#28315 from Ngone51/fix_extra_spaces.
Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR proposes to use a different approach instead of breaking it, per Michael's rubric added at https://spark.apache.org/versioning-policy.html. It deprecates the behaviour for now. It will be gradually removed in future releases.
After this change,
```python
import warnings
warnings.simplefilter("always")
from pyspark.sql.functions import *
df = spark.range(2)
map_col = create_map(lit(0), lit(100), lit(1), lit(200))
df.withColumn("mapped", map_col.getItem(col('id'))).show()
```
```
/.../python/pyspark/sql/column.py:311: DeprecationWarning: A column as 'key' in getItem is
deprecated as of Spark 3.0, and will not be supported in the future release. Use `column[key]`
or `column.key` syntax instead.
DeprecationWarning)
...
```
```python
import warnings
warnings.simplefilter("always")
from pyspark.sql.functions import *
df = spark.range(2)
struct_col = struct(lit(0), lit(100), lit(1), lit(200))
df.withColumn("struct", struct_col.getField(lit("col1"))).show()
```
```
/.../spark/python/pyspark/sql/column.py:336: DeprecationWarning: A column as 'name'
in getField is deprecated as of Spark 3.0, and will not be supported in the future release. Use
`column[name]` or `column.name` syntax instead.
DeprecationWarning)
```
### Why are the changes needed?
To prevent the radical behaviour change after the amended versioning policy.
### Does this PR introduce any user-facing change?
Yes, it will show the deprecated warning message.
### How was this patch tested?
Manually tested.
Closes#28327 from HyukjinKwon/SPARK-29664.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
To follow ANSI, the expressions `date + interval`, `interval + date` and `date - interval` should only accept intervals whose `microseconds` part is 0.
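A sketch of the intended behavior (whether the check applies may depend on ANSI settings, and the exact error wording may differ):
```scala
// A day-precision interval is accepted.
spark.sql("SELECT DATE'2020-01-01' + INTERVAL 1 DAY").show()

// An interval with a non-zero microseconds part (e.g. seconds) is expected
// to be rejected for date +/- interval under the rule described above.
spark.sql("SELECT DATE'2020-01-01' + INTERVAL 1 SECOND").show()
```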
### Why are the changes needed?
Better ANSI compliance
### Does this PR introduce any user-facing change?
No, this PR should target 3.0.0 in which this feature is newly added.
### How was this patch tested?
add more unit tests
Closes#28310 from yaooqinn/SPARK-31527.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR modifies LegacyDateFormatter#parse to return proleptic Gregorian days rather than hybrid Julian days.
### Why are the changes needed?
The legacy time parser currently returns epoch days in the hybrid Julian calendar. However, the callers to the legacy parser (e.g., UnivocityParser, JacksonParser) expect epoch days in the proleptic Gregorian calendar. As a result, pre-Gregorian dates like '1000-01-01' get interpreted as '1000-01-06'.
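A rough illustration of the mismatch using plain JDK APIs (not Spark code): the default `GregorianCalendar` models the hybrid Julian calendar (Julian before 1582), while `java.time.LocalDate` is proleptic Gregorian, so their epoch-day values for old dates differ by several days.
```scala
import java.time.LocalDate
import java.util.{Calendar, GregorianCalendar, TimeZone}

// Proleptic Gregorian epoch days for 1000-01-01.
val prolepticDays = LocalDate.of(1000, 1, 1).toEpochDay

// Hybrid Julian epoch days for the "same" date.
val cal = new GregorianCalendar(TimeZone.getTimeZone("UTC"))
cal.clear()
cal.set(1000, Calendar.JANUARY, 1)
val hybridDays = Math.floorDiv(cal.getTimeInMillis, 24L * 60 * 60 * 1000)

// The two counts differ by a few days, which is why '1000-01-01' ended up
// being read back as '1000-01-06' when the calendars were mixed.
println(s"proleptic=$prolepticDays hybrid=$hybridDays")
```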
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Manual testing and modified existing unit tests.
Closes#28345 from bersprockets/SPARK-31557.
Authored-by: Bruce Robbins <bersprockets@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
`CliSuite.runCliWithin` was not matching for expected results correctly. It was matching for expected lines anywhere in stdout or stderr.
On the example of `Single command with --database` test:
In
```
runCliWithin(2.minute)(
"CREATE DATABASE hive_db_test;"
-> "",
"USE hive_test;"
-> "",
"CREATE TABLE hive_test(key INT, val STRING);"
-> "",
"SHOW TABLES;"
-> "hive_test"
)
```
It was looking for lines containing "", "", "" and then "hive_test".
However, the string "hive_test" was contained in "hive_test_db", and hence:
```
2020-04-08 17:53:12,752 INFO CliSuite - 2020-04-08 17:53:12.752 - stderr> Spark master: local, Application Id: local-1586368384172
2020-04-08 17:53:12,765 INFO CliSuite - stderr> found expected output line 0: ""
2020-04-08 17:53:12,765 INFO CliSuite - 2020-04-08 17:53:12.765 - stdout> spark-sql> CREATE DATABASE hive_db_test;
2020-04-08 17:53:12,765 INFO CliSuite - stdout> found expected output line 1: ""
2020-04-08 17:53:17,688 INFO CliSuite - 2020-04-08 17:53:17.688 - stderr> chgrp: changing ownership of 'file:///tmp/spark-8811f069-4cba-4c71-a5d6-62dd925fb5ff': chown: changing group of '/tmp/spark-8811f069-4cba-4c71-a5d6-62dd925fb5ff': Operation not permitted
2020-04-08 17:53:12,765 INFO CliSuite - stderr> found expected output line 2: ""
2020-04-08 17:53:18,069 INFO CliSuite - 2020-04-08 17:53:18.069 - stderr> Time taken: 5.265 seconds
2020-04-08 17:53:18,087 INFO CliSuite - 2020-04-08 17:53:18.087 - stdout> spark-sql> USE hive_test;
2020-04-08 17:53:12,765 INFO CliSuite - stdout> found expected output line 3: "hive_test"
2020-04-08 17:53:21,742 INFO CliSuite - Found all expected output.
```
Because of that, it could kill the CLI process without really even creating the table. This was not expected. The test could be flaky depending on whether process.destroy() in the finally block managed to kill it before it actually creates the table.
I make the output checking more robust so that it does not match on unexpected output, by making it check the echo of the query and its output on the CLI. Also, wait for the CLI process to finish gracefully (triggered by closing its stdin), instead of killing it forcibly.
### Why are the changes needed?
org.apache.spark.sql.hive.thriftserver.CliSuite was flaky, and didn't test outputs as expected.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Existing tests in CLISuite. Tested several times with no flakiness. Was getting flaky results almost on every run before.
```
[info] CliSuite:
[info] - load warehouse dir from hive-site.xml (12 seconds, 568 milliseconds)
[info] - load warehouse dir from --hiveconf (10 seconds, 648 milliseconds)
[info] - load warehouse dir from --conf spark(.hadoop).hive.* (20 seconds, 653 milliseconds)
[info] - load warehouse dir from spark.sql.warehouse.dir (9 seconds, 763 milliseconds)
[info] - Simple commands (16 seconds, 238 milliseconds)
[info] - Single command with -e (9 seconds, 967 milliseconds)
[info] - Single command with --database (21 seconds, 205 milliseconds)
[info] - Commands using SerDe provided in --jars (15 seconds, 51 milliseconds)
[info] - SPARK-29022: Commands using SerDe provided in --hive.aux.jars.path (14 seconds, 625 milliseconds)
[info] - SPARK-11188 Analysis error reporting (7 seconds, 960 milliseconds)
[info] - SPARK-11624 Spark SQL CLI should set sessionState only once (7 seconds, 424 milliseconds)
[info] - list jars (9 seconds, 520 milliseconds)
[info] - list jar <jarfile> (9 seconds, 277 milliseconds)
[info] - list files (9 seconds, 828 milliseconds)
[info] - list file <filepath> (9 seconds, 646 milliseconds)
[info] - apply hiveconf from cli command (9 seconds, 469 milliseconds)
[info] - Support hive.aux.jars.path (10 seconds, 676 milliseconds)
[info] - SPARK-28840 test --jars command (10 seconds, 921 milliseconds)
[info] - SPARK-28840 test --jars and hive.aux.jars.path command (11 seconds, 49 milliseconds)
[info] - SPARK-29022 Commands using SerDe provided in ADD JAR sql (14 seconds, 210 milliseconds)
[info] - SPARK-26321 Should not split semicolon within quoted string literals (12 seconds, 729 milliseconds)
[info] - Pad Decimal numbers with trailing zeros to the scale of the column (10 seconds, 381 milliseconds)
[info] - SPARK-30049 Should not complain for quotes in commented lines (10 seconds, 935 milliseconds)
[info] - SPARK-30049 Should not complain for quotes in commented with multi-lines (20 seconds, 731 milliseconds)
```
Closes#28156 from juliuszsompolski/SPARK-31388.
Authored-by: Juliusz Sompolski <julek@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Fix Pyspark CrossValidator/TrainValidationSplit with pipeline estimator cannot save and load model.
Most pyspark estimators/transformers inherit `JavaParams`, but some estimators are special (in order to support pure python implemented nested estimators/transformers):
* Pipeline
* OneVsRest
* CrossValidator
* TrainValidationSplit
Note, however, that in PySpark the model readers/writers of the estimators listed above currently do NOT support pure-Python nested estimators/transformers, because they use the Java reader/writer wrapper as the Python-side reader/writer.
The PySpark CrossValidator/TrainValidationSplit model reader/writer requires all estimators to define `_transfer_param_map_to_java` and `_transfer_param_map_from_java` (used in model read/write).
The OneVsRest class already defines these two methods, but Pipeline does not, which leads to this bug.
In this PR I add `_transfer_param_map_to_java` and `_transfer_param_map_from_java` to the Pipeline class.
### Why are the changes needed?
Bug fix.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Unit test.
Manually test in pyspark shell:
1) CrossValidator with Simple Pipeline estimator
```
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.feature import HashingTF, Tokenizer
from pyspark.ml.tuning import CrossValidator, CrossValidatorModel, ParamGridBuilder
training = spark.createDataFrame([
(0, "a b c d e spark", 1.0),
(1, "b d", 0.0),
(2, "spark f g h", 1.0),
(3, "hadoop mapreduce", 0.0),
(4, "b spark who", 1.0),
(5, "g d a y", 0.0),
(6, "spark fly", 1.0),
(7, "was mapreduce", 0.0),
], ["id", "text", "label"])
# Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and lr.
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
lr = LogisticRegression(maxIter=10)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
paramGrid = ParamGridBuilder() \
    .addGrid(hashingTF.numFeatures, [10, 100, 1000]) \
    .addGrid(lr.regParam, [0.1, 0.01]) \
    .build()
crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=BinaryClassificationEvaluator(),
                          numFolds=2)  # use 3+ folds in practice
# Run cross-validation, and choose the best set of parameters.
cvModel = crossval.fit(training)
cvModel.save('/tmp/cv_model001')
CrossValidatorModel.load('/tmp/cv_model001')
```
2) CrossValidator with a Pipeline estimator which includes a OneVsRest estimator stage, where the OneVsRest estimator nests a LogisticRegression estimator.
```
from pyspark.ml.linalg import Vectors
from pyspark.ml import Estimator, Model
from pyspark.ml.classification import LogisticRegression, LogisticRegressionModel, OneVsRest
from pyspark.ml.evaluation import BinaryClassificationEvaluator, \
MulticlassClassificationEvaluator, RegressionEvaluator
from pyspark.ml.linalg import Vectors
from pyspark.ml.param import Param, Params
from pyspark.ml.tuning import CrossValidator, CrossValidatorModel, ParamGridBuilder, \
TrainValidationSplit, TrainValidationSplitModel
from pyspark.sql.functions import rand
from pyspark.testing.mlutils import SparkSessionTestCase
dataset = spark.createDataFrame(
[(Vectors.dense([0.0]), 0.0),
(Vectors.dense([0.4]), 1.0),
(Vectors.dense([0.5]), 0.0),
(Vectors.dense([0.6]), 1.0),
(Vectors.dense([1.0]), 1.0)] * 10,
["features", "label"])
ova = OneVsRest(classifier=LogisticRegression())
lr1 = LogisticRegression().setMaxIter(100)
lr2 = LogisticRegression().setMaxIter(150)
grid = ParamGridBuilder().addGrid(ova.classifier, [lr1, lr2]).build()
evaluator = MulticlassClassificationEvaluator()
pipeline = Pipeline(stages=[ova])
cv = CrossValidator(estimator=pipeline, estimatorParamMaps=grid, evaluator=evaluator)
cvModel = cv.fit(dataset)
cvModel.save('/tmp/model002')
cvModel2 = CrossValidatorModel.load('/tmp/model002')
```
The TrainValidationSplit testing code is similar, so I do not paste it.
Closes#28279 from WeichenXu123/fix_pipeline_tuning.
Authored-by: Weichen Xu <weichen.xu@databricks.com>
Signed-off-by: Xiangrui Meng <meng@databricks.com>
### What changes were proposed in this pull request?
Convert `java.time.LocalDate` to `java.sql.Date` in filters pushed down to the ORC datasource when the Java 8 time API is enabled.
Closes#28272
### Why are the changes needed?
The changes fix the exception raised while pushing date filters when `spark.sql.datetime.java8API.enabled` is set to `true`:
```
Wrong value class java.time.LocalDate for DATE.EQUALS leaf
java.lang.IllegalArgumentException: Wrong value class java.time.LocalDate for DATE.EQUALS leaf
at org.apache.hadoop.hive.ql.io.sarg.SearchArgumentImpl$PredicateLeafImpl.checkLiteralType(SearchArgumentImpl.java:192)
at org.apache.hadoop.hive.ql.io.sarg.SearchArgumentImpl$PredicateLeafImpl.<init>(SearchArgumentImpl.java:75)
at org.apache.hadoop.hive.ql.io.sarg.SearchArgumentImpl$BuilderImpl.equals(SearchArgumentImpl.java:352)
at org.apache.spark.sql.execution.datasources.orc.OrcFilters$.buildLeafSearchArgument(OrcFilters.scala:229)
```
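A sketch of the kind of conversion the fix performs before building the ORC search argument (the helper name is illustrative, not the actual `OrcFilters` code):
```scala
import java.sql.Date
import java.time.LocalDate

// ORC's SearchArgument builder only understands java.sql.Date for DATE
// leaves, so a java.time.LocalDate filter value is converted first.
def toOrcDateLiteral(value: Any): Any = value match {
  case localDate: LocalDate => Date.valueOf(localDate)
  case other => other
}
```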
### Does this PR introduce any user-facing change?
Yes
### How was this patch tested?
Added tests to `OrcFilterSuite`.
Closes#28261 from MaxGekk/orc-date-filter-pushdown.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This PR fixes a CTE substitution issue so as to the following SQL return the correct empty result:
```
WITH t(c) AS (SELECT 1)
SELECT * FROM t
WHERE c IN (
WITH t(c) AS (SELECT 2)
SELECT * FROM t
)
```
Before this PR the result was `1`.
### Why are the changes needed?
To fix a correctness issue.
### Does this PR introduce any user-facing change?
Yes, fixes a correctness issue.
### How was this patch tested?
Added new test case.
Closes#28318 from peter-toth/SPARK-31535-fix-nested-cte-substitution.
Authored-by: Peter Toth <peter.toth@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
1. Add class info output in org.apache.spark.ml.util.SchemaUtils#checkColumnType to distinguish Vectors in ml and mllib
2. Add a unit test
### Why are the changes needed?
The catalogString doesn't distinguish Vectors in ml and mllib when an mllib vector is misused in ml
https://issues.apache.org/jira/browse/SPARK-31400
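A hypothetical misuse sketch (assuming a SparkSession `spark`; the estimator choice is only for illustration):
```scala
import org.apache.spark.mllib.linalg.Vectors    // old mllib vectors
import org.apache.spark.ml.feature.MinMaxScaler // expects ml vectors

val df = spark.createDataFrame(Seq(
  (0, Vectors.dense(1.0, 2.0)),
  (1, Vectors.dense(3.0, 4.0))
)).toDF("id", "features")

// The catalogString looks the same for both vector types, so the old error
// message was confusing; with the class info added, the message can point
// at the mllib class directly.
new MinMaxScaler().setInputCol("features").setOutputCol("scaled").fit(df)
```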
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Unit test is added
Closes#28347 from TJX2014/master-catalogString-distinguish-Vectors-in-ml-and-mllib.
Authored-by: TJX2014 <xiaoxingstack@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
1. Remove console.log() calls, which seem unnecessary in releases.
2. Replace double equals with triple equals.
3. Reuse the jQuery selector.
### Why are the changes needed?
For better code quality.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Existing tests + manual test.
Closes#28333 from gengliangwang/removeLog.
Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
### What changes were proposed in this pull request?
Fix flakiness by checking `1970/01/01` instead of `1970`.
The test was added by SPARK-27125 for 3.0.0.
### Why are the changes needed?
the `org.apache.spark.sql.execution.ui.AllExecutionsPageSuite.SPARK-27019:correctly display SQL page when event reordering happens` test is flaky because it only checks that the `html` content does not contain 1970. I will add a ticket to check and fix that.
In the specific failure https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/121799/testReport, it failed because the `html`
```
...
<td sorttable_customkey="1587806019707">
...
```
contained `1970`.
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
passing jenkins
Closes#28344 from yaooqinn/SPARK-31564.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
In this PR, I propose to fix the `InSet.sql` method for the cases when the input collection contains values of internal Catalyst types, for instance `UTF8String`. Elements of the input set `hset` are converted to Scala types and wrapped in `Literal` to properly form the SQL view of the input collection.
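A sketch of how such an `InSet` shows up (assuming a SparkSession `spark` with `spark.implicits._` imported):
```scala
import spark.implicits._

// With more values than spark.sql.optimizer.inSetConversionThreshold
// (10 by default), the optimizer rewrites In into InSet; its .sql used to
// leak raw internal values (e.g. UTF8String) instead of proper literals.
val df = Seq("a", "b", "c").toDF("c1")
df.filter($"c1".isin((0 to 20).map(_.toString): _*)).explain(true)
```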
### Why are the changes needed?
The changes fixed the bug in `InSet.sql` that made a wrong assumption about the types of collection elements. See more details in SPARK-31563.
### Does this PR introduce any user-facing change?
Highly likely, not.
### How was this patch tested?
Added a test to `ColumnExpressionSuite`
Closes#28343 from MaxGekk/fix-InSet-sql.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
Fix the wrong fetch size.
### Why are the changes needed?
The fetch size should be the sum of the size of the merged block and the total size of the blocks being merged, but we missed the size of the merged block.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Added a regression test.
Closes#28301 from Ngone51/fix_merged_block_size.
Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This PR proposes to remove the non-existent `hiveClientCalls.count` metric documentation of `CodeGenerator` of the Spark metrics system in the monitoring guide.
There is a duplicated `hiveClientCalls.count` metric in both `namespace=HiveExternalCatalog` and `namespace=CodeGenerator` bullet lists, but there is only one defined inside object `HiveCatalogMetrics`.
Closes#28292 from wezhang/monitoringdoc.
Authored-by: Wei Zhang <wezhang@outlook.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
Add V1/V2 tests for TextSuite and WholeTextFileSuite
### Why are the changes needed?
This part has been missing since #24207. We should have these tests for test coverage.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Unit tests.
Closes#28335 from gengliangwang/testV2Suite.
Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
The two methods `arrayClassFor` and `dataTypeFor` in `ScalaReflection` call each other circularly, but the cases in `dataTypeFor` are not fully handled in `arrayClassFor`.
For example:
```scala
scala> implicit def newArrayEncoder[T <: Array[_] : TypeTag]: Encoder[T] = ExpressionEncoder()
newArrayEncoder: [T <: Array[_]](implicit evidence$1: reflect.runtime.universe.TypeTag[T])org.apache.spark.sql.Encoder[T]
scala> val decOne = Decimal(1, 38, 18)
decOne: org.apache.spark.sql.types.Decimal = 1E-18
scala> val decTwo = Decimal(2, 38, 18)
decTwo: org.apache.spark.sql.types.Decimal = 2E-18
scala> val decSpark = Array(decOne, decTwo)
decSpark: Array[org.apache.spark.sql.types.Decimal] = Array(1E-18, 2E-18)
scala> Seq(decSpark).toDF()
java.lang.ClassCastException: org.apache.spark.sql.types.DecimalType cannot be cast to org.apache.spark.sql.types.ObjectType
at org.apache.spark.sql.catalyst.ScalaReflection$.$anonfun$arrayClassFor$1(ScalaReflection.scala:131)
at scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:69)
at org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects(ScalaReflection.scala:879)
at org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects$(ScalaReflection.scala:878)
at org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:49)
at org.apache.spark.sql.catalyst.ScalaReflection$.arrayClassFor(ScalaReflection.scala:120)
at org.apache.spark.sql.catalyst.ScalaReflection$.$anonfun$dataTypeFor$1(ScalaReflection.scala:105)
at scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:69)
at org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects(ScalaReflection.scala:879)
at org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects$(ScalaReflection.scala:878)
at org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:49)
at org.apache.spark.sql.catalyst.ScalaReflection$.dataTypeFor(ScalaReflection.scala:88)
at org.apache.spark.sql.catalyst.ScalaReflection$.$anonfun$serializerForType$1(ScalaReflection.scala:399)
at scala.reflect.internal.tpe.TypeConstraints$UndoLog.undo(TypeConstraints.scala:69)
at org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects(ScalaReflection.scala:879)
at org.apache.spark.sql.catalyst.ScalaReflection.cleanUpReflectionObjects$(ScalaReflection.scala:878)
at org.apache.spark.sql.catalyst.ScalaReflection$.cleanUpReflectionObjects(ScalaReflection.scala:49)
at org.apache.spark.sql.catalyst.ScalaReflection$.serializerForType(ScalaReflection.scala:393)
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:57)
at newArrayEncoder(<console>:57)
... 53 elided
scala>
```
In this PR, we add the missing cases to `arrayClassFor`
### Why are the changes needed?
bugfix as described above
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
add a test for array encoders
Closes#28324 from yaooqinn/SPARK-31552.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This is a followup PR discussed [here](https://github.com/apache/spark/pull/28215#discussion_r410748547).
### Why are the changes needed?
It would be good to re-enable `DB2IntegrationSuite` and upgrade the docker image inside to use the latest.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Existing docker integration tests.
Closes#28325 from gaborgsomogyi/SPARK-31533.
Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
SparkSessionBuilder should not propagate static SQL configurations to the existing active/default SparkSession.
This seems to be a long-standing bug.
```scala
scala> spark.sql("set spark.sql.warehouse.dir").show
+--------------------+--------------------+
| key| value|
+--------------------+--------------------+
|spark.sql.warehou...|file:/Users/kenty...|
+--------------------+--------------------+
scala> spark.sql("set spark.sql.warehouse.dir=2");
org.apache.spark.sql.AnalysisException: Cannot modify the value of a static config: spark.sql.warehouse.dir;
at org.apache.spark.sql.RuntimeConfig.requireNonStaticConf(RuntimeConfig.scala:154)
at org.apache.spark.sql.RuntimeConfig.set(RuntimeConfig.scala:42)
at org.apache.spark.sql.execution.command.SetCommand.$anonfun$x$7$6(SetCommand.scala:100)
at org.apache.spark.sql.execution.command.SetCommand.run(SetCommand.scala:156)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68)
at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79)
at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:229)
at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3644)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:103)
at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3642)
at org.apache.spark.sql.Dataset.<init>(Dataset.scala:229)
at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:100)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:97)
at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:607)
at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:764)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:602)
... 47 elided
scala> import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.SparkSession
scala> SparkSession.builder.config("spark.sql.warehouse.dir", "xyz").get
getClass getOrCreate
scala> SparkSession.builder.config("spark.sql.warehouse.dir", "xyz").getOrCreate
20/04/23 23:49:13 WARN SparkSession$Builder: Using an existing SparkSession; some configuration may not take effect.
res7: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession6403d574
scala> spark.sql("set spark.sql.warehouse.dir").show
+--------------------+-----+
| key|value|
+--------------------+-----+
|spark.sql.warehou...| xyz|
+--------------------+-----+
scala>
```
### Why are the changes needed?
bugfix as shown in the previous section
### Does this PR introduce any user-facing change?
Yes, static SQL configurations with SparkSession.builder.config do not propagate to any existing or new SparkSession instances.
### How was this patch tested?
new ut.
Closes#28316 from yaooqinn/SPARK-31532.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
This PR aims to add a benchmark suite for nested predicate pushdown with Parquet files:
Performance comparison: nested predicate pushdown disabled vs enabled, with the following query scenarios (a query sketch follows below):
1. When the predicate is pushed down, the Parquet reader is able to filter out all the row groups without loading them.
2. When the predicate is pushed down, the Parquet reader only loads one of the row groups.
3. When the predicate is pushed down, the Parquet reader can't filter out any row group, in order to see whether we introduce too much overhead when enabling nested predicate pushdown.
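A sketch of the query shape being benchmarked (assuming a SparkSession `spark`; the path and column names are made up):
```scala
import org.apache.spark.sql.functions.col

// A filter on a nested field; with nested predicate pushdown enabled the
// Parquet reader can use it to skip entire row groups.
val df = spark.read.parquet("/tmp/benchmark/nested_data")
df.filter(col("person.id") === 42L).count()
```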
### Why are the changes needed?
No benchmark exists today for nested fields predicate pushdown performance evaluation.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Benchmark runs and reporting result.
Closes#28319 from JiJiTang/SPARK-31364.
Authored-by: Jian Tang <jian_tang@apple.com>
Signed-off-by: DB Tsai <d_tsai@apple.com>
### What changes were proposed in this pull request?
After changes in SPARK-20628, CoarseGrainedSchedulerBackend can decommission an executor and stop assigning new tasks on it. We should also decommission the corresponding block managers in the same way, i.e. move the cached RDD blocks from those executors to other active executors.
### Why are the changes needed?
We need to gracefully decommission the block managers so that the underlying RDD cache blocks are not lost in case the executors are taken away forcefully after some timeout (because of spotloss/pre-emptible VM etc). Its good to save as much cache data as possible.
Also, in the future, once the decommissioning signal comes from the cluster manager (say YARN/Mesos, etc.), dynamic allocation plus this change gives us the opportunity to downscale executors faster by making them free of cache data.
Note that this is a best-effort approach. We try to move cache blocks from decommissioning executors to active executors. If the active executors don't have free resources available for caching, then a decommissioning executor will keep the cache blocks it was not able to move and will still be able to serve them.
Current overall Flow:
1. CoarseGrainedSchedulerBackend receives a signal to decommissionExecutor. On receiving the signal, it does two things: stop assigning new tasks (SPARK-20628), and send another message to BlockManagerMasterEndpoint (via BlockManagerMaster) to decommission the corresponding BlockManager.
2. BlockManagerMasterEndpoint receives the "DecommissionBlockManagers" message. On receiving this, it moves the corresponding block managers to the "decommissioning" state. All decommissioning BMs are excluded from the getPeers RPC call, which is used for replication. All these decommissioning BMs are also sent a message from BlockManagerMasterEndpoint to start the decommissioning process on themselves.
3. The BlockManager on a worker (say BM-x) receives the "DecommissionBlockManager" message. It then starts a BlockManagerDecommissionManager thread to offload all the cached RDD blocks. This thread can make multiple reattempts to decommission the existing cache blocks (multiple reattempts might be needed as there might not be sufficient space in other active BMs initially).
### Does this PR introduce any user-facing change?
NO
### How was this patch tested?
Added UTs.
Closes#27864 from prakharjain09/SPARK-20732-rddcache-1.
Authored-by: Prakhar Jain <prakharjain09@gmail.com>
Signed-off-by: Holden Karau <hkarau@apple.com>
### What changes were proposed in this pull request?
apply Lemma 1 in [Using the Triangle Inequality to Accelerate K-Means](https://www.aaai.org/Papers/ICML/2003/ICML03-022.pdf):
> Let x be a point, and let b and c be centers. If d(b,c) >= 2d(x,b), then d(x,c) >= d(x,b).
It can be directly applied in EuclideanDistance, but not in CosineDistance.
However, for CosineDistance we can luckily get a variant in the space of radian/angle.
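For reference, the lemma follows in one line from the triangle inequality `d(b,c) <= d(x,b) + d(x,c)`:
```latex
d(x,c) \ge d(b,c) - d(x,b) \ge 2\,d(x,b) - d(x,b) = d(x,b)
```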
### Why are the changes needed?
It helps improve the performance of prediction and training (mostly).
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
existing test suites
Closes#27758 from zhengruifeng/km_triangle.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
The `LIKE ANY/SOME` and `LIKE ALL` operators are mostly used when matching a text field against a number of patterns. For example:
Teradata / Hive 3.0 / Snowflake:
```sql
--like any
select 'foo' LIKE ANY ('%foo%','%bar%');
--like all
select 'foo' LIKE ALL ('%foo%','%bar%');
```
PostgreSQL:
```sql
-- like any
select 'foo' LIKE ANY (array['%foo%','%bar%']);
-- like all
select 'foo' LIKE ALL (array['%foo%','%bar%']);
```
This PR adds support for these two operators.
More details:
https://docs.teradata.com/reader/756LNiPSFdY~4JcCCcR5Cw/4~AyrPNmDN0Xk4SALLo6aQ
https://issues.apache.org/jira/browse/HIVE-15229
https://docs.snowflake.net/manuals/sql-reference/functions/like_any.html
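For illustration, the expected semantics once this PR is applied (a sketch run through `spark.sql`): `LIKE ANY` needs at least one pattern to match, while `LIKE ALL` needs every pattern to match.
```scala
spark.sql("SELECT 'foo' LIKE ANY ('%foo%', '%bar%')").show()  // true
spark.sql("SELECT 'foo' LIKE ALL ('%foo%', '%bar%')").show()  // false
```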
### Why are the changes needed?
To smoothly migrate SQLs to Spark SQL.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Unit test.
Closes#27477 from wangyum/SPARK-30724.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
Give users a more friendly warning message/migration guide for the deprecated Scala UDF.
### Why are the changes needed?
Users cannot distinguish the function signatures of typed and untyped Scala UDFs. Instead, we should tell users directly what to do.
### Does this PR introduce any user-facing change?
No, it's newly added in Spark 3.0.
### How was this patch tested?
Pass Jenkins.
Closes#28311 from Ngone51/update_udf_doc.
Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Use `dagScheduler.taskSetFailed` to abort a barrier stage instead of throwing exception within `resourceOffers`.
### Why are the changes needed?
Any non-fatal exception thrown within the Spark RPC framework can be swallowed:
100fc58da5/core/src/main/scala/org/apache/spark/rpc/netty/Inbox.scala (L202-L211)
The method `TaskSchedulerImpl.resourceOffers` is also within the scope of the Spark RPC framework. Thus, throwing an exception inside `resourceOffers` won't fail the application.
As a result, if a barrier stage fails the require check at `require(addressesWithDescs.size == taskSet.numTasks, ...)`, it will fail the check again and again until all tasks from the `TaskSetManager` are dequeued. But since the barrier stage isn't really executed, the application will hang.
The issue can be reproduced by the following test:
```scala
initLocalClusterSparkContext(2)
val rdd0 = sc.parallelize(Seq(0, 1, 2, 3), 2)
val dep = new OneToOneDependency[Int](rdd0)
val rdd = new MyRDD(sc, 2, List(dep), Seq(Seq("executor_h_0"),Seq("executor_h_0")))
rdd.barrier().mapPartitions { iter =>
BarrierTaskContext.get().barrier()
iter
}.collect()
```
### Does this PR introduce any user-facing change?
Yes, application hang previously but fail-fast after this fix.
### How was this patch tested?
Added a regression test.
Closes#28257 from Ngone51/fix_barrier_abort.
Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This patch adds some log messages to log elapsed time for "compact" operation in FileStreamSourceLog and FileStreamSinkLog (added in CompactibleFileStreamLog) to help investigating the mysterious latency spike during the batch run.
### Why are the changes needed?
Tracking latency is a critical aspect of streaming queries. While the "compact" operation may bring nontrivial latency (it's even synchronous, adding all of its latency to the batch run), it's not measured and end users have to guess.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
N/A for UT. Manual test with streaming query using file source & file sink.
> grep "for compact batch" <driver log>
```
...
20/02/20 19:27:36 WARN FileStreamSinkLog: Compacting took 24473 ms (load: 14185 ms, write: 10288 ms) for compact batch 21359
20/02/20 19:27:39 WARN FileStreamSinkLog: Loaded 1068000 entries (397985432 bytes in memory), and wrote 1068000 entries for compact batch 21359
20/02/20 19:29:52 WARN FileStreamSourceLog: Compacting took 3777 ms (load: 1524 ms, write: 2253 ms) for compact batch 21369
20/02/20 19:29:52 WARN FileStreamSourceLog: Loaded 229477 entries (68970112 bytes in memory), and wrote 229477 entries for compact batch 21369
20/02/20 19:30:17 WARN FileStreamSinkLog: Compacting took 24183 ms (load: 12992 ms, write: 11191 ms) for compact batch 21369
20/02/20 19:30:20 WARN FileStreamSinkLog: Loaded 1068500 entries (398171880 bytes in memory), and wrote 1068500 entries for compact batch 21369
...
```
![Screen Shot 2020-02-21 at 12 34 22 PM](https://user-images.githubusercontent.com/1317309/75002142-c6830100-54a6-11ea-8da6-17afb056653b.png)
These messages explain why the operation duration peaks every 10 batches, which is the compact interval. Latency from addBatch heavily increases at each peak, which DOES NOT mean it takes more time to write outputs, but we have no way to know that if such messages are not present.
NOTE: The output may be a bit different from the code, as it may be changed a bit during the review phase.
Closes#27557 from HeartSaVioR/SPARK-30804.
Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR aims to upgrade Genjavadoc to 0.16.
### Why are the changes needed?
Although we skipped Scala 2.12.11, this brings 2.12.11 official support and better 2.12.12 compatibility.
- https://github.com/lightbend/genjavadoc/commits/v0.16
### Does this PR introduce any user-facing change?
No. (The generated doc is the same)
### How was this patch tested?
Build with 0.15 and 0.16.
```
$ SKIP_PYTHONDOC=1 SKIP_RDOC=1 SKIP_SQLDOC=1 jekyll build
```
Compare the result. The generated doc is identical.
```
$ diff -r _site_0.15 _site_0.16 | grep -v '^diff -r' | grep -v 'Generated by javadoc' | sort | uniq
---
5c5
```
Closes#28321 from dongjoon-hyun/SPARK-31547.
Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
### What changes were proposed in this pull request?
1. Modified `ParquetFilters.valueCanMakeFilterOn()` to accept filters with `java.time.LocalDate` attributes.
2. Modified `ParquetFilters.dateToDays()` to support both `java.sql.Date` and `java.time.LocalDate` in conversions to days (see the sketch below).
3. Add implicit conversion from `LocalDate` to `Expression` (`Literal`).
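A rough sketch of what item 2 amounts to (treat this as an illustration, not the actual `ParquetFilters` code):
```scala
import java.sql.Date
import java.time.LocalDate
import org.apache.spark.sql.catalyst.util.DateTimeUtils

// Accept both date representations when converting a pushed-down filter
// value to epoch days.
def dateToDays(date: Any): Int = date match {
  case d: Date => DateTimeUtils.fromJavaDate(d)
  case ld: LocalDate => DateTimeUtils.localDateToDays(ld)
}
```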
### Why are the changes needed?
To support pushed down filters with `java.time.LocalDate` attributes. Before the changes, date filters are not pushed down to Parquet datasource when `spark.sql.datetime.java8API.enabled` is `true`.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Added a test to `ParquetFilterSuite`
Closes#28259 from MaxGekk/parquet-filter-java8-date-time.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR intends to add a new test suite for `ExpressionInfo`. Major changes are as follows;
- Added a new test suite named `ExpressionInfoSuite`
- To improve test coverage, added a test for error handling in `ExpressionInfoSuite`
- Moved the `ExpressionInfo`-related tests from `UDFSuite` to `ExpressionInfoSuite`
- Moved the related tests from `SQLQuerySuite` to `ExpressionInfoSuite`
- Added a comment in `ExpressionInfoSuite` (followup of https://github.com/apache/spark/pull/28224)
### Why are the changes needed?
To improve test suites/coverage.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Added tests.
Closes#28308 from maropu/SPARK-31526.
Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
The HiveClient instance is cross-session, so the following configurations, which are defined in HiveUtils and used to create it, should be considered static:
1. spark.sql.hive.metastore.version - used to determine the hive version in Spark
2. spark.sql.hive.metastore.jars - hive metastore related jars location which is used by spark to create hive client
3. spark.sql.hive.metastore.sharedPrefixes and spark.sql.hive.metastore.barrierPrefixes - package names of classes that are shared or separated between SparkContextLoader and hive client class loader
These are used only once, when creating the Hive metastore client. They should be static in SQLConf so that they are retrieved correctly, and we should avoid them being changed by users with the SET/RESET commands.
As for spark.sql.hive.version (a fake of spark.sql.hive.metastore.version), it is used by the JDBC/Thrift client for backward compatibility.
### Why are the changes needed?
bugfix, these configurations should not be changed.
### Does this PR introduce any user-facing change?
Yes, the following set of configs is not allowed to be changed.
```
Seq("spark.sql.hive.metastore.version ",
"spark.sql.hive.metastore.jars",
"spark.sql.hive.metastore.sharedPrefixes",
"spark.sql.hive.metastore.barrierPrefixes")
```
### How was this patch tested?
add unit test
Closes#28302 from yaooqinn/SPARK-31522.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Add migration guide for removed accumulator v1 APIs.
### Why are the changes needed?
Provide better guidance for users' migration.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Pass Jenkins.
Closes#28309 from Ngone51/SPARK-16775-migration-guide.
Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Need to address a few more comments
### Why are the changes needed?
Fix a few problems
### Does this PR introduce any user-facing change?
Yes
### How was this patch tested?
Manually build and check
Closes#28306 from huaxingao/literal-folllowup.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
Override the canonicalized fields with respect to the result of `needsTimeZone`.
### Why are the changes needed?
The current approach breaks semantic equality of two cast expressions that are not related to datetime types. If we don't need `timeZone` information to cast the `from` type to the `to` type, then the timeZoneId should not influence the canonicalized result.
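A sketch of the intended equivalence (class and method names are from Catalyst; whether the assertion holds depends on this change):
```scala
import org.apache.spark.sql.catalyst.expressions.{Cast, Literal}
import org.apache.spark.sql.types.IntegerType

// A string-to-int cast never needs time zone information, so two such casts
// with different time zone ids should canonicalize to the same expression.
val c1 = Cast(Literal("123"), IntegerType, Some("UTC"))
val c2 = Cast(Literal("123"), IntegerType, Some("America/Los_Angeles"))
assert(c1.semanticEquals(c2))  // holds after this change
```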
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
New UT added.
Closes#28288 from xuanyuanking/SPARK-31515.
Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>