Commit graph

29977 commits

Author SHA1 Message Date
Takeshi Yamamuro 5c67d0c8f7 [SPARK-35293][SQL][TESTS] Use the newer dsdgen for TPCDSQueryTestSuite
### What changes were proposed in this pull request?

This PR intends to replace `maropu/spark-tpcds-datagen` with `databricks/tpcds-kit` for using a newer dsdgen and update the golden files in `tpcds-query-results`.

### Why are the changes needed?

For better testing.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

GA passed.

Closes #32420 from maropu/UseTpcdsKit.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2021-05-06 15:25:46 +09:00
Dongjoon Hyun 19661f6ae2 [SPARK-35325][SQL][TESTS] Add nested column ORC encryption test case
### What changes were proposed in this pull request?

This PR aims to enrich ORC encryption test coverage for nested columns.

### Why are the changes needed?

This will provide a test coverage for this feature.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs with the newly added test case.

Closes #32449 from dongjoon-hyun/SPARK-35325.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-05-05 22:29:54 -07:00
Dongjoon Hyun a0c76a8755 [SPARK-35319][K8S][BUILD] Upgrade K8s client to 5.3.1
### What changes were proposed in this pull request?

This PR aims to upgrade K8s client to 5.3.1.

### Why are the changes needed?

This will bring the latest bug fixes.
- https://github.com/fabric8io/kubernetes-client/releases/tag/v5.3.1

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

The K8s integration tests were manually run as follows.

```
KubernetesSuite:
- Run SparkPi with no resources
- Run SparkPi with a very long application name.
- Use SparkLauncher.NO_RESOURCE
- Run SparkPi with a master URL without a scheme.
- Run SparkPi with an argument.
- Run SparkPi with custom labels, annotations, and environment variables.
- All pods have the same service account by default
- Run extraJVMOptions check on driver
- Run SparkRemoteFileTest using a remote data file
- Verify logging configuration is picked from the provided SPARK_CONF_DIR/log4j.properties
- Run SparkPi with env and mount secrets.
- Run PySpark on simple pi.py example
- Run PySpark to test a pyfiles example
- Run PySpark with memory customization
- Run in client mode.
- Start pod creation from template
- PVs with local storage
- Launcher client dependencies
- SPARK-33615: Launcher client archives
- SPARK-33748: Launcher python client respecting PYSPARK_PYTHON
- SPARK-33748: Launcher python client respecting spark.pyspark.python and spark.pyspark.driver.python
- Launcher python client dependencies using a zip file
- Test basic decommissioning
- Test basic decommissioning with shuffle cleanup
- Test decommissioning with dynamic allocation & shuffle cleanups
- Test decommissioning timeouts
- Run SparkR on simple dataframe.R example
Run completed in 18 minutes, 33 seconds.
Total number of tests run: 27
Suites: completed 2, aborted 0
Tests: succeeded 27, failed 0, canceled 0, ignored 0, pending 0
All tests passed.
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary for Spark Project Parent POM 3.2.0-SNAPSHOT:
[INFO]
[INFO] Spark Project Parent POM ........................... SUCCESS [  3.959 s]
[INFO] Spark Project Tags ................................. SUCCESS [  7.830 s]
[INFO] Spark Project Local DB ............................. SUCCESS [  3.457 s]
[INFO] Spark Project Networking ........................... SUCCESS [  5.496 s]
[INFO] Spark Project Shuffle Streaming Service ............ SUCCESS [  3.239 s]
[INFO] Spark Project Unsafe ............................... SUCCESS [  9.006 s]
[INFO] Spark Project Launcher ............................. SUCCESS [  2.422 s]
[INFO] Spark Project Core ................................. SUCCESS [02:17 min]
[INFO] Spark Project Kubernetes Integration Tests ......... SUCCESS [21:05 min]
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  23:59 min
[INFO] Finished at: 2021-05-05T11:59:19-07:00
[INFO] ------------------------------------------------------------------------
```

Closes #32443 from dongjoon-hyun/SPARK-35319.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-05-05 19:50:37 -07:00
Dongjoon Hyun 0126924568 [SPARK-35323][BUILD] Remove unused libraries from LICENSE-binary
### What changes were proposed in this pull request?

This PR removes unused libraries from `LICENSE-binary` file.

### Why are the changes needed?

SPARK-33212 removes many `Hadoop 3`-only transitive libraries like `dnsjava-2.1.7.jar`. We can simplify the Apache Spark LICENSE file by removing them.

### Does this PR introduce _any_ user-facing change?

Yes, but this is only a LICENSE file change.

### How was this patch tested?

Manual.

Closes #32445 from dongjoon-hyun/SPARK-35323.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-05-05 18:27:56 -07:00
Yingyi Bu 7970318296 [SPARK-35155][SQL] Add rule id pruning to Analyzer rules
### What changes were proposed in this pull request?

Added rule id based pruning to Analyzer rules in fixed point batches:

- org.apache.spark.sql.catalyst.analysis.Analyzer$AddMetadataColumns
- org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractGenerator
- org.apache.spark.sql.catalyst.analysis.Analyzer$ExtractWindowExpressions
- org.apache.spark.sql.catalyst.analysis.Analyzer$GlobalAggregates
- org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggAliasInGroupBy
- org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAggregateFunctions
- org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveAliases
- org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveBinaryArithmetic
- org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveDeserializer
- org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveEncodersInUDF
- org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions
- org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGenerate
- org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveGroupingAnalytics
- org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveInsertInto
- org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveMissingReferences
- org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveNewInstance
- org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOrdinalInOrderByAndGroupBy
- org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveOutputRelation
- org.apache.spark.sql.catalyst.analysis.Analyzer$ResolvePivot
- org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRandomSeed
- org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveReferences
- org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveRelations
- org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveSubqueryColumnAliases
- org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveTables
- org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveTempViews
- org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUpCast
- org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveUserSpecifiedColumns
- org.apache.spark.sql.catalyst.analysis.Analyzer$WindowsSubstitution
- org.apache.spark.sql.catalyst.analysis.DeduplicateRelations
- org.apache.spark.sql.catalyst.analysis.EliminateSubqueryAliases
- org.apache.spark.sql.catalyst.analysis.EliminateUnions
- org.apache.spark.sql.catalyst.analysis.ResolveCreateNamedStruct
- org.apache.spark.sql.catalyst.analysis.ResolveHints$ResolveCoalesceHints
- org.apache.spark.sql.catalyst.analysis.ResolveHints$ResolveJoinStrategyHints
- org.apache.spark.sql.catalyst.analysis.ResolveInlineTables
- org.apache.spark.sql.catalyst.analysis.ResolveLambdaVariables
- org.apache.spark.sql.catalyst.analysis.ResolveTimeZone
- org.apache.spark.sql.catalyst.analysis.ResolveUnion
- org.apache.spark.sql.catalyst.analysis.SubstituteUnresolvedOrdinals
- org.apache.spark.sql.catalyst.analysis.TimeWindowing

Subsequent PRs will add tree-bit-based pruning to those rules. The big PR was split to reduce review load.
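
For context, here is a minimal standalone sketch of what rule-id pruning means. This only models the idea; Spark's real mechanism lives in Catalyst's `TreeNode`/`AnalysisHelper` plumbing, and the names below are illustrative:

```scala
// Standalone model of the idea (not Spark's actual TreeNode API): a node
// remembers which rule ids were already applied to it without effect, so a
// fixed-point batch can skip re-traversing that subtree with the same rule.
object RuleIdPruningSketch {
  final case class Node(
      value: Int,
      children: Seq[Node] = Nil,
      ineffectiveRuleIds: Set[Int] = Set.empty)

  def transformWithPruning(node: Node, ruleId: Int)(rule: PartialFunction[Node, Node]): Node = {
    if (node.ineffectiveRuleIds.contains(ruleId)) {
      node // pruned: this rule was already a no-op on this subtree
    } else {
      val applied = rule.applyOrElse(node, (n: Node) => n)
      val rewritten = applied.copy(
        children = applied.children.map(transformWithPruning(_, ruleId)(rule)))
      if (rewritten == node) {
        // Nothing changed: mark the node so later passes skip it for this rule.
        node.copy(ineffectiveRuleIds = node.ineffectiveRuleIds + ruleId)
      } else {
        rewritten
      }
    }
  }

  def main(args: Array[String]): Unit = {
    val tree = Node(1, Seq(Node(2), Node(3)))
    val incrementEven: PartialFunction[Node, Node] = {
      case n if n.value % 2 == 0 => n.copy(value = n.value + 1)
    }
    val once = transformWithPruning(tree, ruleId = 42)(incrementEven)
    // A second pass skips the subtrees that the first pass marked as ineffective.
    println(transformWithPruning(once, ruleId = 42)(incrementEven))
  }
}
```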

### Why are the changes needed?

Reduce the number of tree traversals and hence improve the query compilation latency.

### How was this patch tested?

Existing tests.

Closes #32425 from sigmod/analyzer.

Authored-by: Yingyi Bu <yingyi.bu@databricks.com>
Signed-off-by: Gengliang Wang <ltnwgl@gmail.com>
2021-05-06 08:55:29 +08:00
Chao Sun 4fe4b65d9e [SPARK-35315][TESTS] Keep benchmark result consistent between spark-submit and SBT
### What changes were proposed in this pull request?

Set `IS_TESTING` to true in `BenchmarkBase`, before running benchmarks.
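
A minimal sketch of the idea, assuming `IS_TESTING` is backed by the `spark.testing` system property (as Spark's `Utils.isTesting` checks); the exact placement inside `BenchmarkBase` is illustrative:

```scala
// Sketch only: flip the testing flag before any benchmark code runs so that
// spark-submit and SBT invocations observe the same test-only behavior
// (e.g. the spark.sql.codegen.factoryMode handling mentioned below).
object BenchmarkBaseSketch {
  def main(args: Array[String]): Unit = {
    System.setProperty("spark.testing", "true") // assumed key behind IS_TESTING
    runBenchmarkSuite(args)
  }

  private def runBenchmarkSuite(args: Array[String]): Unit = {
    // The real BenchmarkBase would dispatch to the concrete benchmark here.
    println(s"running with spark.testing=${System.getProperty("spark.testing")}")
  }
}
```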

### Why are the changes needed?

Currently benchmarks can be run in two ways: via `spark-submit` or via an SBT command. However, in the former case Spark misses some properties such as `IS_TESTING`, which is necessary to turn certain behavior on or off, like codegen (`spark.sql.codegen.factoryMode`). Therefore, the results could differ between the two. In addition, the benchmark GitHub workflow uses the spark-submit approach.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

N/A

Closes #32440 from sunchao/SPARK-35315.

Authored-by: Chao Sun <sunchao@apache.org>
Signed-off-by: Yuming Wang <yumwang@ebay.com>
2021-05-05 18:30:51 +08:00
Yijia Cui bbdbe0f734 [SPARK-34854][SQL][SS] Expose source metrics via progress report and add Kafka use-case to report delay
### What changes were proposed in this pull request?
This pull request proposes a new API for streaming sources to signal that they can report metrics, and adds a use case in which the Kafka micro-batch stream reports how many offsets the current offset falls behind the latest.

A public interface is added.

`metrics`: returns the metrics reported by the streaming source for the given offset.
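
As a rough illustration of the shape of such an API (the trait and method names below are illustrative, not the exact interface added by this PR):

```scala
// Illustrative only: a streaming source that can report custom metrics for the
// latest consumed offset, to be surfaced in the streaming progress report.
trait SupportsSourceMetricsSketch {
  // Returns source-specific metrics as string key/value pairs, e.g. how far
  // the consumed offset lags behind the latest available offset.
  def metrics(latestConsumedOffset: Option[Long]): Map[String, String]
}

// Toy Kafka-like source: reports the number of offsets it is behind.
class ToyKafkaSource(latestAvailableOffset: () => Long) extends SupportsSourceMetricsSketch {
  override def metrics(latestConsumedOffset: Option[Long]): Map[String, String] = {
    val behind = latestConsumedOffset.fold(0L)(latestAvailableOffset() - _)
    Map("offsetsBehindLatest" -> behind.toString)
  }
}
```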

### Why are the changes needed?
The new API can expose any custom metrics for the "current" offset of a streaming source. Different from #31398, this PR makes the metrics available to users through the progress report, not through the Spark UI. A use case is that people want to know how far the current offset falls behind the latest offset.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Unit tests for the Kafka micro-batch source v2 are added to test the Kafka use case.

Closes #31944 from yijiacui-db/SPARK-34297.

Authored-by: Yijia Cui <yijia.cui@databricks.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
2021-05-05 17:26:07 +09:00
dsolow f550e03b96 [SPARK-34794][SQL] Fix lambda variable name issues in nested DataFrame functions
### What changes were proposed in this pull request?

To fix lambda variable name issues in nested DataFrame functions, this PR modifies code to use a global counter for `LambdaVariables` names created by higher order functions.

This is the rework of #31887. Closes #31887.

### Why are the changes needed?

This moves away from the current hard-coded variable names, which break on nested function calls. There is currently a bug where nested transforms in particular fail (the inner variable shadows the outer variable).

For this query:
```
val df = Seq(
    (Seq(1,2,3), Seq("a", "b", "c"))
).toDF("numbers", "letters")

df.select(
    f.flatten(
        f.transform(
            $"numbers",
            (number: Column) => { f.transform(
                $"letters",
                (letter: Column) => { f.struct(
                    number.as("number"),
                    letter.as("letter")
                ) }
            ) }
        )
    ).as("zipped")
).show(10, false)
```
This is the current (incorrect) output:
```
+------------------------------------------------------------------------+
|zipped                                                                  |
+------------------------------------------------------------------------+
|[{a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}, {a, a}, {b, b}, {c, c}]|
+------------------------------------------------------------------------+
```
And this is the correct output after fix:
```
+------------------------------------------------------------------------+
|zipped                                                                  |
+------------------------------------------------------------------------+
|[{1, a}, {1, b}, {1, c}, {2, a}, {2, b}, {2, c}, {3, a}, {3, b}, {3, c}]|
+------------------------------------------------------------------------+
```

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Added the new test in `DataFrameFunctionsSuite`.

Closes #32424 from maropu/pr31887.

Lead-authored-by: dsolow <dsolow@sayari.com>
Co-authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Co-authored-by: dmsolow <dsolow@sayarianalytics.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2021-05-05 12:46:13 +09:00
Yingyi Bu 7fd3f8f9ec [SPARK-35294][SQL] Add tree traversal pruning in rules with dedicated files under optimizer
### What changes were proposed in this pull request?

Added the following TreePattern enums:
- CREATE_NAMED_STRUCT
- EXTRACT_VALUE
- JSON_TO_STRUCT
- OUTER_REFERENCE
- AGGREGATE
- LOCAL_RELATION
- EXCEPT
- LIMIT
- WINDOW

Used them in the following rules:
- DecorrelateInnerQuery
- LimitPushDownThroughWindow
- OptimizeCsvJsonExprs
- PropagateEmptyRelation
- PullOutGroupingExpressions
- PushLeftSemiLeftAntiThroughJoin
- ReplaceExceptWithFilter
- RewriteDistinctAggregates
- SimplifyConditionalsInPredicate
- UnwrapCastInBinaryComparison
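
For context, a standalone sketch of the pattern-bit idea. This only models the concept; Spark's real `TreePattern` bitset lives on `TreeNode`, and the names here are illustrative:

```scala
// Standalone model of tree-pattern pruning (not Spark's TreePattern classes):
// every node caches which pattern tags appear anywhere in its subtree, so a
// rule that only rewrites, say, LIMIT-over-WINDOW can skip other subtrees.
object TreePatternSketch {
  sealed trait Pattern
  case object Limit extends Pattern
  case object Window extends Pattern

  final case class Node(name: String, patterns: Set[Pattern], children: Seq[Node] = Nil) {
    // Patterns present in this node or any descendant, computed once per node.
    lazy val subtreePatterns: Set[Pattern] = patterns ++ children.flatMap(_.subtreePatterns)
    def containsPattern(p: Pattern): Boolean = subtreePatterns.contains(p)
  }

  // A rule like LimitPushDownThroughWindow only needs to visit plans that
  // actually contain both LIMIT and WINDOW somewhere below the root.
  def applyRule(root: Node): Node =
    if (root.containsPattern(Limit) && root.containsPattern(Window)) {
      root // ... the real rewrite would happen here ...
    } else {
      root // pruned: no matching pattern in this subtree, skip the traversal
    }
}
```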

### Why are the changes needed?

Reduce the number of tree traversals and hence improve the query compilation latency.

### How was this patch tested?

Existing tests.

Closes #32421 from sigmod/opt.

Authored-by: Yingyi Bu <yingyi.bu@databricks.com>
Signed-off-by: Gengliang Wang <ltnwgl@gmail.com>
2021-05-04 19:17:22 +08:00
byungsoo 9b387a1718 [SPARK-35308][TESTS] Fix bug in SPARK-35266 that creates benchmark files in invalid path with wrong name
### What changes were proposed in this pull request?
This PR fixes a bug in [SPARK-35266](https://issues.apache.org/jira/browse/SPARK-35266) that creates benchmark files in the invalid path with the wrong name.
e.g. For `BLASBenchmark`,
- AS-IS: Creates `benchmarksBLASBenchmark-results.txt` in `{SPARK_HOME}/mllib-local/`
- TO-BE: Creates `BLASBenchmark-results.txt` in `{SPARK_HOME}/mllib-local/benchmarks/`
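
A hedged sketch of the kind of fix involved, assuming the wrong name came from concatenating the directory and file name without a path separator (the helper below is illustrative, not the actual patch):

```scala
import java.io.File

object BenchmarkPathSketch {
  // Joining with java.io.File keeps the separator, producing
  // "<module>/benchmarks/BLASBenchmark-results.txt" rather than
  // "<module>/benchmarksBLASBenchmark-results.txt".
  def resultsFile(moduleDir: File, benchmarkName: String): File =
    new File(new File(moduleDir, "benchmarks"), s"$benchmarkName-results.txt")
}
```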

### Why are the changes needed?
As you can see in the above example, new benchmark files cannot be created as intended due to this bug.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
After building Spark, manually tested with the following command:
```
SPARK_GENERATE_BENCHMARK_FILES=1 bin/spark-submit --class \
    org.apache.spark.benchmark.Benchmarks --jars \
    "`find . -name '*-SNAPSHOT-tests.jar' -o -name '*avro*-SNAPSHOT.jar' | paste -sd ',' -`" \
    "`find . -name 'spark-core*-SNAPSHOT-tests.jar'`" \
    "org.apache.spark.ml.linalg.BLASBenchmark"
```
It successfully generated the benchmark files as intended (`BLASBenchmark-results.txt` in `{SPARK_HOME}/mllib-local/benchmarks/`).

Closes #32432 from byungsoo-oh/SPARK-35308.

Lead-authored-by: byungsoo <byungsoo@byungsoo-pc.tn.corp.samsungelectronics.net>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-05-04 19:40:57 +09:00
HyukjinKwon a2927cb28b [SPARK-35302][INFRA] Benchmark workflow should create new files for new benchmarks
### What changes were proposed in this pull request?

Currently, it fails at `git diff --name-only` when new benchmarks are added; see https://github.com/HyukjinKwon/spark/actions/runs/808870999

We should include untracked files (new benchmark result files) in the upload so developers can download the results.

### Why are the changes needed?

So the new benchmark results can be added and uploaded.

### Does this PR introduce _any_ user-facing change?

No, dev-only

### How was this patch tested?

Tested at:

https://github.com/HyukjinKwon/spark/actions/runs/808867285

Closes #32428 from HyukjinKwon/include-new-benchmarks.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-05-04 19:02:52 +09:00
Xinrong Meng 5ecb112410 [SPARK-35300][PYTHON][DOCS] Standardize module names in install.rst
### What changes were proposed in this pull request?

Use full names of modules in `install.rst` when specifying dependencies.

### Why are the changes needed?

Using full names makes the documentation clearer.
In addition, it helps `pandas APIs on Spark`, as a new module, become recognized by more people.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual verification.

Closes #32427 from xinrong-databricks/nameDoc.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-05-04 11:02:57 +09:00
Xinrong Meng 120c389b00 [SPARK-34887][PYTHON] Port Koalas dependencies into PySpark
### What changes were proposed in this pull request?

Port Koalas dependencies appropriately to PySpark dependencies.

### Why are the changes needed?

pandas-on-Spark has its own required dependency and optional dependencies.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual test.

Closes #32386 from xinrong-databricks/portDeps.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-05-04 09:04:23 +09:00
garawalid 176218b6b8 [SPARK-35292][PYTHON] Delete redundant parameter in mypy configuration
### What changes were proposed in this pull request?

The parameter **no_implicit_optional** is defined twice in the mypy configuration, at [line 20](https://github.com/apache/spark/blob/master/python/mypy.ini#L20) and line 105.

### Why are the changes needed?

We would like to keep the mypy configuration clean.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

This patch can be tested with `dev/lint-python`

Closes #32418 from garawalid/feature/clean-mypy-config.

Authored-by: garawalid <gwalid94@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-05-04 09:01:34 +09:00
HyukjinKwon 8aaa9e890a [SPARK-35250][SQL][DOCS] Fix duplicated STOP_AT_DELIMITER to SKIP_VALUE at CSV's unescapedQuoteHandling option documentation
### What changes were proposed in this pull request?

This is rather a followup of https://github.com/apache/spark/pull/30518 that should be ported back to `branch-3.1` too.
`STOP_AT_DELIMITER` was mistakenly used twice. The duplicated `STOP_AT_DELIMITER` should be `SKIP_VALUE` in the documentation.

### Why are the changes needed?

To correctly document.

### Does this PR introduce _any_ user-facing change?

Yes, it fixes the user-facing documentation.

### How was this patch tested?

I checked them via running linters.

Closes #32423 from HyukjinKwon/SPARK-35250.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-05-04 08:44:18 +09:00
Tobias Hermann 54e0aa10c8 [MINOR][SS][DOCS] Fix a typo in the documentation of GroupState
### What changes were proposed in this pull request?

Fixing some typos in the documentation comments.

### Why are the changes needed?

To make reading the docs more pleasant.

### Does this PR introduce _any_ user-facing change?

Yes, since the user sees the docs.

### How was this patch tested?

It was not tested, because no code was changed.

Closes #32400 from Dobiasd/patch-1.

Authored-by: Tobias Hermann <editgym@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-05-03 19:35:38 +09:00
byungsoo be6ecb6d19 [SPARK-35266][TESTS] Fix error in BenchmarkBase.scala that occurs when creating benchmark files in non-existent directory
### What changes were proposed in this pull request?
This PR fixes an error in `BenchmarkBase.scala` that occurs when creating a benchmark file in a non-existent directory.
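
A minimal sketch of the likely shape of the fix, assuming the error is simply that the output directory has to exist before the results file is opened (illustrative, not the actual patch):

```scala
import java.io.{File, FileOutputStream}

object BenchmarkOutputSketch {
  // Create the parent directory first; FileOutputStream throws
  // FileNotFoundException if any component of the path is missing.
  def openResultsStream(file: File): FileOutputStream = {
    val dir = file.getParentFile
    if (dir != null && !dir.exists()) dir.mkdirs()
    new FileOutputStream(file)
  }
}
```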

### Why are the changes needed?
When submitting a benchmark job using `org.apache.spark.benchmark.Benchmarks` class with `SPARK_GENERATE_BENCHMARK_FILES=1` option, an exception is raised if the directory where the benchmark file will be generated does not exist.
For more information, please refer to [SPARK-35266](https://issues.apache.org/jira/browse/SPARK-35266).

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
After building Spark, manually tested with the following command:
```
SPARK_GENERATE_BENCHMARK_FILES=1 bin/spark-submit --class \
    org.apache.spark.benchmark.Benchmarks --jars \
    "`find . -name '*-SNAPSHOT-tests.jar' -o -name '*avro*-SNAPSHOT.jar' | paste -sd ',' -`" \
    "`find . -name 'spark-core*-SNAPSHOT-tests.jar'`" \
    "org.apache.spark.ml.linalg.BLASBenchmark"
```
It successfully generated the benchmark result files.

**Why it is sufficient:**
As illustrated in the comments in `Benchmarks.scala`, the command below runs all benchmarks and generates the results:
```
SPARK_GENERATE_BENCHMARK_FILES=1 bin/spark-submit --class \
    org.apache.spark.benchmark.Benchmarks --jars \
    "`find . -name '*-SNAPSHOT-tests.jar' -o -name '*avro*-SNAPSHOT.jar' | paste -sd ',' -`" \
    "`find . -name 'spark-core*-SNAPSHOT-tests.jar'`" \
    "*"
```
Of all the benchmarks (55 in total), only `BLASBenchmark` fails due to this issue with the current code in the master branch. Thus, it is currently sufficient to test `BLASBenchmark` to validate this change.

Closes #32394 from byungsoo-oh/SPARK-35266.

Authored-by: byungsoo <byungsoo@byungsoo-pc.tn.corp.samsungelectronics.net>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-05-03 18:06:06 +09:00
Yikun Jiang 44b7931936 [SPARK-35176][PYTHON] Standardize input validation error type
### What changes were proposed in this pull request?
This PR corrects the exception types raised when function input parameters fail to validate because of a type mismatch.
To make the review convenient, there are 3 commits in this PR:
- Standardize input validation error type on sql
- Standardize input validation error type on ml
- Standardize input validation error type on pandas

### Why are the changes needed?
As the Python exception doc [1] says, `TypeError` is "Raised when an operation or function is applied to an object of inappropriate type.", yet a `ValueError` is raised instead in some PySpark code; this patch fixes those cases.

[1] https://docs.python.org/3/library/exceptions.html#TypeError

Note that this patch only addresses the existing wrong exception types used for input validation; the input validation decorator/framework mentioned in [SPARK-35176](https://issues.apache.org/jira/browse/SPARK-35176) will be submitted in a separate patch.

### Does this PR introduce _any_ user-facing change?
Yes, the code now raises the right `TypeError` instead of `ValueError`.

### How was this patch tested?
Existing test case and UT

Closes #32368 from Yikun/SPARK-35176.

Authored-by: Yikun Jiang <yikunkero@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-05-03 15:34:24 +09:00
Chao Sun 2a8d7ed4bf [SPARK-35281][SQL] StaticInvoke should not apply boxing if return type is primitive
### What changes were proposed in this pull request?

In `StaticInvoke`, when result is nullable, don't box the return value if its type is primitive.

### Why are the changes needed?

It is unnecessary to apply boxing when the method return value is of primitive type, and it would hurt performance a lot if the method is simple. The check is done in `Invoke` but not in `StaticInvoke`.
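
A standalone sketch of the decision being described. This only models the generated-code choice and is not Spark's actual codegen:

```scala
// Illustration: a boxed temporary plus a null check is only needed when the
// declared Java return type is a reference type; a primitive can never be null.
object StaticInvokeCodegenSketch {
  private val primitives =
    Set("boolean", "byte", "short", "int", "long", "float", "double")

  def resultAssignment(javaType: String, call: String, nullable: Boolean): String =
    if (nullable && !primitives.contains(javaType)) {
      // Reference type that may be null: go through a boxed temporary.
      s"Object tmp = $call; boolean isNull = (tmp == null); $javaType value = ($javaType) tmp;"
    } else {
      // Primitive (or provably non-null) result: assign directly, no boxing.
      s"$javaType value = $call; boolean isNull = false;"
    }
}
```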

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Added a UT.

Closes #32416 from sunchao/SPARK-35281.

Authored-by: Chao Sun <sunchao@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-05-03 14:55:35 +09:00
Max Gekk 335f00b19b [SPARK-35285][SQL] Parse ANSI interval types in SQL schema
### What changes were proposed in this pull request?
1. Extend Spark SQL parser to support parsing of:
    - `INTERVAL YEAR TO MONTH` to `YearMonthIntervalType`
    - `INTERVAL DAY TO SECOND` to `DayTimeIntervalType`
2. Assign new names to the ANSI interval types according to the SQL standard to be able to parse the names back by Spark SQL parser. Override the `typeName()` name of `YearMonthIntervalType`/`DayTimeIntervalType`.

### Why are the changes needed?
To be able to use new ANSI interval types in SQL. The SQL standard requires the types to be defined according to the rules:
```
<interval type> ::= INTERVAL <interval qualifier>
<interval qualifier> ::= <start field> TO <end field> | <single datetime field>
<start field> ::= <non-second primary datetime field> [ <left paren> <interval leading field precision> <right paren> ]
<end field> ::= <non-second primary datetime field> | SECOND [ <left paren> <interval fractional seconds precision> <right paren> ]
<primary datetime field> ::= <non-second primary datetime field> | SECOND
<non-second primary datetime field> ::= YEAR | MONTH | DAY | HOUR | MINUTE
<interval fractional seconds precision> ::= <unsigned integer>
<interval leading field precision> ::= <unsigned integer>
```
Currently, Spark SQL supports only `YEAR TO MONTH` and `DAY TO SECOND` as `<interval qualifier>`.
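
A hedged usage sketch, assuming the new type names round-trip through Spark's public DDL parsing entry point `DataType.fromDDL`; exact behavior may differ:

```scala
import org.apache.spark.sql.types.DataType

object AnsiIntervalTypeNameSketch {
  def main(args: Array[String]): Unit = {
    // With this change, the ANSI names should be parseable back into the
    // corresponding Catalyst types (assumed from the description above).
    println(DataType.fromDDL("INTERVAL YEAR TO MONTH")) // expected: YearMonthIntervalType
    println(DataType.fromDDL("INTERVAL DAY TO SECOND")) // expected: DayTimeIntervalType
  }
}
```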

### Does this PR introduce _any_ user-facing change?
Should not, since the types have not been released yet.

### How was this patch tested?
By running the affected tests such as:
```
$ build/sbt "sql/testOnly *SQLQueryTestSuite -- -z interval.sql"
$ build/sbt "sql/testOnly *SQLQueryTestSuite -- -z datetime.sql"
$ build/sbt "test:testOnly *ExpressionTypeCheckingSuite"
$ build/sbt "sql/testOnly *SQLQueryTestSuite -- -z windowFrameCoercion.sql"
$ build/sbt "sql/testOnly *SQLQueryTestSuite -- -z literals.sql"
```

Closes #32409 from MaxGekk/parse-ansi-interval-types.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-05-03 13:50:35 +09:00
Takeshi Yamamuro cd689c942c [SPARK-35192][SQL][TESTS] Port minimal TPC-DS datagen code from databricks/spark-sql-perf
### What changes were proposed in this pull request?

This PR proposes to port minimal code to generate TPC-DS data from [databricks/spark-sql-perf](https://github.com/databricks/spark-sql-perf). The classes in a new class file `tpcdsDatagen.scala` are basically copied from the `databricks/spark-sql-perf` codebase.
Note that I've modified them a bit to follow the Spark code style and removed unnecessary parts from them.

The code authors of these classes are:
juliuszsompolski
npoggi
wangyum

### Why are the changes needed?

We frequently use TPCDS data now for benchmarks/tests, but the classes for the TPCDS schemas of datagen and benchmarks/tests are managed separately, e.g.,
 - https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/TPCDSBase.scala
 - https://github.com/databricks/spark-sql-perf/blob/master/src/main/scala/com/databricks/spark/sql/perf/tpcds/TPCDSTables.scala

I think this causes some inconvenience, e.g., we need to update both files in the separate repositories if we update the TPCDS schema (#32037). So, it would be useful for the Spark codebase to generate the data by referring to the same schema definition.

### Does this PR introduce _any_ user-facing change?

dev only.

### How was this patch tested?

Manually checked and GA passed.

Closes #32243 from maropu/tpcdsDatagen.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2021-05-03 12:04:42 +09:00
Angerszhuuuu caa46ce0b6 [SPARK-35112][SQL] Support Cast string to day-second interval
### What changes were proposed in this pull request?
Support casting strings to day-second intervals.

### Why are the changes needed?
Users can cast day-second interval string to DayTimeIntervalType.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added UT

Closes #32271 from AngersZhuuuu/SPARK-35112.

Lead-authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Co-authored-by: AngersZhuuuu <angers.zhu@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-05-02 09:28:51 +03:00
Peter Toth cfc0495f9c [SPARK-34581][SQL] Don't optimize out grouping expressions from aggregate expressions without aggregate function
### What changes were proposed in this pull request?
This PR adds a new rule `PullOutGroupingExpressions` to pull out complex grouping expressions to a `Project` node under an `Aggregate`. These expressions are then referenced in both grouping expressions and aggregate expressions without aggregate functions to ensure that optimization rules don't change the aggregate expressions to invalid ones that no longer refer to any grouping expressions.

### Why are the changes needed?
If the aggregate expressions (without aggregate functions) in an `Aggregate` node are complex, then the `Optimizer` can optimize grouping expressions out of them, making the aggregate expressions invalid.

Here is a simple example:
```
SELECT not(t.id IS NULL) , count(*)
FROM t
GROUP BY t.id IS NULL
```
In this case the `BooleanSimplification` rule does this:
```
=== Applying Rule org.apache.spark.sql.catalyst.optimizer.BooleanSimplification ===
!Aggregate [isnull(id#222)], [NOT isnull(id#222) AS (NOT (id IS NULL))#226, count(1) AS c#224L]   Aggregate [isnull(id#222)], [isnotnull(id#222) AS (NOT (id IS NULL))#226, count(1) AS c#224L]
 +- Project [value#219 AS id#222]                                                                 +- Project [value#219 AS id#222]
    +- LocalRelation [value#219]                                                                     +- LocalRelation [value#219]
```
where `NOT isnull(id#222)` is optimized to `isnotnull(id#222)` and so it no longer refers to any grouping expression.

Before this PR:
```
== Optimized Logical Plan ==
Aggregate [isnull(id#222)], [isnotnull(id#222) AS (NOT (id IS NULL))#234, count(1) AS c#232L]
+- Project [value#219 AS id#222]
   +- LocalRelation [value#219]
```
and running the query throws an error:
```
Couldn't find id#222 in [isnull(id#222)#230,count(1)#226L]
java.lang.IllegalStateException: Couldn't find id#222 in [isnull(id#222)#230,count(1)#226L]
```

After this PR:
```
== Optimized Logical Plan ==
Aggregate [_groupingexpression#233], [NOT _groupingexpression#233 AS (NOT (id IS NULL))#230, count(1) AS c#228L]
+- Project [isnull(value#219) AS _groupingexpression#233]
   +- LocalRelation [value#219]
```
and the query works.

### Does this PR introduce _any_ user-facing change?
Yes, the query works.

### How was this patch tested?
Added new UT.

Closes #32396 from peter-toth/SPARK-34581-keep-grouping-expressions-2.

Authored-by: Peter Toth <peter.toth@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-05-02 05:53:09 +00:00
Liang-Chi Hsieh 6ce1b161e9 [SPARK-35278][SQL] Invoke should find the method with correct number of parameters
### What changes were proposed in this pull request?

This patch fixes `Invoke` expression when the target object has more than one method with the given method name.

### Why are the changes needed?

`Invoke` finds the method on the target object with the given method name. If there is more than one method with that name, it is currently nondeterministic which one will be used. We should also check the number of parameters when finding the method.
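
An illustrative sketch (not the actual `Invoke` code) of disambiguating an overloaded method by arity with Java reflection:

```scala
import java.lang.reflect.Method

object MethodLookupSketch {
  // Match on both the name and the number of arguments; matching on the name
  // alone is ambiguous when the target method is overloaded.
  def findMethod(cls: Class[_], name: String, numArgs: Int): Option[Method] =
    cls.getMethods.find(m => m.getName == name && m.getParameterCount == numArgs)
}
```

For example, `findMethod(classOf[String], "substring", 2)` picks `substring(int, int)` rather than `substring(int)`.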

### Does this PR introduce _any_ user-facing change?

Yes, this fixes a bug when using `Invoke` on an object that has more than one method with the given method name.

### How was this patch tested?

Unit test.

Closes #32404 from viirya/verify-invoke-param-len.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-05-01 10:20:46 -07:00
Yuming Wang 72e238a790 [SPARK-35273][SQL] CombineFilters support non-deterministic expressions
### What changes were proposed in this pull request?

This pr makes `CombineFilters` support non-deterministic expressions. For example:
```sql
spark.sql("CREATE TABLE t1(id INT, dt STRING) using parquet PARTITIONED BY (dt)")
spark.sql("CREATE VIEW v1 AS SELECT * FROM t1 WHERE dt NOT IN ('2020-01-01', '2021-01-01')")
spark.sql("SELECT * FROM v1 WHERE dt = '2021-05-01' AND rand() <= 0.01").explain()
```

Before this pr:
```
== Physical Plan ==
*(1) Filter (isnotnull(dt#1) AND ((dt#1 = 2021-05-01) AND (rand(-6723800298719475098) <= 0.01)))
+- *(1) ColumnarToRow
   +- FileScan parquet default.t1[id#0,dt#1] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(0 paths)[], PartitionFilters: [NOT dt#1 IN (2020-01-01,2021-01-01)], PushedFilters: [], ReadSchema: struct<id:int>
```

After this pr:
```
== Physical Plan ==
*(1) Filter (rand(-2400509328955813273) <= 0.01)
+- *(1) ColumnarToRow
   +- FileScan parquet default.t1[id#0,dt#1] Batched: true, DataFilters: [], Format: Parquet, Location: InMemoryFileIndex(0 paths)[], PartitionFilters: [isnotnull(dt#1), NOT dt#1 IN (2020-01-01,2021-01-01), (dt#1 = 2021-05-01)], PushedFilters: [], ReadSchema: struct<id:int>
```

### Why are the changes needed?

Improve query performance.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test.

Closes #32405 from wangyum/SPARK-35273.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-05-01 06:02:11 +00:00
Dongjoon Hyun 4e8701a77d [SPARK-35280][K8S] Promote KubernetesUtils to DeveloperApi
### What changes were proposed in this pull request?

Since SPARK-22757, `KubernetesUtils` has been used as an important utility class by all K8s modules and `ExternalClusterManager`s. This PR aims to promote `KubernetesUtils` to `DeveloperApi` in order to maintain it officially in a backward compatible way at Apache Spark 3.2.0.

### Why are the changes needed?

Apache Spark 3.1.1 made the `Kubernetes` module GA and provides an extensible external cluster manager framework. To build an `ExternalClusterManager` for the K8s environment, the `KubernetesUtils` class is crucial and needs to be stable. By promoting it to a subset of the K8s developer API, we can maintain it in a more sustainable way and give better, stable functionality to K8s users.

In this PR, `Since` annotations denote the version of the last change to each function signature, because these are going to become public in Apache Spark 3.2.0 (see the sketch after the table below).

| Version | Function Name |
|-|-|
| 2.3.0 | parsePrefixedKeyValuePairs |
| 2.3.0 | requireNandDefined |
| 2.3.0 | parsePrefixedKeyValuePairs |
| 2.4.0 | parseMasterUrl |
| 3.0.0 | requireBothOrNeitherDefined |
| 3.0.0 | requireSecondIfFirstIsDefined |
| 3.0.0 | selectSparkContainer |
| 3.0.0 | formatPairsBundle |
| 3.0.0 | formatPodState |
| 3.0.0 | containersDescription |
| 3.0.0 | containerStatusDescription |
| 3.0.0 | formatTime |
| 3.0.0 | uniqueID |
| 3.0.0 | buildResourcesQuantities |
| 3.0.0 | uploadAndTransformFileUris |
| 3.0.0 | uploadFileUri |
| 3.0.0 | requireBothOrNeitherDefined |
| 3.0.0 | buildPodWithServiceAccount |
| 3.0.0 | isLocalAndResolvable |
| 3.1.1 | renameMainAppResource |
| 3.1.1 | addOwnerReference |
| 3.2.0 | loadPodFromTemplate |
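
A hedged sketch of what the promotion looks like in code; the object name and method body below are illustrative, not copied from `KubernetesUtils`:

```scala
import org.apache.spark.annotation.{DeveloperApi, Since}

// Illustrative only: a utility promoted to a developer API, with @Since
// recording the version in which each signature last changed.
@Since("3.2.0")
@DeveloperApi
object KubernetesUtilsSketch {
  @Since("2.4.0")
  def parseMasterUrl(url: String): String = url.stripPrefix("k8s://") // simplified body
}
```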

### Does this PR introduce _any_ user-facing change?

Yes, but these are only new API additions.

### How was this patch tested?

Pass the CIs.

Closes #32406 from dongjoon-hyun/SPARK-35280.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-04-30 11:39:18 -07:00
ulysses-you 39889df32a [SPARK-35264][SQL] Support AQE side broadcastJoin threshold
### What changes were proposed in this pull request?

~~This PR aims to add a new AQE optimizer rule `DynamicJoinSelection`. Like other AQE partition number configs, this rule add a new broadcast threshold config `spark.sql.adaptive.autoBroadcastJoinThreshold`.~~
This PR aims to add a flag in `Statistics` to distinguish AQE stats from normal stats, so that we can isolate some SQL configs between AQE and the normal planner.

### Why are the changes needed?

The main idea here is to isolate the join configs between the normal planner and the AQE planner, which share the same code path.

Actually, we do not fully trust the static stats when deciding whether a broadcast hash join can be built. In our experience it is very common that Spark throws a broadcast timeout or a driver-side OOM exception when executing a somewhat large plan. And because a broadcast join is not reversible, if we convert a join to a broadcast hash join at planning time, AQE cannot optimize it again; so it makes sense to decide whether to broadcast on the AQE side using a different SQL config.

### Does this PR introduce _any_ user-facing change?

Yes, a new config `spark.sql.adaptive.autoBroadcastJoinThreshold` is added.
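
A usage sketch, assuming the new config takes a byte-size string like the existing `spark.sql.autoBroadcastJoinThreshold` (values are illustrative):

```scala
import org.apache.spark.sql.SparkSession

object AqeBroadcastThresholdSketch {
  def main(args: Array[String]): Unit = {
    // Keep the static planner conservative while letting AQE, which sees
    // accurate runtime statistics, broadcast larger tables.
    val spark = SparkSession.builder()
      .master("local[*]")
      .config("spark.sql.adaptive.enabled", "true")
      .config("spark.sql.autoBroadcastJoinThreshold", "10MB")
      .config("spark.sql.adaptive.autoBroadcastJoinThreshold", "64MB")
      .getOrCreate()
    spark.stop()
  }
}
```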

### How was this patch tested?

Add new test.

Closes #32391 from ulysses-you/SPARK-35264.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-04-30 09:16:21 +00:00
Angerszhuuuu 11ea255283 [SPARK-35111][SQL] Support Cast string to year-month interval
### What changes were proposed in this pull request?
Support casting strings to year-month intervals.
The supported formats are as below:
```
ANSI_STYLE, like
INTERVAL -'-10-1' YEAR TO MONTH
HIVE_STYLE like
10-1 or -10-1

Rules from the SQL standard about ANSI_STYLE:

<interval literal> ::=
  INTERVAL [ <sign> ] <interval string> <interval qualifier>
<interval string> ::=
  <quote> <unquoted interval string> <quote>
<unquoted interval string> ::=
  [ <sign> ] { <year-month literal> | <day-time literal> }
<year-month literal> ::=
  <years value> [ <minus sign> <months value> ]
  | <months value>
<years value> ::=
  <datetime value>
<months value> ::=
  <datetime value>
<datetime value> ::=
  <unsigned integer>
<unsigned integer> ::= <digit>...
```
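
A hedged usage sketch, assuming both this cast and the ANSI type-name parsing from SPARK-35285 are available; output formatting is not shown:

```scala
object YearMonthIntervalCastSketch {
  def main(args: Array[String]): Unit = {
    val spark = org.apache.spark.sql.SparkSession.builder()
      .master("local[*]").getOrCreate()
    // "10-1" means 10 years and 1 month; a leading '-' negates the interval.
    spark.sql("SELECT CAST('10-1' AS INTERVAL YEAR TO MONTH)").show(false)
    spark.sql("SELECT CAST('-10-1' AS INTERVAL YEAR TO MONTH)").show(false)
    spark.stop()
  }
}
```
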
### Why are the changes needed?
Support Cast string to year-month interval

### Does this PR introduce _any_ user-facing change?
Users can cast a year-month interval string to `YearMonthIntervalType`.

### How was this patch tested?
Added UT

Closes #32266 from AngersZhuuuu/SPARK-SPARK-35111.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-04-30 08:03:07 +03:00
William Hyun ac8813e37c [SPARK-35277][BUILD] Upgrade snappy to 1.1.8.4
### What changes were proposed in this pull request?
This PR aims to upgrade snappy to version 1.1.8.4.

### Why are the changes needed?
This will bring the latest bug fixes and improvements.
- https://github.com/xerial/snappy-java/blob/master/Milestone.md#snappy-java-1183-2021-01-20

    - Make pure-java Snappy thread-safe
    - Improved SnappyFramedInput/OutputStream performance by using java.util.zip.CRC32C

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?
Pass the CIs.

Closes #32402 from williamhyun/snappy1184.

Authored-by: William Hyun <william@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-04-29 21:26:16 -07:00
lipzhu 77e9152898 [SPARK-35255][BUILD] Automated formatting for Scala Code for Blank Lines
### What changes were proposed in this pull request?

https://github.com/databricks/scala-style-guide#blanklines
https://scalameta.org/scalafmt/docs/configuration.html#newlinestoplevelstatements

### How was this patch tested?

Manually tested by modifying a few files and running ./dev/scalafmt then checking that ./dev/scalastyle still passed.

Closes #32383 from lipzhu/SPARK-35255.

Authored-by: lipzhu <lipzhu@ebay.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2021-04-30 11:45:58 +09:00
Kousuke Saruta e8bf8fe213 [SPARK-35047][SQL] Allow Json datasources to write non-ascii characters as codepoints
### What changes were proposed in this pull request?

This PR proposes to enable the JSON datasources to write non-ascii characters as codepoints.
To enable/disable this feature, I introduce a new option `writeNonAsciiCharacterAsCodePoint` for JSON datasources.
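
A hedged usage sketch of the new option (the path and data are placeholders):

```scala
object JsonCodePointSketch {
  def main(args: Array[String]): Unit = {
    val spark = org.apache.spark.sql.SparkSession.builder()
      .master("local[*]").getOrCreate()
    import spark.implicits._

    // With the option enabled, non-ASCII characters such as "é" should be
    // written as Unicode escape sequences (code points) instead of raw UTF-8.
    Seq("café", "naïve").toDF("word")
      .write
      .option("writeNonAsciiCharacterAsCodePoint", "true")
      .mode("overwrite")
      .json("/tmp/json-codepoint-sketch")

    spark.stop()
  }
}
```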

### Why are the changes needed?

The JSON specification allows code points as literals, but Spark SQL's JSON datasources don't offer a way to write them.
It would be great if we could write non-ASCII characters as code points, which is a platform-neutral representation.

### Does this PR introduce _any_ user-facing change?

Yes. Users can write non-ascii characters as codepoints with JSON datasources.

### How was this patch tested?

New test.

Closes #32147 from sarutak/json-unicode-write.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-04-29 09:50:15 -07:00
Kousuke Saruta 8a5af37c25 [SPARK-35268][BUILD] Upgrade GenJavadoc to 0.17
### What changes were proposed in this pull request?

This PR upgrades `GenJavadoc` to `0.17`.

### Why are the changes needed?

This version seems to include a fix for an issue which can happen with Scala 2.13.5.
https://github.com/lightbend/genjavadoc/releases/tag/v0.17

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

I confirmed that the build succeeds with the following commands.
```
# For Scala 2.12
$ build/sbt -Phive -Phive-thriftserver -Pyarn -Pmesos -Pkubernetes -Phadoop-cloud -Pspark-ganglia-lgpl -Pkinesis-asl -Pdocker-integration-tests -Pkubernetes-integration-tests unidoc

# For Scala 2.13
build/sbt -Phive -Phive-thriftserver -Pyarn -Pmesos -Pkubernetes -Phadoop-cloud -Pspark-ganglia-lgpl -Pkinesis-asl -Pdocker-integration-tests -Pkubernetes-integration-tests -Pscala-2.13 unidoc
```

Closes #32392 from sarutak/upgrade-genjavadoc-0.17.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-04-29 09:47:14 -07:00
attilapiros 738cf7f8ff [SPARK-35009][CORE] Avoid creating multiple python worker monitor threads for the same worker and same task context
### What changes were proposed in this pull request?

With this PR Spark avoids creating multiple monitor threads for the same worker and same task context.

### Why are the changes needed?

Without this change, unnecessary threads will be created. It can even cause job failures, for example when a coalesce (without shuffle) goes from a high partition number to a very low one. The following exception comes from exactly such a run:

```
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0) (192.168.1.210 executor driver): java.lang.OutOfMemoryError: unable to create new native thread
	at java.lang.Thread.start0(Native Method)
	at java.lang.Thread.start(Thread.java:717)
	at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:166)
	at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.rdd.CoalescedRDD.$anonfun$compute$1(CoalescedRDD.scala:99)
	at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490)
	at scala.collection.Iterator.foreach(Iterator.scala:941)
	at scala.collection.Iterator.foreach$(Iterator.scala:941)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
	at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
	at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)
	at scala.collection.TraversableOnce.to(TraversableOnce.scala:315)
	at scala.collection.TraversableOnce.to$(TraversableOnce.scala:313)
	at scala.collection.AbstractIterator.to(Iterator.scala:1429)
	at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:307)
	at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:307)
	at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1429)
	at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:294)
	at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:288)
	at scala.collection.AbstractIterator.toArray(Iterator.scala:1429)
	at org.apache.spark.rdd.RDD.$anonfun$collect$2(RDD.scala:1030)
	at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2260)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:131)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:498)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1437)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:501)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.failJobAndIndependentStages(DAGScheduler.scala:2262)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2(DAGScheduler.scala:2211)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$abortStage$2$adapted(DAGScheduler.scala:2210)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:2210)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1(DAGScheduler.scala:1083)
	at org.apache.spark.scheduler.DAGScheduler.$anonfun$handleTaskSetFailed$1$adapted(DAGScheduler.scala:1083)
	at scala.Option.foreach(Option.scala:407)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:1083)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:2449)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2391)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:2380)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:49)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:872)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2220)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2241)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2260)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2285)
	at org.apache.spark.rdd.RDD.$anonfun$collect$1(RDD.scala:1030)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:414)
	at org.apache.spark.rdd.RDD.collect(RDD.scala:1029)
	at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:180)
	at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:282)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:238)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.OutOfMemoryError: unable to create new native thread
	at java.lang.Thread.start0(Native Method)
	at java.lang.Thread.start(Thread.java:717)
	at org.apache.spark.api.python.BasePythonRunner.compute(PythonRunner.scala:166)
	at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:65)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:373)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:337)
	at org.apache.spark.rdd.CoalescedRDD.$anonfun$compute$1(CoalescedRDD.scala:99)
	at scala.collection.Iterator$$anon$11.nextCur(Iterator.scala:484)
	at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:490)
	at scala.collection.Iterator.foreach(Iterator.scala:941)
	at scala.collection.Iterator.foreach$(Iterator.scala:941)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1429)
	at scala.collection.generic.Growable.$plus$plus$eq(Growable.scala:62)
	at scala.collection.generic.Growable.$plus$plus$eq$(Growable.scala:53)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:105)
	at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:49)
	at scala.collection.TraversableOnce.to(TraversableOnce.scala:315)
	at scala.collection.TraversableOnce.to$(TraversableOnce.scala:313)
	at scala.collection.AbstractIterator.to(Iterator.scala:1429)
	at scala.collection.TraversableOnce.toBuffer(TraversableOnce.scala:307)
	at scala.collection.TraversableOnce.toBuffer$(TraversableOnce.scala:307)
	at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1429)
	at scala.collection.TraversableOnce.toArray(TraversableOnce.scala:294)
	at scala.collection.TraversableOnce.toArray$(TraversableOnce.scala:288)
	at scala.collection.AbstractIterator.toArray(Iterator.scala:1429)
	at org.apache.spark.rdd.RDD.$anonfun$collect$2(RDD.scala:1030)
	at org.apache.spark.SparkContext.$anonfun$runJob$5(SparkContext.scala:2260)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:131)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:498)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1437)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:501)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	... 1 more
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

I manually tested with the following Python script (`reproduce-SPARK-35009.py`):

```
import pyspark

conf = pyspark.SparkConf().setMaster("local[*]").setAppName("Test1")
sc = pyspark.SparkContext.getOrCreate(conf)

rows = 70000
data = list(range(rows))
rdd = sc.parallelize(data, rows)
assert rdd.getNumPartitions() == rows
rdd0 = rdd.filter(lambda x: False)
data = rdd0.coalesce(1).collect()
assert data == []
```

Spark submit:
```
$ ./bin/spark-submit reproduce-SPARK-35009.py
```

#### With this change

Checking the number of monitor threads with jcmd:
```
$ jcmd
85273 sun.tools.jcmd.JCmd
85227 org.apache.spark.deploy.SparkSubmit reproduce-SPARK-35009.py
41020 scala.tools.nsc.MainGenericRunner
$ jcmd 85227 Thread.print | grep -c "Monitor for python"
2
$ jcmd 85227 Thread.print | grep -c "Monitor for python"
2
...
$ jcmd 85227 Thread.print | grep -c "Monitor for python"
2
$ jcmd 85227 Thread.print | grep -c "Monitor for python"
2
$ jcmd 85227 Thread.print | grep -c "Monitor for python"
2
$ jcmd 85227 Thread.print | grep -c "Monitor for python"
2
```
<img width="859" alt="Screenshot 2021-04-14 at 16 06 51" src="https://user-images.githubusercontent.com/2017933/114731755-4969b980-9d42-11eb-8ec5-f60b217bdd96.png">

#### Without this change

```
...
$ jcmd 90052 Thread.print | grep -c "Monitor for python"
5645
..
```

<img width="856" alt="Screenshot 2021-04-14 at 16 30 18" src="https://user-images.githubusercontent.com/2017933/114731724-4373d880-9d42-11eb-9f9b-d976bf2530e2.png">

Closes #32169 from attilapiros/SPARK-35009.

Authored-by: attilapiros <piros.attila.zsolt@gmail.com>
Signed-off-by: attilapiros <piros.attila.zsolt@gmail.com>
2021-04-29 18:38:31 +02:00
lipzhu 4e3daa5994 [SPARK-35254][BUILD] Upgrade SBT to 1.5.1
### What changes were proposed in this pull request?

This PR aims to upgrade SBT to 1.5.1.

### Why are the changes needed?

https://github.com/sbt/sbt/releases/tag/v1.5.1

### Does this PR introduce _any_ user-facing change?

NO.

### How was this patch tested?

Pass the SBT CIs (Build/Test/Docs/Plugins).

Closes #32382 from lipzhu/SPARK-35254.

Authored-by: lipzhu <lipzhu@ebay.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-04-29 09:32:43 -07:00
yangjie01 7b78e34417 [SPARK-35269][BUILD] Upgrade commons-lang3 to 3.12.0
### What changes were proposed in this pull request?

This PR aims to upgrade Apache commons-lang3 to 3.12.0.

### Why are the changes needed?
This version will bring the latest bug fixes as follows:

- https://commons.apache.org/proper/commons-lang/changes-report.html#a3.12.0

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass the Jenkins or GitHub Action

Closes #32393 from LuciferYang/lang3-to-312.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2021-04-29 09:27:28 -07:00
yi.wu 068b6c8be6 [SPARK-35234][CORE] Reserve the format of stage failureMessage
### What changes were proposed in this pull request?

`failureMessage` is already formatted, but `replaceAll("\n", " ")` destroyed the format. This PR fixed it.

### Why are the changes needed?

The formatted error message is easier to read and debug.

### Does this PR introduce _any_ user-facing change?

Yes, users see the clear error message in the application log.

(Note: I changed the test a little bit to make it throw an exception intentionally. The test itself is good.)

Before:
![2141619490903_ pic_hd](https://user-images.githubusercontent.com/16397174/116177970-5a092f00-a747-11eb-9a0f-017391e80c8b.jpg)

After:

![2151619490955_ pic_hd](https://user-images.githubusercontent.com/16397174/116177981-5ecde300-a747-11eb-90ef-fd16e906beeb.jpg)

### How was this patch tested?

Manually tested.

Closes #32356 from Ngone51/format-stage-error-message.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: attilapiros <piros.attila.zsolt@gmail.com>
2021-04-29 16:33:36 +02:00
Kousuke Saruta 132cbf0c8c [SPARK-35105][SQL] Support multiple paths for ADD FILE/JAR/ARCHIVE commands
### What changes were proposed in this pull request?

This PR extends the `ADD FILE/JAR/ARCHIVE` commands to take multiple path arguments, as Hive does.

### Why are the changes needed?

To make those commands more useful.

### Does this PR introduce _any_ user-facing change?

Yes. In the current implementation, those commands can take a path which contains whitespace without enclosing it in either `'` or `"`, but after this change, users need to enclose such paths.
I've noted this incompatibility in the migration guide.
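
A hedged usage sketch (paths are placeholders; quoting follows the note above about paths that contain whitespace):

```scala
object AddMultiplePathsSketch {
  def main(args: Array[String]): Unit = {
    val spark = org.apache.spark.sql.SparkSession.builder()
      .master("local[*]").getOrCreate()
    // Several space-separated paths in a single command; paths that contain
    // whitespace must now be quoted.
    spark.sql("ADD FILE '/tmp/data one.csv' '/tmp/data_two.csv'")
    spark.sql("ADD JAR '/tmp/udfs.jar' '/tmp/extra libs.jar'")
    spark.stop()
  }
}
```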

### How was this patch tested?

New tests.

Closes #32205 from sarutak/add-multiple-files.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
2021-04-29 13:58:51 +09:00
Kousuke Saruta 529b875901 [SPARK-35226][SQL] Support refreshKrb5Config option in JDBC datasources
### What changes were proposed in this pull request?

This PR proposes to introduce a new JDBC option, `refreshKrb5Config`, which allows changes to `krb5.conf` to be reflected.
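
A hedged usage sketch on a Kerberized JDBC read; the connection details are placeholders and the `keytab`/`principal` options are assumed to be the existing JDBC Kerberos options:

```scala
object RefreshKrb5ConfigSketch {
  def main(args: Array[String]): Unit = {
    val spark = org.apache.spark.sql.SparkSession.builder()
      .master("local[*]").getOrCreate()
    // refreshKrb5Config asks the underlying Krb5LoginModule to re-read
    // krb5.conf when it logs in.
    val df = spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://db.example.com:5432/sales")
      .option("dbtable", "public.orders")
      .option("keytab", "/etc/security/keytabs/spark.keytab")
      .option("principal", "spark@EXAMPLE.COM")
      .option("refreshKrb5Config", "true")
      .load()
    df.printSchema()
    spark.stop()
  }
}
```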

### Why are the changes needed?

In the current master, JDBC datasources can't accept `refreshKrb5Config` which is defined in `Krb5LoginModule`.
So even if we change the `krb5.conf` after establishing a connection, the change will not be reflected.

A similar issue happens when we run multiple `*KrbIntegrationSuites` at the same time.
`MiniKDC` starts and stops for every KerberosIntegrationSuite, and a different port number is recorded to `krb5.conf` each time.
Because `SecureConnectionProvider.JDBCConfiguration` doesn't take `refreshKrb5Config`, all KerberosIntegrationSuites except the first one to run see the wrong port, so those suites fail.
You can easily confirm with the following command.
```
build/sbt -Phive -Phive-thriftserver -Pdocker-integration-tests "testOnly org.apache.spark.sql.jdbc.*KrbIntegrationSuite"
```
### Does this PR introduce _any_ user-facing change?

Yes. Users can set `refreshKrb5Config` to refresh krb5 relevant configuration.

### How was this patch tested?

New test.

Closes #32344 from sarutak/kerberos-refresh-issue.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
2021-04-29 13:55:53 +09:00
Kent Yao 771356555c [SPARK-34786][SQL][FOLLOWUP] Explicitly declare DecimalType(20, 0) for Parquet UINT_64
### What changes were proposed in this pull request?

Explicitly declare DecimalType(20, 0) for Parquet UINT_64 and avoid using DecimalType.LongDecimal, which only happens to have 20 as its precision.

https://github.com/apache/spark/pull/31960#discussion_r622691560
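
For context, a small illustration (not the PR's code) of why precision 20 is the right explicit choice:

```
import org.apache.spark.sql.types.DecimalType

// The max unsigned 64-bit value, 18446744073709551615, needs 20 decimal digits,
// so Parquet UINT_64 is mapped to an explicit DecimalType(20, 0) rather than
// DecimalType.LongDecimal, which only coincidentally has precision 20.
val parquetUInt64Type = DecimalType(20, 0)
```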

### Why are the changes needed?

To fix the ambiguity.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

Not needed; passing the current CI is sufficient.

Closes #32390 from yaooqinn/SPARK-34786-F.

Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-04-29 04:51:27 +00:00
yangjie01 74b93261af [SPARK-35135][CORE] Turn the WritablePartitionedIterator from a trait into a default implementation class
### What changes were proposed in this pull request?
`WritablePartitionedIterator` is defined in `WritablePartitionedPairCollection.scala` and there are two implementations of this trait, but the code of the two implementations is duplicated.

The main change of this PR is to turn `WritablePartitionedIterator` from a trait into a default implementation class, because effectively only one implementation is needed now.
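
A simplified sketch of the resulting shape, with names and types reduced for illustration rather than copied from Spark's internals:

```
// Simplified stand-in for Spark's internal writer interface
trait PairsWriter {
  def write(key: Any, value: Any): Unit
}

// After the change: a single concrete class instead of a trait plus two
// duplicated anonymous implementations
class WritablePartitionedIterator[K, V](it: Iterator[((Int, K), V)]) {
  private var cur: ((Int, K), V) = if (it.hasNext) it.next() else null

  def writeNext(writer: PairsWriter): Unit = {
    writer.write(cur._1._2, cur._2)
    cur = if (it.hasNext) it.next() else null
  }

  def hasNext: Boolean = cur != null

  def nextPartition(): Int = cur._1._1
}
```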

### Why are the changes needed?
Clean up duplicated code.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Pass the Jenkins or GitHub Action

Closes #32232 from LuciferYang/writable-partitioned-iterator.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: yi.wu <yi.wu@databricks.com>
2021-04-29 11:46:24 +08:00
Wenchen Fan 403e4795e9 [SPARK-35244][SQL][FOLLOWUP] Add null check for the exception cause
### What changes were proposed in this pull request?

Make sure we re-throw an exception that is not null.
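
A hedged sketch of the defensive pattern, with a made-up helper name: unwrap the reflective wrapper (as in SPARK-35244), but fall back to the wrapper itself when the cause is null.

```
import java.lang.reflect.InvocationTargetException

// Made-up helper illustrating the pattern; not the PR's actual code
def invokeAndUnwrap(method: java.lang.reflect.Method, target: AnyRef, args: AnyRef*): AnyRef = {
  try {
    method.invoke(target, args: _*)
  } catch {
    case e: InvocationTargetException =>
      // re-throw the original exception only if it is non-null
      throw Option(e.getCause).getOrElse(e)
  }
}
```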

### Why are the changes needed?

to be super safe

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

N/A

Closes #32387 from cloud-fan/minor.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2021-04-29 09:21:32 +09:00
Chao Sun 86d3bb5f7d [SPARK-34981][SQL] Implement V2 function resolution and evaluation
Co-Authored-By: Chao Sun <sunchao@apple.com>
Co-Authored-By: Ryan Blue <rblue@netflix.com>

### What changes were proposed in this pull request?

This implements function resolution and evaluation for functions registered through V2 FunctionCatalog [SPARK-27658](https://issues.apache.org/jira/browse/SPARK-27658). In particular:
- Added documentation for how to define the "magic method" in `ScalarFunction`.
- Added a new expression `ApplyFunctionExpression` which evaluates input by delegating to `ScalarFunction.produceResult` method.
- Added a new expression `V2Aggregator`, which is a type of `TypedImperativeAggregate`. It's a wrapper of the V2 `AggregateFunction` and mostly delegates methods to the implementation of the latter. It also uses plain Java serde for the intermediate state.
- Added function resolution logic for `ScalarFunction` and `AggregateFunction` in `Analyzer`.
  + For `ScalarFunction`, this checks via Java reflection whether the magic method is implemented and creates an `Invoke` expression if so. Otherwise, it checks whether the default `produceResult` is overridden. If so, it creates an `ApplyFunctionExpression` which evaluates through `InternalRow`. Otherwise, an analysis exception is thrown.
  + For `AggregateFunction`, this checks whether the `update` method is overridden. If so, it converts it to a `V2Aggregator`. Otherwise, an analysis exception is thrown, similar to the `ScalarFunction` case.
- Extended existing `InMemoryTableCatalog` to add the function catalog capability. Also renamed it to `InMemoryCatalog` since it no longer only covers tables.

**Note**: this currently can successfully detect whether a subclass overrides the default `produceResult` or `update` method from the parent interface **only for Java implementations**. It seems hard in Scala to differentiate whether a subclass overrides a default method from its parent interface. In that case, it will be a runtime error instead of an analysis error.
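
A hedged Scala sketch of a scalar function that provides both evaluation paths; the function itself (`IntAdd`) is made up for illustration:

```
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.connector.catalog.functions.ScalarFunction
import org.apache.spark.sql.types.{DataType, IntegerType}

// Made-up example function; if the typed "magic method" `invoke` is found via
// reflection, the analyzer wires it up with an Invoke expression; otherwise the
// overridden produceResult is evaluated through ApplyFunctionExpression.
class IntAdd extends ScalarFunction[Int] {
  override def inputTypes(): Array[DataType] = Array(IntegerType, IntegerType)
  override def resultType(): DataType = IntegerType
  override def name(): String = "int_add"

  // magic method: strongly typed, avoids row-level access
  def invoke(left: Int, right: Int): Int = left + right

  // fallback: row-based evaluation
  override def produceResult(input: InternalRow): Int =
    input.getInt(0) + input.getInt(1)
}
```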

A few TODOs:
- Extend `V2SessionCatalog` with the function catalog. This seems a little tricky since APIs such as the V2 `FunctionCatalog`'s `loadFunction` differ from the V1 `SessionCatalog`'s `lookupFunction`.
- Add magic method for `AggregateFunction`.
- Type coercion when looking up functions

### Why are the changes needed?

As the V2 FunctionCatalog APIs are finalized, we should integrate them with the function resolution and evaluation process so that they are actually useful.

### Does this PR introduce _any_ user-facing change?

Yes, now a function exposed through V2 FunctionCatalog can be analyzed and evaluated.

### How was this patch tested?

Added new unit tests.

Closes #32082 from sunchao/resolve-func-v2.

Lead-authored-by: Chao Sun <sunchao@apple.com>
Co-authored-by: Chao Sun <sunchao@apache.org>
Co-authored-by: Chao Sun <sunchao@uber.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-04-28 17:21:49 +00:00
ulysses-you 0bcf348438 [SPARK-34781][SQL][FOLLOWUP] Adjust the order of AQE optimizer rules
### What changes were proposed in this pull request?

Reorder `DemoteBroadcastHashJoin` and `EliminateUnnecessaryJoin`.

### Why are the changes needed?

Skip the unnecessary check in `DemoteBroadcastHashJoin` when `EliminateUnnecessaryJoin` takes effect.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

No results are affected.

Closes #32380 from ulysses-you/SPARK-34781-FOLLOWUP.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-04-28 13:59:24 +00:00
ulysses-you 8b62c2964d [SPARK-35214][SQL] OptimizeSkewedJoin support ShuffledHashJoinExec
### What changes were proposed in this pull request?

Add `ShuffledHashJoin` pattern check in `OptimizeSkewedJoin` so that we can optimize it.

### Why are the changes needed?

Currently, we already support all join types through hints, which makes it easy to choose the join implementation.

We would choose `ShuffledHashJoin` if one table is not big but is over the broadcast threshold. It's better if we can also optimize it in `OptimizeSkewedJoin`.
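
A hedged sketch of forcing a shuffled hash join under AQE so the skew optimization can apply; the tables, sizes, and local session are placeholders:

```
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("skewed-shj").getOrCreate()

spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")

// Placeholder tables standing in for a skewed large table and a medium one
// that is too big to broadcast
val large  = spark.range(0L, 1000000L).withColumnRenamed("id", "key")
val medium = spark.range(0L, 100000L).withColumnRenamed("id", "key")

// The SHUFFLE_HASH hint selects ShuffledHashJoinExec; with this change,
// OptimizeSkewedJoin can split skewed partitions for it as well
val joined = large.hint("SHUFFLE_HASH").join(medium, "key")
joined.explain()
```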

### Does this PR introduce _any_ user-facing change?

Probably yes; the execution plan in AQE mode may change.

### How was this patch tested?

Improved existing tests in `AdaptiveQueryExecSuite`.

Closes #32328 from ulysses-you/SPARK-35214.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2021-04-28 16:57:57 +09:00
Angerszhuuuu 26a5e339a6 [SPARK-33976][SQL][DOCS][FOLLOWUP] Fix syntax error in select doc page
### What changes were proposed in this pull request?
Add docs about `TRANSFORM` and related functions.

### Why are the changes needed?

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Not needed.

Closes #32257 from AngersZhuuuu/SPARK-33976-followup.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2021-04-28 16:47:02 +09:00
gengjiaan 56bb8155c5 [SPARK-35085][SQL] Get columns operation should handle ANSI interval column properly
### What changes were proposed in this pull request?
This PR lets JDBC clients identify ANSI interval columns properly.

### Why are the changes needed?
This PR is similar to https://github.com/apache/spark/pull/29539.
JDBC users can query interval values through the Thrift server and create views with ANSI interval columns, e.g.
`CREATE global temp view view1 as select interval '1-1' year to month as I;`
but when they try to get the details of the columns of view1, they will fail with `Unrecognized type name: YEAR-MONTH INTERVAL`:
```
Caused by: java.lang.IllegalArgumentException: Unrecognized type name: YEAR-MONTH INTERVAL
	at org.apache.spark.sql.hive.thriftserver.SparkGetColumnsOperation.toJavaSQLType(SparkGetColumnsOperation.scala:190)
	at org.apache.spark.sql.hive.thriftserver.SparkGetColumnsOperation.$anonfun$addToRowSet$1(SparkGetColumnsOperation.scala:206)
	at scala.collection.immutable.List.foreach(List.scala:392)
	at org.apache.spark.sql.hive.thriftserver.SparkGetColumnsOperation.addToRowSet(SparkGetColumnsOperation.scala:198)
	at org.apache.spark.sql.hive.thriftserver.SparkGetColumnsOperation.$anonfun$runInternal$7(SparkGetColumnsOperation.scala:109)
	at org.apache.spark.sql.hive.thriftserver.SparkGetColumnsOperation.$anonfun$runInternal$7$adapted(SparkGetColumnsOperation.scala:109)
	at scala.Option.foreach(Option.scala:407)
	at org.apache.spark.sql.hive.thriftserver.SparkGetColumnsOperation.$anonfun$runInternal$5(SparkGetColumnsOperation.scala:109)
	at org.apache.spark.sql.hive.thriftserver.SparkGetColumnsOperation.$anonfun$runInternal$5$adapted(SparkGetColumnsOperation.scala:107)
	at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
	at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
	at org.apache.spark.sql.hive.thriftserver.SparkGetColumnsOperation.runInternal(SparkGetColumnsOperation.scala:107)
	... 34 more
```
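
A hedged sketch of the client-side path that used to fail, assuming a running Thrift server and the Hive JDBC driver on the classpath; the URL and credentials are placeholders:

```
import java.sql.DriverManager

// Placeholder connection to a local Spark Thrift server
val conn = DriverManager.getConnection("jdbc:hive2://localhost:10000/default", "user", "")
try {
  conn.createStatement().execute(
    "CREATE GLOBAL TEMP VIEW view1 AS SELECT INTERVAL '1-1' YEAR TO MONTH AS i")
  // This metadata call previously threw "Unrecognized type name: YEAR-MONTH INTERVAL"
  val cols = conn.getMetaData.getColumns(null, "global_temp", "view1", "%")
  while (cols.next()) {
    println(s"${cols.getString("COLUMN_NAME")} -> ${cols.getString("TYPE_NAME")}")
  }
} finally {
  conn.close()
}
```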

### Does this PR introduce _any_ user-facing change?
Yes. Hive JDBC clients can now recognize ANSI intervals.

### How was this patch tested?
Jenkins test.

Closes #32345 from beliefer/SPARK-35085.

Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: beliefer <beliefer@163.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-04-28 08:58:43 +03:00
PengLei 046c8c3dd6 [SPARK-34878][SQL][TESTS] Check actual sizes of year-month and day-time intervals
### What changes were proposed in this pull request?
As we now support the year-month and day-time intervals, add a test for the actual sizes of the year-month and day-time interval types.

### Why are the changes needed?
Just adds a test.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Ran `./dev/scalastyle` and the tests in `ColumnTypeSuite`.

Closes #32366 from Peng-Lei/SPARK-34878.

Authored-by: PengLei <18066542445@189.cn>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-04-28 07:48:49 +03:00
Jose Torres 253a1aee46 [SPARK-35246][SS] Don't allow streaming-batch intersects
### What changes were proposed in this pull request?
The UnsupportedOperationChecker shouldn't allow streaming-batch intersects. As described in the ticket, they can't actually be planned correctly, and even simple cases like the one below will fail:

```
  test("intersect") {
    val input = MemoryStream[Long]
    val df = input.toDS().intersect(spark.range(10).as[Long])
    testStream(df) (
      AddData(input, 1L),
      CheckAnswer(1)
    )
  }
```

### Why are the changes needed?
Users will be confused by the cryptic errors produced from trying to run an invalid query plan.

### Does this PR introduce _any_ user-facing change?
Some queries which previously failed with a poor error will now fail with a better one.

### How was this patch tested?
modified unit test

Closes #32371 from jose-torres/ossthing.

Authored-by: Jose Torres <joseph.torres@databricks.com>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
2021-04-28 10:47:11 +09:00
Wenchen Fan 10c2b68d24 [SPARK-35244][SQL] Invoke should throw the original exception
### What changes were proposed in this pull request?

This PR updates the interpreted code path of invoke expressions to unwrap the `InvocationTargetException` and re-throw the original exception.

### Why are the changes needed?

Make interpreted and codegen path consistent for invoke expressions.

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

new UT

Closes #32370 from cloud-fan/minor.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
2021-04-28 10:45:04 +09:00
Kousuke Saruta abb1f0c5d7 [SPARK-35236][SQL] Support archive files as resources for CREATE FUNCTION USING syntax
### What changes were proposed in this pull request?

This PR proposes to make the `CREATE FUNCTION USING` syntax able to take archives as resources.
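
A hedged sketch of what this enables; the function class, jar, archive paths, and local session are placeholders:

```
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("create-function").getOrCreate()

// Placeholder class, jar, and archive; ARCHIVE resources are newly allowed here
spark.sql("""
  CREATE FUNCTION my_udf AS 'com.example.MyUDF'
  USING JAR '/tmp/my-udf.jar', ARCHIVE '/tmp/native-deps.zip'
""")
```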

### Why are the changes needed?

It would be useful.
The `CREATE FUNCTION USING` syntax doesn't support archives as resources because archives were not supported in Spark SQL.
Now that Spark SQL supports archives, I think we can support them in this syntax.

### Does this PR introduce _any_ user-facing change?

Yes. Users can specify archives for `CREATE FUNCTION USING` syntax.

### How was this patch tested?

New test.

Closes #32359 from sarutak/load-function-using-archive.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: hyukjinkwon <gurwls223@apache.org>
2021-04-28 10:15:21 +09:00