ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Holden Karau	991b7977b5	[SPARK-33727][K8S] Fall back from gnupg.net to openpgp.org ### What changes were proposed in this pull request? While building R docker image if we can't fetch the key from gnupg.net fall back to openpgp.org ### Why are the changes needed? gnupg.net key servers are flaky and sometimes fail to resolve or return keys. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Tried to add key on my desktop, it failed, then tried to add key with openpgp.org and it succeed. Closes #30696 from holdenk/SPARK-33727-gnupg-server-is-flaky. Authored-by: Holden Karau <hkarau@apple.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-12-10 11:35:55 +09:00
Liang-Chi Hsieh	667f64f447	[SPARK-33725][BUILD] Upgrade snappy-java to 1.1.8.2 ### What changes were proposed in this pull request? This upgrades snappy-java to 1.1.8.2. ### Why are the changes needed? Minor version upgrade that includes: - [Fixed](https://github.com/xerial/snappy-java/pull/265) an initialization issue when using a recent Mac OS X version - Support Apple Silicon (M1, Mac-aarch64) - Fixed the pure-java Snappy fallback logic when no native library for your platform is found. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Unit test. Closes #30690 from viirya/upgrade-snappy. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-12-09 14:26:53 -08:00
Anton Okolnychyi	fa9ce1d4e8	[SPARK-33722][SQL] Handle DELETE in ReplaceNullWithFalseInPredicate ### What changes were proposed in this pull request? This PR adds `DeleteFromTable` to supported plans in `ReplaceNullWithFalseInPredicate`. ### Why are the changes needed? This change allows Spark to optimize delete conditions like we optimize filters. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? This PR extends the existing test cases to also cover `DeleteFromTable`. Closes #30688 from aokolnychyi/spark-33722. Authored-by: Anton Okolnychyi <aokolnychyi@apple.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-12-09 11:42:54 -08:00
HyukjinKwon	b5399d4ef1	[SPARK-33071][SPARK-33536][SQL][FOLLOW-UP] Rename deniedMetadataKeys to nonInheritableMetadataKeys in Alias ### What changes were proposed in this pull request? This PR is a followup of https://github.com/apache/spark/pull/30488. This PR proposes to rename `Alias.deniedMetadataKeys` to `Alias.nonInheritableMetadataKeys` to make it less confusing. ### Why are the changes needed? To make it easier to maintain and read. ### Does this PR introduce _any_ user-facing change? No. This is rather a code cleanup. ### How was this patch tested? Ran the unittests written in the previous PR manually. Jenkins and GitHub Actions in this PR should also test them. Closes #30682 from HyukjinKwon/SPARK-33071-SPARK-33536. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-12-09 20:26:18 +09:00
Gengliang Wang	9959d49942	[SPARK-33719][DOC] Add make_date/make_timestamp/make_interval into the doc of ANSI Compliance ### What changes were proposed in this pull request? Add make_date/make_timestamp/make_interval into the doc of ANSI Compliance ### Why are the changes needed? Users can know that these functions throw runtime exceptions under ANSI mode if the result is not valid. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Build doc and check it in browser: ![image](https://user-images.githubusercontent.com/1097932/101608930-34a79e80-39bb-11eb-9294-9d9b8c3f6faa.png) Closes #30683 from gengliangwang/improveDoc. Authored-by: Gengliang Wang <gengliang.wang@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-12-09 19:47:20 +09:00
Dooyoung Hwang	a713a7eee3	[SPARK-33655][SQL] Improve performance of processing FETCH_PRIOR ### What changes were proposed in this pull request? Currently, when a client requests FETCH_PRIOR to Thriftserver, Thriftserver reiterates from the start position. Because Thriftserver caches a query result with an array when THRIFTSERVER_INCREMENTAL_COLLECT feature is off, FETCH_PRIOR can be implemented without reiterating the result. A trait FeatureIterator is added in order to separate the implementation for iterator and an array. Also, FeatureIterator supports moves cursor with absolute position, which will be useful for the implementation of FETCH_RELATIVE, FETCH_ABSOLUTE. ### Why are the changes needed? For better performance of Thriftserver. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? FetchIteratorSuite Closes #30600 from Dooyoung-Hwang/refactor_with_fetch_iterator. Authored-by: Dooyoung Hwang <dooyoung.hwang@sk.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-12-09 18:35:24 +09:00
suqilong	48f93af9f3	[SPARK-33669] Wrong error message from YARN application state monitor when sc.stop in yarn client mode ### What changes were proposed in this pull request? This change make InterruptedIOException to be treated as InterruptedException when closing YarnClientSchedulerBackend, which doesn't log error with "YARN application has exited unexpectedly xxx" ### Why are the changes needed? For YarnClient mode, when stopping YarnClientSchedulerBackend, it first tries to interrupt Yarn application monitor thread. In MonitorThread.run() it catches InterruptedException to gracefully response to stopping request. But client.monitorApplication method also throws InterruptedIOException when the hadoop rpc call is calling. In this case, MonitorThread will not know it is interrupted, a Yarn App failed is returned with "Failed to contact YARN for application xxxxx; YARN application has exited unexpectedly with state xxxxx" is logged with error level. which confuse user a lot. ### Does this PR introduce _any_ user-facing change? Yes ### How was this patch tested? very simple patch, seems no need? Closes #30617 from sqlwindspeaker/yarn-client-interrupt-monitor. Authored-by: suqilong <suqilong@qiyi.com> Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>	2020-12-09 01:21:13 -06:00
Kent Yao	c88eddac3b	[SPARK-33641][SQL][DOC][FOLLOW-UP] Add migration guide for CHAR VARCHAR types ### What changes were proposed in this pull request? Add migration guide for CHAR VARCHAR types ### Why are the changes needed? for migration ### Does this PR introduce _any_ user-facing change? doc change ### How was this patch tested? passing ci Closes #30654 from yaooqinn/SPARK-33641-F. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-09 06:44:10 +00:00
Terry Kim	29fed23ba1	[SPARK-33703][SQL] Migrate MSCK REPAIR TABLE to use UnresolvedTable to resolve the identifier ### What changes were proposed in this pull request? This PR proposes to migrate `MSCK REPAIR TABLE` to use `UnresolvedTable` to resolve the table identifier. This allows consistent resolution rules (temp view first, etc.) to be applied for both v1/v2 commands. More info about the consistent resolution rule proposal can be found in [JIRA](https://issues.apache.org/jira/browse/SPARK-29900) or [proposal doc](https://docs.google.com/document/d/1hvLjGA8y_W_hhilpngXVub1Ebv8RsMap986nENCFnrg/edit?usp=sharing). Note that `MSCK REPAIR TABLE` is not supported for v2 tables. ### Why are the changes needed? The PR makes the resolution consistent behavior consistent. For example, ```scala sql("CREATE DATABASE test") sql("CREATE TABLE spark_catalog.test.t (id bigint, val string) USING csv PARTITIONED BY (id)") sql("CREATE TEMPORARY VIEW t AS SELECT 2") sql("USE spark_catalog.test") sql("MSCK REPAIR TABLE t") // works fine ``` , but after this PR: ``` sql("MSCK REPAIR TABLE t") org.apache.spark.sql.AnalysisException: t is a temp view. 'MSCK REPAIR TABLE' expects a table; line 1 pos 0 ``` , which is the consistent behavior with other commands. ### Does this PR introduce _any_ user-facing change? After this PR, `MSCK REPAIR TABLE t` in the above example is resolved to a temp view `t` first instead of `spark_catalog.test.t`. ### How was this patch tested? Updated existing tests. Closes #30664 from imback82/repair_table_V2. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-09 05:06:37 +00:00
Weichen Xu	f021f6d3c7	[MINOR][ML] Increase Bounded MLOR (without regularization) test error tolerance ### What changes were proposed in this pull request? Improve LogisticRegression test error tolerance ### Why are the changes needed? When we switch BLAS version, some of the tests will fail due to too strict error tolerance in test. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? N/A Closes #30587 from WeichenXu123/fix_lor_test. Authored-by: Weichen Xu <weichen.xu@databricks.com> Signed-off-by: Weichen Xu <weichen.xu@databricks.com>	2020-12-09 11:18:09 +08:00
Nicholas Marion	3ac70f169d	[SPARK-33695][BUILD] Upgrade to jackson to 2.10.5 and jackson-databind to 2.10.5.1 ### What changes were proposed in this pull request? Upgrade the jackson dependencies to 2.10.5 and jackson-databind to 2.10.5.1 ### Why are the changes needed? Jackson dependency has vulnerability CVE-2020-25649. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Existing unit tests. Closes #30656 from n-marion/SPARK-33695_upgrade-jackson. Authored-by: Nicholas Marion <nmarion@us.ibm.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-12-08 12:11:06 -08:00
Wenchen Fan	6fd234503c	[SPARK-32110][SQL] normalize special floating numbers in HyperLogLog++ ### What changes were proposed in this pull request? Currently, Spark treats 0.0 and -0.0 semantically equal, while it still retains the difference between them so that users can see -0.0 when displaying the data set. The comparison expressions in Spark take care of the special floating numbers and implement the correct semantic. However, Spark doesn't always use these comparison expressions to compare values, and we need to normalize the special floating numbers before comparing them in these places: 1. GROUP BY 2. join keys 3. window partition keys This PR fixes one more place that compares values without using comparison expressions: HyperLogLog++ ### Why are the changes needed? Fix the query result ### Does this PR introduce _any_ user-facing change? Yes, the result of HyperLogLog++ becomes correct now. ### How was this patch tested? a new test case, and a few more test cases that pass before this PR to improve test coverage. Closes #30673 from cloud-fan/bug. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-12-08 11:41:35 -08:00
Dongjoon Hyun	c001dd49e4	[SPARK-33675][INFRA][FOLLOWUP] Schedule branch-3.1 snapshot at master branch ### What changes were proposed in this pull request? Currently, `master`/`branch-3.0`/`branch-2.4` snapshot publishing is successfully migrated from Jenkins to `GitHub Action`. - https://github.com/apache/spark/actions?query=workflow%3A%22Publish+Snapshot%22 This PR aims to schedule `branch-3.1` snapshot at `master` branch. ### Why are the changes needed? This is because it turns out that `GitHub Action Schedule` works only at `master` branch. (the default branch). - https://docs.github.com/en/free-pro-teamlatest/actions/reference/events-that-trigger-workflows#scheduled-events ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? The matrix triggering is tested at the forked branch. - https://github.com/dongjoon-hyun/spark/runs/1519015974 Closes #30674 from dongjoon-hyun/SPARK-SCHEDULE-3.1. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-12-08 10:43:41 -08:00
Josh Soref	a093d6feef	[MINOR] Spelling sql/core ### What changes were proposed in this pull request? This PR intends to fix typos in the sub-modules: * `sql/core` Split per srowen https://github.com/apache/spark/pull/30323#issuecomment-728981618 NOTE: The misspellings have been reported at `706a726f87 (commitcomment-44064356)` ### Why are the changes needed? Misspelled words make it harder to read / understand content. ### Does this PR introduce _any_ user-facing change? There are various fixes to documentation, etc... ### How was this patch tested? No testing was performed Closes #30531 from jsoref/spelling-sql-core. Authored-by: Josh Soref <jsoref@users.noreply.github.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2020-12-08 08:57:13 -06:00
Terry Kim	c05ee06f5b	[SPARK-33685][SQL] Migrate DROP VIEW command to use UnresolvedView to resolve the identifier ### What changes were proposed in this pull request? This PR introduces `UnresolvedView` in the resolution framework to resolve the identifier. This PR then migrates `DROP VIEW` to use `UnresolvedView` to resolve the table/view identifier. This allows consistent resolution rules (temp view first, etc.) to be applied for both v1/v2 commands. More info about the consistent resolution rule proposal can be found in [JIRA](https://issues.apache.org/jira/browse/SPARK-29900) or [proposal doc](https://docs.google.com/document/d/1hvLjGA8y_W_hhilpngXVub1Ebv8RsMap986nENCFnrg/edit?usp=sharing). ### Why are the changes needed? To use `UnresolvedView` for view resolution. Note that there is no resolution behavior change with this PR. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Updated existing tests. Closes #30636 from imback82/drop_view_v2. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-08 14:07:58 +00:00
Max Gekk	2b30dde249	[SPARK-33688][SQL] Migrate SHOW TABLE EXTENDED to new resolution framework ### What changes were proposed in this pull request? 1. Remove old statement `ShowTableStatement` 2. Introduce new command `ShowTableExtended` for `SHOW TABLE EXTENDED`. This PR is the first step of new V2 implementation of `SHOW TABLE EXTENDED`, see SPARK-33393. ### Why are the changes needed? This is a part of effort to make the relation lookup behavior consistent: SPARK-29900. ### Does this PR introduce _any_ user-facing change? The changes should not affect V1 tables. For V2, Spark outputs the error: ``` SHOW TABLE EXTENDED is not supported for v2 tables. ``` ### How was this patch tested? By running `SHOW TABLE EXTENDED` tests: ``` $ build/sbt -Phive-2.3 -Phive-thriftserver "test:testOnly *ShowTablesSuite" ``` Closes #30645 from MaxGekk/show-table-extended-statement. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-08 12:08:22 +00:00
luluorta	99613cd581	[SPARK-33677][SQL] Skip LikeSimplification rule if pattern contains any escapeChar ### What changes were proposed in this pull request? `LikeSimplification` rule does not work correctly for many cases that have patterns containing escape characters, for example: `SELECT s LIKE 'm%aca' ESCAPE '%' FROM t` `SELECT s LIKE 'maacaa' ESCAPE 'a' FROM t` For simpilicy, this PR makes this rule just be skipped if `pattern` contains any `escapeChar`. ### Why are the changes needed? Result corrupt. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Added Unit test. Closes #30625 from luluorta/SPARK-33677. Authored-by: luluorta <luluorta@gmail.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2020-12-08 20:45:25 +09:00
Dongjoon Hyun	031c5ef280	[SPARK-33679][SQL] Enable spark.sql.adaptive.enabled by default ### What changes were proposed in this pull request? This PR aims to enable `spark.sql.adaptive.enabled` by default for Apache Spark 3.2.0. ### Why are the changes needed? By switching the default for Apache Spark 3.2, the whole community can focus more on the stabilizing this feature in the various situation more seriously. ### Does this PR introduce _any_ user-facing change? Yes, but this is an improvement and it's supposed to have no bugs. ### How was this patch tested? Pass the CIs. Closes #30628 from dongjoon-hyun/SPARK-33679. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-12-07 23:10:35 -08:00
Dongjoon Hyun	3a6546d385	[MINOR][INFRA] Add -Pdocker-integration-tests to GitHub Action Scala 2.13 build job ### What changes were proposed in this pull request? This aims to add `-Pdocker-integration-tests` at GitHub Action job for Scala 2.13 compilation. ### Why are the changes needed? We fixed Scala 2.13 compilation of this module at https://github.com/apache/spark/pull/30660 . This PR will prevent accidental regression at that module. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass GitHub Action Scala 2.13 job. Closes #30661 from dongjoon-hyun/SPARK-DOCKER-IT. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>	2020-12-08 14:11:39 +09:00
Terry Kim	5aefc49b0f	[SPARK-33664][SQL] Migrate ALTER TABLE ... RENAME TO to use UnresolvedTableOrView to resolve identifier ### What changes were proposed in this pull request? This PR proposes to migrate `ALTER [TABLE\|ViEW] ... RENAME TO` to use `UnresolvedTableOrView` to resolve the table/view identifier. This allows consistent resolution rules (temp view first, etc.) to be applied for both v1/v2 commands. More info about the consistent resolution rule proposal can be found in [JIRA](https://issues.apache.org/jira/browse/SPARK-29900) or [proposal doc](https://docs.google.com/document/d/1hvLjGA8y_W_hhilpngXVub1Ebv8RsMap986nENCFnrg/edit?usp=sharing). ### Why are the changes needed? To use `UnresolvedTableOrView` for table/view resolution. Note that `AlterTableRenameCommand` internally resolves to a temp view first, so there is no resolution behavior change with this PR. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Updated existing tests. Closes #30610 from imback82/rename_v2. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-08 03:54:16 +00:00
Kousuke Saruta	8bcebfa59a	[SPARK-33698][BUILD][TESTS] Fix the build error of OracleIntegrationSuite for Scala 2.13 ### What changes were proposed in this pull request? This PR fixes a build error of `OracleIntegrationSuite` with Scala 2.13. ### Why are the changes needed? Build should pass with Scala 2.13. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? I confirmed that the build pass with the following command. ``` $ build/sbt -Pdocker-integration-tests -Pscala-2.13 "docker-integration-tests/test:compile" ``` Closes #30660 from sarutak/fix-docker-integration-tests-for-scala-2.13. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-12-07 19:09:59 -08:00
Ruifeng Zheng	ebd8b9357a	[SPARK-33609][ML] word2vec reduce broadcast size ### What changes were proposed in this pull request? 1, directly use float vectors instead of converting to double vectors, this is about 2x faster than using vec.axpy; 2, mark `wordList` and `wordVecNorms` lazy 3, avoid slicing in computation of `wordVecNorms` ### Why are the changes needed? halve broadcast size ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? existing testsuites Closes #30548 from zhengruifeng/w2v_float32_transform. Lead-authored-by: Ruifeng Zheng <ruifengz@foxmail.com> Co-authored-by: zhengruifeng <ruifengz@foxmail.com> Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>	2020-12-08 11:04:29 +08:00
Dongjoon Hyun	b2a79306ef	[SPARK-33680][SQL][TESTS][FOLLOWUP] Fix more test suites to have explicit confs ### What changes were proposed in this pull request? This is a follow-up for SPARK-33680 to remove the assumption on the default value of `spark.sql.adaptive.enabled` . ### Why are the changes needed? According to the test result https://github.com/apache/spark/pull/30628#issuecomment-739866168, the [previous run](https://github.com/apache/spark/pull/30628#issuecomment-739641105) didn't run all tests. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass the CIs. Closes #30655 from dongjoon-hyun/SPARK-33680. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-12-07 18:59:15 -08:00
Fokko Driesprong	e4d1c10760	[SPARK-32320][PYSPARK] Remove mutable default arguments This is bad practice, and might lead to unexpected behaviour: https://florimond.dev/blog/articles/2018/08/python-mutable-defaults-are-the-source-of-all-evil/ ``` fokkodriesprongFan spark % grep -R "={}" python \| grep def python/pyspark/resource/profile.py: def __init__(self, _java_resource_profile=None, _exec_req={}, _task_req={}): python/pyspark/sql/functions.py:def from_json(col, schema, options={}): python/pyspark/sql/functions.py:def to_json(col, options={}): python/pyspark/sql/functions.py:def schema_of_json(json, options={}): python/pyspark/sql/functions.py:def schema_of_csv(csv, options={}): python/pyspark/sql/functions.py:def to_csv(col, options={}): python/pyspark/sql/functions.py:def from_csv(col, schema, options={}): python/pyspark/sql/avro/functions.py:def from_avro(data, jsonFormatSchema, options={}): ``` ``` fokkodriesprongFan spark % grep -R "=\[\]" python \| grep def python/pyspark/ml/tuning.py: def __init__(self, bestModel, avgMetrics=[], subModels=None): python/pyspark/ml/tuning.py: def __init__(self, bestModel, validationMetrics=[], subModels=None): ``` ### What changes were proposed in this pull request? Removing the mutable default arguments. ### Why are the changes needed? Removing the mutable default arguments, and changing the signature to `Optional[...]`. ### Does this PR introduce _any_ user-facing change? No 👍 ### How was this patch tested? Using the Flake8 bugbear code analysis plugin. Closes #29122 from Fokko/SPARK-32320. Authored-by: Fokko Driesprong <fokko@apache.org> Signed-off-by: Ruifeng Zheng <ruifengz@foxmail.com>	2020-12-08 09:35:36 +08:00
Anton Okolnychyi	02508b68ec	[SPARK-33621][SQL] Add a way to inject data source rewrite rules ### What changes were proposed in this pull request? This PR adds a way to inject data source rewrite rules. ### Why are the changes needed? Right now `SparkSessionExtensions` allow us to inject optimization rules but they are added to operator optimization batch. There are cases when users need to run rules after the operator optimization batch (e.g. cases when a rule relies on the fact that expressions have been optimized). Currently, this is not possible. ### Does this PR introduce _any_ user-facing change? Yes. ### How was this patch tested? This PR comes with a new test. Closes #30577 from aokolnychyi/spark-33621-v3. Authored-by: Anton Okolnychyi <aokolnychyi@apple.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-12-07 15:32:10 -08:00
Wenchen Fan	c0874ba9f1	[SPARK-33480][SQL][FOLLOWUP] do not expose user data in error message ### What changes were proposed in this pull request? This is a followup of https://github.com/apache/spark/pull/30412. This PR updates the error message of char/varchar table insertion length check, to not expose user data. ### Why are the changes needed? This is risky to expose user data in the error message, especially the string data, as it may contain sensitive data. ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? updated tests Closes #30653 from cloud-fan/minor2. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-12-07 13:35:37 -08:00
Wenchen Fan	6aff215077	[SPARK-33693][SQL] deprecate spark.sql.hive.convertCTAS ### What changes were proposed in this pull request? This is a followup of https://github.com/apache/spark/pull/30554 . Now we have a new config for converting CREATE TABLE, we don't need the old config that only works for CTAS. ### Why are the changes needed? It's confusing for having two config while one can cover another completely. ### Does this PR introduce _any_ user-facing change? no, it's deprecating not removing. ### How was this patch tested? N/A Closes #30651 from cloud-fan/minor. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-12-07 10:50:31 -08:00
Josh Soref	c62b84a043	[MINOR] Spelling sql not core ### What changes were proposed in this pull request? This PR intends to fix typos in the sub-modules: * `sql/catalyst` * `sql/hive-thriftserver` * `sql/hive` Split per srowen https://github.com/apache/spark/pull/30323#issuecomment-728981618 NOTE: The misspellings have been reported at `706a726f87 (commitcomment-44064356)` ### Why are the changes needed? Misspelled words make it harder to read / understand content. ### Does this PR introduce _any_ user-facing change? There are various fixes to documentation, etc... ### How was this patch tested? No testing was performed Closes #30532 from jsoref/spelling-sql-not-core. Authored-by: Josh Soref <jsoref@users.noreply.github.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2020-12-07 08:40:29 -06:00
Kent Yao	da72b87374	[SPARK-33641][SQL] Invalidate new char/varchar types in public APIs that produce incorrect results ### What changes were proposed in this pull request? In this PR, we suppose to narrow the use cases of the char/varchar data types, of which are invalid now or later ### Why are the changes needed? 1. udf ```scala scala> spark.udf.register("abcd", () => "12345", org.apache.spark.sql.types.VarcharType(2)) scala> spark.sql("select abcd()").show scala.MatchError: CharType(2) (of class org.apache.spark.sql.types.VarcharType) at org.apache.spark.sql.catalyst.encoders.RowEncoder$.externalDataTypeFor(RowEncoder.scala:215) at org.apache.spark.sql.catalyst.encoders.RowEncoder$.externalDataTypeForInput(RowEncoder.scala:212) at org.apache.spark.sql.catalyst.expressions.objects.ValidateExternalType.<init>(objects.scala:1741) at org.apache.spark.sql.catalyst.encoders.RowEncoder$.$anonfun$serializerFor$3(RowEncoder.scala:175) at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:245) at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36) at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33) at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:198) at scala.collection.TraversableLike.flatMap(TraversableLike.scala:245) at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:242) at scala.collection.mutable.ArrayOps$ofRef.flatMap(ArrayOps.scala:198) at org.apache.spark.sql.catalyst.encoders.RowEncoder$.serializerFor(RowEncoder.scala:171) at org.apache.spark.sql.catalyst.encoders.RowEncoder$.apply(RowEncoder.scala:66) at org.apache.spark.sql.Dataset$.$anonfun$ofRows$2(Dataset.scala:99) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:768) at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:96) at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:611) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:768) at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:606) ... 47 elided ``` 2. spark.createDataframe ``` scala> spark.createDataFrame(spark.read.text("README.md").rdd, new org.apache.spark.sql.types.StructType().add("c", "char(1)")).show +--------------------+ \| c\| +--------------------+ \| # Apache Spark\| \| \| \|Spark is a unifie...\| \|high-level APIs i...\| \|supports general ...\| \|rich set of highe...\| \|MLlib for machine...\| \|and Structured St...\| \| \| \|<https://spark.ap...\| \| \| \|[![Jenkins Build]...\| \|[![AppVeyor Build...\| \|[![PySpark Covera...\| \| \| \| \| ``` 3. reader.schema ``` scala> spark.read.schema("a varchar(2)").text("./README.md").show(100) +--------------------+ \| a\| +--------------------+ \| # Apache Spark\| \| \| \|Spark is a unifie...\| \|high-level APIs i...\| \|supports general ...\| ``` 4. etc ### Does this PR introduce _any_ user-facing change? NO, we intend to avoid protentical breaking change ### How was this patch tested? new tests Closes #30586 from yaooqinn/SPARK-33641. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-07 13:40:15 +00:00
Linhong Liu	d730b6bdaa	[SPARK-32680][SQL] Don't Preprocess V2 CTAS with Unresolved Query ### What changes were proposed in this pull request? The analyzer rule `PreprocessTableCreation` will preprocess table creation related logical plan. But for CTAS, if the sub-query can't be resolved, preprocess it will cause "Invalid call to toAttribute on unresolved object" (instead of a user-friendly error msg: "table or view not found"). This PR fixes this wrongly preprocess for CTAS using V2 catalog. ### Why are the changes needed? bug fix ### Does this PR introduce _any_ user-facing change? The error message for CTAS with a non-exists table changed from: `UnresolvedException: Invalid call to toAttribute on unresolved object, tree: xxx` to `AnalysisException: Table or view not found: xxx` ### How was this patch tested? added test Closes #30637 from linhongliu-db/fix-ctas. Authored-by: Linhong Liu <linhong.liu@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-07 13:25:43 +00:00
Yuming Wang	1e0c006748	[SPARK-33617][SQL] Add default parallelism configuration for Spark SQL queries ### What changes were proposed in this pull request? This pr add default parallelism configuration(`spark.sql.default.parallelism`) for Spark SQL and make it effective for `LocalTableScan`. ### Why are the changes needed? Avoid generating small files for INSERT INTO TABLE from VALUES, for example: ```sql CREATE TABLE t1(id int) USING parquet; INSERT INTO TABLE t1 VALUES (1), (2), (3), (4), (5), (6), (7), (8); ``` Before this pr: ``` -rw-r--r-- 1 root root 421 Dec 1 01:54 part-00000-4d5a3a89-2995-4328-b2ae-908febbbaf4a-c000.snappy.parquet -rw-r--r-- 1 root root 421 Dec 1 01:54 part-00001-4d5a3a89-2995-4328-b2ae-908febbbaf4a-c000.snappy.parquet -rw-r--r-- 1 root root 421 Dec 1 01:54 part-00002-4d5a3a89-2995-4328-b2ae-908febbbaf4a-c000.snappy.parquet -rw-r--r-- 1 root root 421 Dec 1 01:54 part-00003-4d5a3a89-2995-4328-b2ae-908febbbaf4a-c000.snappy.parquet -rw-r--r-- 1 root root 421 Dec 1 01:54 part-00004-4d5a3a89-2995-4328-b2ae-908febbbaf4a-c000.snappy.parquet -rw-r--r-- 1 root root 421 Dec 1 01:54 part-00005-4d5a3a89-2995-4328-b2ae-908febbbaf4a-c000.snappy.parquet -rw-r--r-- 1 root root 421 Dec 1 01:54 part-00006-4d5a3a89-2995-4328-b2ae-908febbbaf4a-c000.snappy.parquet -rw-r--r-- 1 root root 421 Dec 1 01:54 part-00007-4d5a3a89-2995-4328-b2ae-908febbbaf4a-c000.snappy.parquet -rw-r--r-- 1 root root 0 Dec 1 01:54 _SUCCESS ``` After this pr and set `spark.sql.files.minPartitionNum` to 1: ``` -rw-r--r-- 1 root root 452 Dec 1 01:59 part-00000-6de50c79-e305-4f8d-b6ae-39f46b2619c6-c000.snappy.parquet -rw-r--r-- 1 root root 0 Dec 1 01:59 _SUCCESS ``` ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Unit test. Closes #30559 from wangyum/SPARK-33617. Lead-authored-by: Yuming Wang <yumwang@ebay.com> Co-authored-by: Yuming Wang <yumwang@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-12-07 21:36:52 +09:00
Max Gekk	26c0493318	[SPARK-33676][SQL] Require exact matching of partition spec to the schema in V2 `ALTER TABLE .. ADD/DROP PARTITION` ### What changes were proposed in this pull request? Check that partitions specs passed to v2 `ALTER TABLE .. ADD/DROP PARTITION` exactly match to the partition schema (all partition fields from the schema are specified in partition specs). ### Why are the changes needed? 1. To have the same behavior as V1 `ALTER TABLE .. ADD/DROP PARTITION` that output the error: ```sql spark-sql> create table tab1 (id int, a int, b int) using parquet partitioned by (a, b); spark-sql> ALTER TABLE tab1 ADD PARTITION (A='9'); Error in query: Partition spec is invalid. The spec (a) must match the partition spec (a, b) defined in table '`default`.`tab1`'; ``` 2. To prevent future errors caused by not fully specified partition specs. ### Does this PR introduce _any_ user-facing change? Yes. The V2 implementation of `ALTER TABLE .. ADD/DROP PARTITION` output the same error as V1 commands. ### How was this patch tested? By running the test suite with new UT: ``` $ build/sbt "test:testOnly *AlterTablePartitionV2SQLSuite" ``` Closes #30624 from MaxGekk/add-partition-full-spec. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-12-07 08:14:36 +00:00
Max Gekk	87c056088e	[SPARK-33671][SQL] Remove VIEW checks from V1 table commands ### What changes were proposed in this pull request? Remove VIEW checks from the following V1 commands: - `SHOW PARTITIONS` - `TRUNCATE TABLE` - `LOAD DATA` The checks are performed earlier at: `acc211d2cf/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala (L885-L889)` ### Why are the changes needed? To improve code maintenance, and remove dead codes. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By existing test suites like `v1/ShowPartitionsSuite`. 1. LOAD DATA: `acc211d2cf/sql/core/src/test/scala/org/apache/spark/sql/execution/SQLViewSuite.scala (L176-L179)` 2. TRUNCATE TABLE: `acc211d2cf/sql/core/src/test/scala/org/apache/spark/sql/execution/SQLViewSuite.scala (L180-L183)` 3. SHOW PARTITIONS: - v1/ShowPartitionsSuite Closes #30620 from MaxGekk/show-table-check-view. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-12-06 23:22:52 -08:00
Kousuke Saruta	d48ef34911	[SPARK-33684][BUILD] Upgrade httpclient from 4.5.6 to 4.5.13 ### What changes were proposed in this pull request? This PR upgrades `commons.httpclient` from `4.5.6` to `4.5.13`. 4.5.6 is released over 2 years ago and now we can use more stable `4.5.13`. https://archive.apache.org/dist/httpcomponents/httpclient/RELEASE_NOTES-4.5.x.txt ### Why are the changes needed? To follow the more stable release. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Should be done by the existing tests. Closes #30634 from sarutak/upgrade-httpclient. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-12-06 23:02:36 -08:00
Dongjoon Hyun	73412ffb3a	[SPARK-33680][SQL][TESTS] Fix PrunePartitionSuiteBase/BucketedReadWithHiveSupportSuite not to depend on the default conf ### What changes were proposed in this pull request? This PR updates `PrunePartitionSuiteBase/BucketedReadWithHiveSupportSuite` to have the require conf explicitly. ### Why are the changes needed? The unit test should not depend on the default configurations. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? According to https://github.com/apache/spark/pull/30628 , this seems to be the only ones. Pass the CIs. Closes #30631 from dongjoon-hyun/SPARK-CONF-AGNO. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-12-06 19:34:54 -08:00
Kousuke Saruta	e88f0d4a24	[SPARK-33683][INFRA] Remove -Djava.version=11 from Scala 2.13 build in GitHub Actions ### What changes were proposed in this pull request? This PR removes `-Djava.version=11` from the build command for Scala 2.13 in the GitHub Actions' job. In the GitHub Actions' job, the build command for Scala 2.13 is defined as follows. ``` ./build/sbt -Pyarn -Pmesos -Pkubernetes -Phive -Phive-thriftserver -Phadoop-cloud -Pkinesis-asl -Djava.version=11 -Pscala-2.13 compile test:compile ``` Though, Scala 2.13 build uses Java 8 rather than 11 so let's remove `-Djava.version=11`. ### Why are the changes needed? To build with consistent configuration. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Should be done by GitHub Actions' workflow. Closes #30633 from sarutak/scala-213-java11. Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-12-06 17:57:19 -08:00
Max Gekk	29096a8869	[SPARK-33670][SQL] Verify the partition provider is Hive in v1 SHOW TABLE EXTENDED ### What changes were proposed in this pull request? Invoke the check `DDLUtils.verifyPartitionProviderIsHive()` from V1 implementation of `SHOW TABLE EXTENDED` when partition specs are specified. This PR is some kind of follow up https://github.com/apache/spark/pull/16373 and https://github.com/apache/spark/pull/15515. ### Why are the changes needed? To output an user friendly error with recommendation like " ... partition metadata is not stored in the Hive metastore. To import this information into the metastore, run `msck repair table tableName` " instead of silently output an empty result. ### Does this PR introduce _any_ user-facing change? Yes. ### How was this patch tested? By running the affected test suites, in particular: ``` $ build/sbt -Phive-2.3 -Phive-thriftserver "hive/test:testOnly *PartitionProviderCompatibilitySuite" ``` Closes #30618 from MaxGekk/show-table-extended-verifyPartitionProviderIsHive. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-12-07 10:21:04 +09:00
Dongjoon Hyun	e32de29bce	[SPARK-33675][INFRA] Add GitHub Action job to publish snapshot ### What changes were proposed in this pull request? This PR aims to add `GitHub Action` job to publish daily snapshot for master branch. - https://repository.apache.org/content/groups/snapshots/org/apache/spark/spark-core_2.12/3.2.0-SNAPSHOT/ For the other branches, I'll make adjusted backports. - For `branch-3.1`, we can specify the checkout `ref` to `branch-3.1`. - For `branch-2.4` and `branch-3.0`, we can publish at every commit since the traffic is low. - https://github.com/apache/spark/pull/30630 (branch-3.0) - https://github.com/apache/spark/pull/30629 (branch-2.4 LTS) ### Why are the changes needed? After this series of jobs, this will reduce our maintenance burden permanently from AmpLab Jenkins by removing the following completely. https://amplab.cs.berkeley.edu/jenkins/view/Spark%20Packaging/ For now, AmpLab Jenkins doesn't have a job for `branch-3.1`. We can do it by ourselves by `GitHub Action`. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? The snapshot publishing is tested here at PR trigger. Since this PR adds a scheduled job, we cannot test in this PR. - https://github.com/dongjoon-hyun/spark/runs/1505792859 Apache Infra team finished the setup here. - https://issues.apache.org/jira/browse/INFRA-21167 Closes #30623 from dongjoon-hyun/SPARK-33675. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-12-07 10:05:28 +09:00
Terry Kim	119539fd49	[SPARK-33663][SQL] Uncaching should not be called on non-existing temp views ### What changes were proposed in this pull request? This PR proposes to fix a misleading logs in the following scenario when uncaching is called on non-existing views: ``` scala> sql("CREATE TABLE table USING parquet AS SELECT 2") res0: org.apache.spark.sql.DataFrame = [] scala> val df = spark.table("table") df: org.apache.spark.sql.DataFrame = [2: int] scala> df.createOrReplaceTempView("t2") 20/12/04 10:16:24 WARN CommandUtils: Exception when attempting to uncache $name org.apache.spark.sql.AnalysisException: Table or view not found: t2;; 'UnresolvedRelation [t2], [], false at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1(CheckAnalysis.scala:113) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.$anonfun$checkAnalysis$1$adapted(CheckAnalysis.scala:93) at org.apache.spark.sql.catalyst.trees.TreeNode.foreachUp(TreeNode.scala:183) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis(CheckAnalysis.scala:93) at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.checkAnalysis$(CheckAnalysis.scala:90) at org.apache.spark.sql.catalyst.analysis.Analyzer.checkAnalysis(Analyzer.scala:152) at org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$executeAndCheck$1(Analyzer.scala:172) at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:214) at org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:169) at org.apache.spark.sql.execution.QueryExecution.$anonfun$analyzed$1(QueryExecution.scala:73) at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111) at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:138) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:768) at org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:138) at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:73) at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:71) at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:63) at org.apache.spark.sql.Dataset$.$anonfun$ofRows$1(Dataset.scala:90) at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:768) at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:88) at org.apache.spark.sql.DataFrameReader.table(DataFrameReader.scala:889) at org.apache.spark.sql.SparkSession.table(SparkSession.scala:589) at org.apache.spark.sql.internal.CatalogImpl.uncacheTable(CatalogImpl.scala:476) at org.apache.spark.sql.execution.command.CommandUtils$.uncacheTableOrView(CommandUtils.scala:392) at org.apache.spark.sql.execution.command.CreateViewCommand.run(views.scala:124) ``` Since `t2` does not exist yet, it shouldn't try to uncache. ### Why are the changes needed? To fix misleading message. ### Does this PR introduce _any_ user-facing change? Yes, the above message will not be displayed if the view doesn't exist yet. ### How was this patch tested? Manually tested since this is a log message printed. Closes #30608 from imback82/fix_cache_message. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-12-07 09:48:16 +09:00
Xiao Li	b94ecf0734	[SPARK-33674][TEST] Show Slowpoke notifications in SBT tests ### What changes were proposed in this pull request? This PR is to show Slowpoke notifications in the log when running tests using SBT. For example, the test case "zero sized blocks" in ExternalShuffleServiceSuite enters the infinite loop. After this change, the log file will have a notification message every 5 minute when the test case running longer than two minutes. Below is an example message. ``` [info] ExternalShuffleServiceSuite: [info] - groupByKey without compression (101 milliseconds) [info] - shuffle non-zero block size (3 seconds, 186 milliseconds) [info] - shuffle serializer (3 seconds, 189 milliseconds) [info] * Test still running after 2 minute, 1 seconds: suite name: ExternalShuffleServiceSuite, test name: zero sized blocks. [info] * Test still running after 7 minute, 1 seconds: suite name: ExternalShuffleServiceSuite, test name: zero sized blocks. [info] * Test still running after 12 minutes, 1 seconds: suite name: ExternalShuffleServiceSuite, test name: zero sized blocks. [info] * Test still running after 17 minutes, 1 seconds: suite name: ExternalShuffleServiceSuite, test name: zero sized blocks. ``` ### Why are the changes needed? When the tests/code has bug and enters the infinite loop, it is hard to tell which test cases hit some issues from the log, especially when we are running the tests in parallel. It would be nice to show the Slowpoke notifications. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Manual testing in my local dev environment. Closes #30621 from gatorsmile/addSlowpoke. Authored-by: Xiao Li <gatorsmile@gmail.com> Signed-off-by: Yuming Wang <yumwang@ebay.com>	2020-12-06 22:36:34 +08:00
Max Gekk	48297818f3	[SPARK-33667][SQL] Respect the `spark.sql.caseSensitive` config while resolving partition spec in v1 `SHOW PARTITIONS` ### What changes were proposed in this pull request? Preprocess the partition spec passed to the V1 SHOW PARTITIONS implementation `ShowPartitionsCommand`, and normalize the passed spec according to the partition columns w.r.t the case sensitivity flag spark.sql.caseSensitive. ### Why are the changes needed? V1 SHOW PARTITIONS is case sensitive in fact, and doesn't respect the SQL config spark.sql.caseSensitive which is false by default, for instance: ```sql spark-sql> CREATE TABLE tbl1 (price int, qty int, year int, month int) > USING parquet > PARTITIONED BY (year, month); spark-sql> INSERT INTO tbl1 PARTITION(year = 2015, month = 1) SELECT 1, 1; spark-sql> SHOW PARTITIONS tbl1 PARTITION(YEAR = 2015, Month = 1); Error in query: Non-partitioning column(s) [YEAR, Month] are specified for SHOW PARTITIONS; ``` The `SHOW PARTITIONS` command must show the partition `year = 2015, month = 1` specified by `YEAR = 2015, Month = 1`. ### Does this PR introduce _any_ user-facing change? Yes. After the changes, the command above works as expected: ```sql spark-sql> SHOW PARTITIONS tbl1 PARTITION(YEAR = 2015, Month = 1); year=2015/month=1 ``` ### How was this patch tested? By running the affected test suites: - `v1/ShowPartitionsSuite` - `v2/ShowPartitionsSuite` Closes #30615 from MaxGekk/show-partitions-case-sensitivity-test. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-12-06 02:56:08 -08:00
HyukjinKwon	5250841537	[SPARK-33256][PYTHON][DOCS] Clarify PySpark follows NumPy documentation style ### What changes were proposed in this pull request? This PR adds few lines about docstring style to document that PySpark follows [NumPy documentation style](https://numpydoc.readthedocs.io/en/latest/format.html). We all completed the migration to NumPy documentation style at SPARK-32085. Ideally we should have a page like https://pandas.pydata.org/docs/development/contributing_docstring.html but I would like to leave it as a future work. ### Why are the changes needed? To tell developers that PySpark now follows NumPy documentation style. ### Does this PR introduce _any_ user-facing change? No, it's a change in unreleased branches yet. ### How was this patch tested? Manually tested via `make clean html` under `python/docs`: ![Screen Shot 2020-12-06 at 1 34 50 PM](https://user-images.githubusercontent.com/6477701/101271623-d5ce0380-37c7-11eb-93ac-da73caa50c37.png) Closes #30622 from HyukjinKwon/SPARK-33256. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-12-06 01:22:24 -08:00
Chao Sun	e857e06452	[SPARK-33652][SQL] DSv2: DeleteFrom should refresh cache ### What changes were proposed in this pull request? This changes `DeleteFromTableExec` to also refresh caches referencing the original table, by passing the `refreshCache` callback to the class. Note that in order to construct the callback, I have to change `DataSourceV2ScanRelation` to contain a `DataSourceV2Relation` instead of a `Table`. ### Why are the changes needed? Currently DSv2 delete from table doesn't refresh caches. This could lead to correctness issue if the staled cache is queried later. ### Does this PR introduce _any_ user-facing change? Yes. Now delete from table in v2 also refreshes cache. ### How was this patch tested? Added a test case. Closes #30597 from sunchao/SPARK-33652. Authored-by: Chao Sun <sunchao@apple.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-12-06 01:14:22 -08:00
Prashant Sharma	6317ba29a1	[SPARK-33668][K8S][TEST] Fix flaky test "Verify logging configuration is picked from the provided ### What changes were proposed in this pull request? Fix flaky test "Verify logging configuration is picked from the provided SPARK_CONF_DIR/log4j.properties." The test is flaking, with multiple flaked instances - the reason for the failure has been similar to: ``` The code passed to eventually never returned normally. Attempted 109 times over 3.0079882413999997 minutes. Last failure message: Failure executing: GET at: https://192.168.39.167:8443/api/v1/namespaces/b37fc72a991b49baa68a2eaaa1516463/pods/spark-pi-97a9bc76308e7fe3-exec-1/log?pretty=false. Message: pods "spark-pi-97a9bc76308e7fe3-exec-1" not found. Received status: Status(apiVersion=v1, code=404, details=StatusDetails(causes=[], group=null, kind=pods, name=spark-pi-97a9bc76308e7fe3-exec-1, retryAfterSeconds=null, uid=null, additionalProperties={}), kind=Status, message=pods "spark-pi-97a9bc76308e7fe3-exec-1" not found, metadata=ListMeta(_continue=null, remainingItemCount=null, resourceVersion=null, selfLink=null, additionalProperties={}), reason=NotFound, status=Failure, additionalProperties={}).. (KubernetesSuite.scala:402) ``` https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36854/console https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36852/console https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36850/console https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/36848/console From the above failures, it seems, that executor finishes too quickly and is removed by spark before the test can complete. So, in order to mitigate this situation, one way is to turn on the flag "spark.kubernetes.executor.deleteOnTermination" ### Why are the changes needed? Fixes a flaky test. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? Existing tests. May be a few runs of jenkins integration test, may reveal if the problem is resolved or not. Closes #30616 from ScrapCodes/SPARK-33668/fix-flaky-k8s-integration-test. Authored-by: Prashant Sharma <prashsh1@in.ibm.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-12-05 23:04:55 -08:00
Terry Kim	154f604403	[MINOR] Fix string interpolation in CommandUtils.scala and KafkaDataConsumer.scala ### What changes were proposed in this pull request? This PR proposes to fix a string interpolation in `CommandUtils.scala` and `KafkaDataConsumer.scala`. ### Why are the changes needed? To fix a string interpolation bug. ### Does this PR introduce _any_ user-facing change? Yes, the string will be correctly constructed. ### How was this patch tested? Existing tests since they were used in exception/log messages. Closes #30609 from imback82/fix_cache_str_interporlation. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-12-06 12:03:14 +09:00
Wenchen Fan	1b4e35d1a8	[SPARK-33651][SQL] Allow CREATE EXTERNAL TABLE with LOCATION for data source tables ### What changes were proposed in this pull request? This PR removes the restriction and allows CREATE EXTERNAL TABLE with LOCATION for data source tables. It also moves the check from the analyzer rule `ResolveSessionCatalog` to `SessionCatalog`, so that v2 session catalog can overwrite it. ### Why are the changes needed? It's an unnecessary behavior difference that Hive serde table can be created with `CREATE EXTERNAL TABLE` if LOCATION is present, while data source table doesn't allow `CREATE EXTERNAL TABLE` at all. ### Does this PR introduce _any_ user-facing change? Yes, now `CREATE EXTERNAL TABLE ... USING ... LOCATION ...` is allowed. ### How was this patch tested? new tests Closes #30595 from cloud-fan/minor. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-12-04 16:48:31 -08:00
allisonwang-db	960d6af75d	[SPARK-33472][SQL][FOLLOW-UP] Update RemoveRedundantSorts comment ### What changes were proposed in this pull request? This PR is a follow-up for #30373 that updates the comment for RemoveRedundantSorts in QueryExecution. ### Why are the changes needed? To update an incorrect comment. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? N/A Closes #30584 from allisonwang-db/spark-33472-followup. Authored-by: allisonwang-db <66282705+allisonwang-db@users.noreply.github.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-12-04 15:15:19 -08:00
Dongjoon Hyun	b6b45bc695	[SPARK-33141][SQL][FOLLOW-UP] Fix Scala 2.13 compilation ### What changes were proposed in this pull request? This PR aims to fix Scala 2.13 compilation. ### Why are the changes needed? To recover Scala 2.13. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Pass GitHub Action Scala 2.13 build job. Closes #30611 from dongjoon-hyun/SPARK-33141. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-12-04 15:04:18 -08:00
Dongjoon Hyun	de9818f043	[SPARK-33662][BUILD] Setting version to 3.2.0-SNAPSHOT ### What changes were proposed in this pull request? This PR aims to update `master` branch version to 3.2.0-SNAPSHOT. ### Why are the changes needed? Start to prepare Apache Spark 3.2.0. ### Does this PR introduce _any_ user-facing change? N/A. ### How was this patch tested? Pass the CIs. Closes #30606 from dongjoon-hyun/SPARK-3.2. Authored-by: Dongjoon Hyun <dongjoon@apache.org> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-12-04 14:10:42 -08:00
german	d671e053e9	[SPARK-33660][DOCS][SS] Fix Kafka Headers Documentation ### What changes were proposed in this pull request? Update kafka headers documentation, type is not longer a map but an array [jira](https://issues.apache.org/jira/browse/SPARK-33660) ### Why are the changes needed? To help users ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? It is only documentation Closes #30605 from Gschiavon/SPARK-33660-fix-kafka-headers-documentation. Authored-by: german <germanschiavon@gmail.com> Signed-off-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>	2020-12-05 06:51:54 +09:00

1 2 3 4 5 ...

28742 commits