ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
GuoPhilipse	892b600ce3	[SPARK-31790][DOCS] cast(long as timestamp) show different result between Hive and Spark ### What changes were proposed in this pull request? add docs for sql migration-guide ### Why are the changes needed? let user know more about the cast scenarios in which Hive and Spark generate different results ### Does this PR introduce _any_ user-facing change? no ### How was this patch tested? no need to test Closes #28605 from GuoPhilipse/spark-docs. Lead-authored-by: GuoPhilipse <guofei_ok@126.com> Co-authored-by: GuoPhilipse <46367746+GuoPhilipse@users.noreply.github.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-05-22 22:01:38 +09:00
Jungtaek Lim (HeartSaVioR)	d2bec5e265	[SPARK-31707][SQL] Revert SPARK-30098 Use default datasource as provider for CREATE TABLE syntax ### What changes were proposed in this pull request? This patch effectively reverts SPARK-30098 via below changes: * Removed the config * Removed the changes done in parser rule * Removed the usage of config in tests * Removed tests which depend on the config * Rolled back some tests to before SPARK-30098 which were affected by SPARK-30098 * Reflect the change into docs (migration doc, create table syntax) ### Why are the changes needed? SPARK-30098 brought confusion and frustration on using create table DDL query, and we agreed about the bad effect on the change. Please go through the [discussion thread](http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-Resolve-ambiguous-parser-rule-between-two-quot-create-table-quot-s-td29051i20.html) to see the details. ### Does this PR introduce _any_ user-facing change? No, compared to Spark 2.4.x. End users tried to experiment with Spark 3.0.0 previews will see the change that the behavior is going back to Spark 2.4.x, but I believe we won't guarantee compatibility in preview releases. ### How was this patch tested? Existing UTs. Closes #28517 from HeartSaVioR/revert-SPARK-30098. Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-05-17 02:27:23 +00:00
Max Gekk	372ccba063	[SPARK-31639] Revert SPARK-27528 Use Parquet logical type TIMESTAMP_MICROS by default ### What changes were proposed in this pull request? This reverts commit `43a73e387c`. It sets `INT96` as the timestamp type while saving timestamps to parquet files. ### Why are the changes needed? To be compatible with Hive and Presto that don't support the `TIMESTAMP_MICROS` type in current stable releases. ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? By existing test suites. Closes #28450 from MaxGekk/parquet-int96. Authored-by: Max Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-05-04 17:27:02 -07:00
Kazuaki Ishizaki	35fcc8d5c5	[MINOR][DOCS] Fix typo in documents ### What changes were proposed in this pull request? Fixed typo in `docs` directory and in `project/MimaExcludes.scala` ### Why are the changes needed? Better readability of documents ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? No test needed Closes #28447 from kiszk/typo_20200504. Authored-by: Kazuaki Ishizaki <ishizaki@jp.ibm.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-05-04 16:53:50 +09:00
Yuanjian Li	7195a18bf2	[SPARK-27340][SS][TESTS][FOLLOW-UP] Rephrase API comments and simplify tests ### What changes were proposed in this pull request? - Rephrase the API doc for `Column.as` - Simplify the UTs ### Why are the changes needed? Address comments in https://github.com/apache/spark/pull/28326 ### Does this PR introduce any user-facing change? No ### How was this patch tested? New UT added. Closes #28390 from xuanyuanking/SPARK-27340-follow. Authored-by: Yuanjian Li <xyliyuanjian@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-04-30 06:24:00 +00:00
Terry Kim	36803031e8	[SPARK-30282][SQL][FOLLOWUP] SHOW TBLPROPERTIES should support views ### What changes were proposed in this pull request? This PR addresses two things: - `SHOW TBLPROPERTIES` should supports view (a regression introduced by #26921) - `SHOW TBLPROPERTIES` on a temporary view should return empty result (2.4 behavior instead of throwing `AnalysisException`. ### Why are the changes needed? It's a bug. ### Does this PR introduce any user-facing change? Yes, now `SHOW TBLPROPERTIES` works on views: ``` scala> sql("CREATE VIEW view TBLPROPERTIES('p1'='v1', 'p2'='v2') AS SELECT 1 AS c1") scala> sql("SHOW TBLPROPERTIES view").show(truncate=false) +---------------------------------+-------------+ \|key \|value \| +---------------------------------+-------------+ \|view.catalogAndNamespace.numParts\|2 \| \|view.query.out.col.0 \|c1 \| \|view.query.out.numCols \|1 \| \|p2 \|v2 \| \|view.catalogAndNamespace.part.0 \|spark_catalog\| \|p1 \|v1 \| \|view.catalogAndNamespace.part.1 \|default \| +---------------------------------+-------------+ ``` And for a temporary view: ``` scala> sql("CREATE TEMPORARY VIEW tview TBLPROPERTIES('p1'='v1', 'p2'='v2') AS SELECT 1 AS c1") scala> sql("SHOW TBLPROPERTIES tview").show(truncate=false) +---+-----+ \|key\|value\| +---+-----+ +---+-----+ ``` ### How was this patch tested? Added tests. Closes #28375 from imback82/show_tblproperties_followup. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-04-29 07:06:45 +00:00
yi.wu	463c54419b	[SPARK-31010][SQL][DOC][FOLLOW-UP] Improve deprecated warning message for untyped scala udf ### What changes were proposed in this pull request? Give more friendly warning message/migration guide of deprecated scala udf to users. ### Why are the changes needed? User can not distinguish function signature between typed and untyped scala udf. Instead, we shall tell user what to do directly. ### Does this PR introduce any user-facing change? No, it's newly added in Spark 3.0. ### How was this patch tested? Pass Jenkins. Closes #28311 from Ngone51/update_udf_doc. Authored-by: yi.wu <yi.wu@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-04-24 19:10:18 +09:00
Yuming Wang	b11e42663b	[SPARK-31381][SPARK-29245][SQL] Upgrade built-in Hive 2.3.6 to 2.3.7 ### What changes were proposed in this pull request? Hive 2.3.7 fixed these issues: - HIVE-21508: ClassCastException when initializing HiveMetaStoreClient on JDK10 or newer - HIVE-21980:Parsing time can be high in case of deeply nested subqueries - HIVE-22249: Support Parquet through HCatalog ### Why are the changes needed? Fix CCE during creating HiveMetaStoreClient in JDK11 environment: [SPARK-29245](https://issues.apache.org/jira/browse/SPARK-29245). ### Does this PR introduce any user-facing change? No. ### How was this patch tested? - [x] Test Jenkins with Hadoop 2.7 (https://github.com/apache/spark/pull/28148#issuecomment-616757840) - [x] Test Jenkins with Hadoop 3.2 on JDK11 (https://github.com/apache/spark/pull/28148#issuecomment-616294353) - [x] Manual test with remote hive metastore. Hive side: ``` export JAVA_HOME=/usr/lib/jdk1.8.0_221 export PATH=$JAVA_HOME/bin:$PATH cd /usr/lib/hive-2.3.6 # Start Hive metastore with Hive 2.3.6 bin/schematool -dbType derby -initSchema --verbose bin/hive --service metastore ``` Spark side: ``` export JAVA_HOME=/usr/lib/jdk-11.0.3 export PATH=$JAVA_HOME/bin:$PATH build/sbt clean package -Phive -Phadoop-3.2 -Phive-thriftserver export SPARK_PREPEND_CLASSES=true bin/spark-sql --conf spark.hadoop.hive.metastore.uris=thrift://localhost:9083 ``` Closes #28148 from wangyum/SPARK-31381. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-04-20 13:38:24 -07:00
gatorsmile	6c792a79c1	[SPARK-31234][SQL][FOLLOW-UP] ResetCommand should not affect static SQL Configuration ### What changes were proposed in this pull request? This PR is the follow-up PR of https://github.com/apache/spark/pull/28003 - add a migration guide - add an end-to-end test case. ### Why are the changes needed? The original PR made the major behavior change in the user-facing RESET command. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Added a new end-to-end test Closes #28265 from gatorsmile/spark-31234followup. Authored-by: gatorsmile <gatorsmile@gmail.com> Signed-off-by: gatorsmile <gatorsmile@gmail.com>	2020-04-20 13:08:55 -07:00
gatorsmile	a3d83948b8	[SPARK-31351][DOC] Migration Guide Auditing for Spark 3.0 Release ### What changes were proposed in this pull request? This PR is to audit the migration guides in Spark 3.0 release: - correct the grammar errors - clean up some items - replace HTML table by markdown table ### Why are the changes needed? N/A ### Does this PR introduce any user-facing change? No ### How was this patch tested? Screenshot: ![screencapture-127-0-0-1-4000-sql-migration-guide-html-2020-04-04-21_36_29](https://user-images.githubusercontent.com/11567269/78467043-9477d800-76bd-11ea-8ab0-3d51ea5e9fa5.png) ![Screen Shot 2020-04-04 at 9 28 13 PM](https://user-images.githubusercontent.com/11567269/78467045-98a3f580-76bd-11ea-9e4b-927bf12e683a.png) ![Screen Shot 2020-04-04 at 9 28 02 PM](https://user-images.githubusercontent.com/11567269/78467046-98a3f580-76bd-11ea-8ea3-9f13cb8d200b.png) ![Screen Shot 2020-04-04 at 9 21 40 PM](https://user-images.githubusercontent.com/11567269/78467047-993c8c00-76bd-11ea-8c29-91afc68eb590.png) Closes #28125 from gatorsmile/updateMigrationGuide3.0. Authored-by: gatorsmile <gatorsmile@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-04-08 12:27:40 +09:00
Kent Yao	3c94a7c8f5	[SPARK-29311][SQL][FOLLOWUP] Add migration guide for extracting second from datetimes ### What changes were proposed in this pull request? Add migration guide for extracting second from datetimes ### Why are the changes needed? doc the behavior change for extract expression ### Does this PR introduce any user-facing change? No ### How was this patch tested? N/A, just passing jenkins Closes #28140 from yaooqinn/SPARK-29311. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-04-07 07:09:45 +00:00
Takeshi Yamamuro	d98df7626b	[SPARK-31325][SQL][WEB UI] Control a plan explain mode in the events of SQL listeners via SQLConf ### What changes were proposed in this pull request? This PR intends to add a new SQL config for controlling a plan explain mode in the events of (e.g., `SparkListenerSQLExecutionStart` and `SparkListenerSQLAdaptiveExecutionUpdate`) SQL listeners. In the current master, the output of `QueryExecution.toString` (this is equivalent to the "extended" explain mode) is stored in these events. I think it is useful to control the content via `SQLConf`. For example, the query "Details" content (TPCDS q66 query) of a SQL tab in a Spark web UI will be changed as follows; Before this PR: ![q66-extended](https://user-images.githubusercontent.com/692303/78211668-950b4580-74e8-11ea-90c6-db52d437534b.png) After this PR: ![q66-formatted](https://user-images.githubusercontent.com/692303/78211674-9ccaea00-74e8-11ea-9d1d-43c7e2b0f314.png) ### Why are the changes needed? For better usability. ### Does this PR introduce any user-facing change? Yes; since Spark 3.1, SQL UI data adopts the `formatted` mode for the query plan explain results. To restore the behavior before Spark 3.0, you can set `spark.sql.ui.explainMode` to `extended`. ### How was this patch tested? Added unit tests. Closes #28097 from maropu/SPARK-31325. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>	2020-04-02 21:09:16 -07:00
gatorsmile	b9eafcb526	[SPARK-31088][SQL] Add back HiveContext and createExternalTable ### What changes were proposed in this pull request? Based on the discussion in the mailing list [[Proposal] Modification to Spark's Semantic Versioning Policy](http://apache-spark-developers-list.1001551.n3.nabble.com/Proposal-Modification-to-Spark-s-Semantic-Versioning-Policy-td28938.html) , this PR is to add back the following APIs whose maintenance cost are relatively small. - HiveContext - createExternalTable APIs ### Why are the changes needed? Avoid breaking the APIs that are commonly used. ### Does this PR introduce any user-facing change? Adding back the APIs that were removed in 3.0 branch does not introduce the user-facing changes, because Spark 3.0 has not been released. ### How was this patch tested? add a new test suite for createExternalTable APIs. Closes #27815 from gatorsmile/addAPIsBack. Lead-authored-by: gatorsmile <gatorsmile@gmail.com> Co-authored-by: yi.wu <yi.wu@databricks.com> Signed-off-by: gatorsmile <gatorsmile@gmail.com>	2020-03-26 23:51:15 -07:00
Wenchen Fan	4f274a4de9	[SPARK-31147][SQL] Forbid CHAR type in non-Hive-Serde tables ### What changes were proposed in this pull request? Spark introduced CHAR type for hive compatibility but it only works for hive tables. CHAR type is never documented and is treated as STRING type for non-Hive tables. However, this leads to confusing behaviors Apache Spark 3.0.0-preview2 ``` spark-sql> CREATE TABLE t(a CHAR(3)); spark-sql> INSERT INTO TABLE t SELECT 'a '; spark-sql> SELECT a, length(a) FROM t; a 2 ``` Apache Spark 2.4.5 ``` spark-sql> CREATE TABLE t(a CHAR(3)); spark-sql> INSERT INTO TABLE t SELECT 'a '; spark-sql> SELECT a, length(a) FROM t; a 3 ``` According to the SQL standard, `CHAR(3)` should guarantee all the values are of length 3. Since `CHAR(3)` is treated as STRING so Spark doesn't guarantee it. This PR forbids CHAR type in non-Hive tables as it's not supported correctly. ### Why are the changes needed? avoid confusing/wrong behavior ### Does this PR introduce any user-facing change? yes, now users can't create/alter non-Hive tables with CHAR type. ### How was this patch tested? new tests Closes #27902 from cloud-fan/char. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-03-25 09:25:55 -07:00
Wenchen Fan	1d0f54951e	[SPARK-31205][SQL] support string literal as the second argument of date_add/date_sub functions ### What changes were proposed in this pull request? https://github.com/apache/spark/pull/26412 introduced a behavior change that `date_add`/`date_sub` functions can't accept string and double values in the second parameter. This is reasonable as it's error-prone to cast string/double to int at runtime. However, using string literals as function arguments is very common in SQL databases. To avoid breaking valid use cases that the string literal is indeed an integer, this PR proposes to add ansi_cast for string literal in date_add/date_sub functions. If the string value is not a valid integer, we fail at query compiling time because of constant folding. ### Why are the changes needed? avoid breaking changes ### Does this PR introduce any user-facing change? Yes, now 3.0 can run `date_add('2011-11-11', '1')` like 2.4 ### How was this patch tested? new tests. Closes #27965 from cloud-fan/string. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-03-24 12:07:22 +08:00
gatorsmile	4d4c00c1b5	[SPARK-31151][SQL][DOC] Reorganize the migration guide of SQL ### What changes were proposed in this pull request? The current migration guide of SQL is too long for most readers to find the needed info. This PR is to group the items in the migration guide of Spark SQL based on the corresponding components. Note. This PR does not change the contents of the migration guides. Attached figure is the screenshot after the change. ![screencapture-127-0-0-1-4000-sql-migration-guide-html-2020-03-14-12_00_40](https://user-images.githubusercontent.com/11567269/76688626-d3010200-65eb-11ea-9ce7-265bc90ebb2c.png) ### Why are the changes needed? The current migration guide of SQL is too long for most readers to find the needed info. ### Does this PR introduce any user-facing change? No ### How was this patch tested? N/A Closes #27909 from gatorsmile/migrationGuideReorg. Authored-by: gatorsmile <gatorsmile@gmail.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2020-03-15 07:35:20 +09:00
gatorsmile	1c8526dc87	[SPARK-28093][FOLLOW-UP] Remove migration guide of TRIM changes ### What changes were proposed in this pull request? Since we reverted the original change in https://github.com/apache/spark/pull/27540, this PR is to remove the corresponding migration guide made in the commit https://github.com/apache/spark/pull/24948 ### Why are the changes needed? N/A ### Does this PR introduce any user-facing change? N/A ### How was this patch tested? N/A Closes #27896 from gatorsmile/SPARK-28093Followup. Authored-by: gatorsmile <gatorsmile@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-03-13 11:45:59 +09:00
Kent Yao	7b4b29e8d9	[SPARK-31131][SQL] Remove the unnecessary config spark.sql.legacy.timeParser.enabled ### What changes were proposed in this pull request? spark.sql.legacy.timeParser.enabled should be removed from SQLConf and the migration guide spark.sql.legacy.timeParsePolicy is the right one ### Why are the changes needed? fix doc ### Does this PR introduce any user-facing change? no ### How was this patch tested? Pass the jenkins Closes #27889 from yaooqinn/SPARK-31131. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-03-12 09:24:49 -07:00
Wenchen Fan	8efb71013d	[SPARK-31091] Revert SPARK-24640 Return `NULL` from `size(NULL)` by default ### What changes were proposed in this pull request? This PR reverts https://github.com/apache/spark/pull/26051 and https://github.com/apache/spark/pull/26066 ### Why are the changes needed? There is no standard requiring that `size(null)` must return null, and returning -1 looks reasonable as well. This is kind of a cosmetic change and we should avoid it if it breaks existing queries. This is similar to reverting TRIM function parameter order change. ### Does this PR introduce any user-facing change? Yes, change the behavior of `size(null)` back to be the same as 2.4. ### How was this patch tested? N/A Closes #27834 from cloud-fan/revert. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>	2020-03-11 09:55:24 -07:00
Takeshi Yamamuro	71c73d58f6	[SPARK-30279][SQL] Support 32 or more grouping attributes for GROUPING_ID ### What changes were proposed in this pull request? This pr intends to support 32 or more grouping attributes for GROUPING_ID. In the current master, an integer overflow can occur to compute grouping IDs; `e75d9afb2f/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/plans/logical/basicLogicalOperators.scala (L613)` For example, the query below generates wrong grouping IDs in the master; ``` scala> val numCols = 32 // or, 31 scala> val cols = (0 until numCols).map { i => s"c$i" } scala> sql(s"create table test_$numCols (${cols.map(c => s"$c int").mkString(",")}, v int) using parquet") scala> val insertVals = (0 until numCols).map { _ => 1 }.mkString(",") scala> sql(s"insert into test_$numCols values ($insertVals,3)") scala> sql(s"select grouping_id(), sum(v) from test_$numCols group by grouping sets ((${cols.mkString(",")}), (${cols.init.mkString(",")}))").show(10, false) scala> sql(s"drop table test_$numCols") // numCols = 32 +-------------+------+ \|grouping_id()\|sum(v)\| +-------------+------+ \|0 \|3 \| \|0 \|3 \| // Wrong Grouping ID +-------------+------+ // numCols = 31 +-------------+------+ \|grouping_id()\|sum(v)\| +-------------+------+ \|0 \|3 \| \|1 \|3 \| +-------------+------+ ``` To fix this issue, this pr change code to use long values for `GROUPING_ID` instead of int values. ### Why are the changes needed? To support more cases in `GROUPING_ID`. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Added unit tests. Closes #26918 from maropu/FixGroupingIdIssue. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2020-03-06 16:57:03 +09:00
Wenchen Fan	807ea413b4	[SPARK-31019][SQL] make it clear that people can deduplicate map keys ### What changes were proposed in this pull request? rename the config and make it non-internal. ### Why are the changes needed? Now we fail the query if duplicated map keys are detected, and provide a legacy config to deduplicate it. However, we must provide a way to get users out of this situation, instead of just rejecting to run the query. This exit strategy should always be there, while legacy config indicates that it may be removed someday. ### Does this PR introduce any user-facing change? no, just rename a config which was added in 3.0 ### How was this patch tested? add more tests for the fail behavior. Closes #27772 from cloud-fan/map. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-03-05 20:43:52 +09:00
Yuanjian Li	f7f1948a8c	[SPARK-30289][FOLLOWUP][DOC] Update the migration guide for `spark.sql.legacy.ctePrecedencePolicy` ### What changes were proposed in this pull request? Fix the migration guide document for `spark.sql.legacy.ctePrecedence.enabled`, which is introduced in #27579. ### Why are the changes needed? The config value changed. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Document only. Closes #27782 from xuanyuanking/SPARK-30829-follow. Authored-by: Yuanjian Li <xyliyuanjian@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-03-04 13:56:02 +09:00
iRakson	92a5ae2ae4	[SPARK-30234][SQL][FOLLOWUP] Rename `spark.sql.legacy.addDirectory.recursive.enabled` to `spark.sql.legacy.addSingleFileInAddFile` ### What changes were proposed in this pull request? Rename `spark.sql.legacy.addDirectory.recursive.enabled` to `spark.sql.legacy.addSingleFileInAddFile` ### Why are the changes needed? To follow the naming convention ### Does this PR introduce any user-facing change? No ### How was this patch tested? Existing UTs. Closes #27725 from iRakson/SPARK-30234_CONFIG. Authored-by: iRakson <raksonrakesh@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-03-01 10:55:41 +09:00
iRakson	a40a2f8338	[SPARK-27619][SQL][FOLLOWUP] Rename 'spark.sql.legacy.useHashOnMapType' to 'spark.sql.legacy.allowHashOnMapType' ### What changes were proposed in this pull request? Renamed configuration from `spark.sql.legacy.useHashOnMapType` to `spark.sql.legacy.allowHashOnMapType`. ### Why are the changes needed? Better readability of configuration. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Existing UTs. Closes #27719 from iRakson/SPARK-27619_FOLLOWUP. Authored-by: iRakson <raksonrakesh@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-02-28 22:57:50 +08:00
yi.wu	22dfd15a45	[SPARK-30937][DOC] Group Hive upgrade guides together ### What changes were proposed in this pull request? This PR groups all hive upgrade related migration guides inside Spark 3.0 together. Also add another behavior change of `ScriptTransform` in the new Hive section. ### Why are the changes needed? Make the doc more clearly to user. ### Does this PR introduce any user-facing change? No, new doc for Spark 3.0. ### How was this patch tested? N/A. Closes #27670 from Ngone51/hive_migration. Authored-by: yi.wu <yi.wu@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-02-27 21:29:42 +08:00
iRakson	c913b9d8b5	[SPARK-27619][SQL] MapType should be prohibited in hash expressions ### What changes were proposed in this pull request? `hash()` and `xxhash64()` cannot be used on elements of `Maptype`. A new configuration `spark.sql.legacy.useHashOnMapType` is introduced to allow users to restore the previous behaviour. When `spark.sql.legacy.useHashOnMapType` is set to false: ``` scala> spark.sql("select hash(map())"); org.apache.spark.sql.AnalysisException: cannot resolve 'hash(map())' due to data type mismatch: input to function hash cannot contain elements of MapType; line 1 pos 7; 'Project [unresolvedalias(hash(map(), 42), None)] +- OneRowRelation ``` when `spark.sql.legacy.useHashOnMapType` is set to true : ``` scala> spark.sql("set spark.sql.legacy.useHashOnMapType=true"); res3: org.apache.spark.sql.DataFrame = [key: string, value: string] scala> spark.sql("select hash(map())").first() res4: org.apache.spark.sql.Row = [42] ``` ### Why are the changes needed? As discussed in Jira, SparkSql's map hashcodes depends on their order of insertion which is not consistent with the normal scala behaviour which might confuse users. Code snippet from JIRA : ``` val a = spark.createDataset(Map(1->1, 2->2) :: Nil) val b = spark.createDataset(Map(2->2, 1->1) :: Nil) // Demonstration of how Scala Map equality is unaffected by insertion order: assert(Map(1->1, 2->2).hashCode() == Map(2->2, 1->1).hashCode()) assert(Map(1->1, 2->2) == Map(2->2, 1->1)) assert(a.first() == b.first()) // In contrast, this will print two different hashcodes: println(Seq(a, b).map(_.selectExpr("hash(*)").first())) ``` Also `MapType` is prohibited for aggregation / joins / equality comparisons #7819 and set operations #17236. ### Does this PR introduce any user-facing change? Yes. Now users cannot use hash functions on elements of `mapType`. To restore the previous behaviour set `spark.sql.legacy.useHashOnMapType` to true. ### How was this patch tested? UT added. Closes #27580 from iRakson/SPARK-27619. Authored-by: iRakson <raksonrakesh@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-02-27 01:48:12 +08:00
yi.wu	82ce4753aa	[SPARK-26580][SQL][ML][FOLLOW-UP] Throw exception when use untyped UDF by default ### What changes were proposed in this pull request? This PR proposes to throw exception by default when user use untyped UDF(a.k.a `org.apache.spark.sql.functions.udf(AnyRef, DataType)`). And user could still use it by setting `spark.sql.legacy.useUnTypedUdf.enabled` to `true`. ### Why are the changes needed? According to #23498, since Spark 3.0, the untyped UDF will return the default value of the Java type if the input value is null. For example, `val f = udf((x: Int) => x, IntegerType)`, `f($"x")` will return 0 in Spark 3.0 but null in Spark 2.4. And the behavior change is introduced due to Spark3.0 is built with Scala 2.12 by default. As a result, this might change data silently and may cause correctness issue if user still expect `null` in some cases. Thus, we'd better to encourage user to use typed UDF to avoid this problem. ### Does this PR introduce any user-facing change? Yeah. User will hit exception now when use untyped UDF. ### How was this patch tested? Added test and updated some tests. Closes #27488 from Ngone51/spark_26580_followup. Lead-authored-by: yi.wu <yi.wu@databricks.com> Co-authored-by: wuyi <yi.wu@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-02-21 14:46:54 +08:00
Wenchen Fan	704d249a56	[SPARK-26071][FOLLOWUP] Improve migration guide of disallowing map type map key ### What changes were proposed in this pull request? mention the workaround if users do want to use map type as key, and add a test to demonstrate it. ### Why are the changes needed? it's better to provide an alternative when we ban something. ### Does this PR introduce any user-facing change? no ### How was this patch tested? N/A Closes #27621 from cloud-fan/map. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-02-20 22:10:04 +08:00
Wenchen Fan	c7bece3541	[SPARK-27528][FOLLOWUP] improve migration guide ### What changes were proposed in this pull request? mention that `INT96` timestamp is still useful for interoperability. ### Why are the changes needed? Give users more context of the behavior changes. ### Does this PR introduce any user-facing change? no ### How was this patch tested? N/A Closes #27622 from cloud-fan/parquet. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-02-19 22:26:56 +08:00
yi.wu	68d7edf949	[SPARK-30812][SQL][CORE] Revise boolean config name to comply with new config naming policy ### What changes were proposed in this pull request? Revise below config names to comply with [new config naming policy](http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-naming-policy-of-Spark-configs-td28875.html): SQL: * spark.sql.execution.subquery.reuse.enabled / [SPARK-27083](https://issues.apache.org/jira/browse/SPARK-27083) * spark.sql.legacy.allowNegativeScaleOfDecimal.enabled / [SPARK-30252](https://issues.apache.org/jira/browse/SPARK-30252) * spark.sql.adaptive.optimizeSkewedJoin.enabled / [SPARK-29544](https://issues.apache.org/jira/browse/SPARK-29544) * spark.sql.legacy.property.nonReserved / [SPARK-30183](https://issues.apache.org/jira/browse/SPARK-30183) * spark.sql.streaming.forceDeleteTempCheckpointLocation.enabled / [SPARK-26389](https://issues.apache.org/jira/browse/SPARK-26389) * spark.sql.analyzer.failAmbiguousSelfJoin.enabled / [SPARK-28344](https://issues.apache.org/jira/browse/SPARK-28344) * spark.sql.adaptive.shuffle.reducePostShufflePartitions.enabled / [SPARK-30074](https://issues.apache.org/jira/browse/SPARK-30074) * spark.sql.execution.pandas.arrowSafeTypeConversion / [SPARK-25811](https://issues.apache.org/jira/browse/SPARK-25811) * spark.sql.legacy.looseUpcast / [SPARK-24586](https://issues.apache.org/jira/browse/SPARK-24586) * spark.sql.legacy.arrayExistsFollowsThreeValuedLogic / [SPARK-28052](https://issues.apache.org/jira/browse/SPARK-28052) * spark.sql.sources.ignoreDataLocality.enabled / [SPARK-29189](https://issues.apache.org/jira/browse/SPARK-29189) * spark.sql.adaptive.shuffle.fetchShuffleBlocksInBatch.enabled / [SPARK-9853](https://issues.apache.org/jira/browse/SPARK-9853) CORE: * spark.eventLog.erasureCoding.enabled / [SPARK-25855](https://issues.apache.org/jira/browse/SPARK-25855) * spark.shuffle.readHostLocalDisk.enabled / [SPARK-30235](https://issues.apache.org/jira/browse/SPARK-30235) * spark.scheduler.listenerbus.logSlowEvent.enabled / [SPARK-29001](https://issues.apache.org/jira/browse/SPARK-29001) * spark.resources.coordinate.enable / [SPARK-27371](https://issues.apache.org/jira/browse/SPARK-27371) * spark.eventLog.logStageExecutorMetrics.enabled / [SPARK-23429](https://issues.apache.org/jira/browse/SPARK-23429) ### Why are the changes needed? To comply with the config naming policy. ### Does this PR introduce any user-facing change? No. Configurations listed above are all newly added in Spark 3.0. ### How was this patch tested? Pass Jenkins. Closes #27563 from Ngone51/revise_boolean_conf_name. Authored-by: yi.wu <yi.wu@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-02-18 20:39:50 +08:00
Yuming Wang	76ddb6d835	[SPARK-30755][SQL] Update migration guide and add actionable exception for HIVE-15167 ### What changes were proposed in this pull request? [HIVE-15167](https://issues.apache.org/jira/browse/HIVE-15167) removed the `SerDe` interface. This may break custom `SerDe` builds for Hive 1.2. This PR update the migration guide for this change. ### Why are the changes needed? Otherwise: ``` 2020-01-27 05:11:20.446 - stderr> 20/01/27 05:11:20 INFO DAGScheduler: ResultStage 2 (main at NativeMethodAccessorImpl.java:0) failed in 1.000 s due to Job aborted due to stage failure: Task 0 in stage 2.0 failed 4 times, most recent failure: Lost task 0.3 in stage 2.0 (TID 13, 10.110.21.210, executor 1): java.lang.NoClassDefFoundError: org/apache/hadoop/hive/serde2/SerDe 2020-01-27 05:11:20.446 - stderr> at java.lang.ClassLoader.defineClass1(Native Method) 2020-01-27 05:11:20.446 - stderr> at java.lang.ClassLoader.defineClass(ClassLoader.java:756) 2020-01-27 05:11:20.446 - stderr> at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142) 2020-01-27 05:11:20.446 - stderr> at java.net.URLClassLoader.defineClass(URLClassLoader.java:468) 2020-01-27 05:11:20.446 - stderr> at java.net.URLClassLoader.access$100(URLClassLoader.java:74) 2020-01-27 05:11:20.446 - stderr> at java.net.URLClassLoader$1.run(URLClassLoader.java:369) 2020-01-27 05:11:20.446 - stderr> at java.net.URLClassLoader$1.run(URLClassLoader.java:363) 2020-01-27 05:11:20.446 - stderr> at java.security.AccessController.doPrivileged(Native Method) 2020-01-27 05:11:20.446 - stderr> at java.net.URLClassLoader.findClass(URLClassLoader.java:362) 2020-01-27 05:11:20.446 - stderr> at java.lang.ClassLoader.loadClass(ClassLoader.java:418) 2020-01-27 05:11:20.446 - stderr> at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:352) 2020-01-27 05:11:20.446 - stderr> at java.lang.ClassLoader.loadClass(ClassLoader.java:405) 2020-01-27 05:11:20.446 - stderr> at java.lang.ClassLoader.loadClass(ClassLoader.java:351) 2020-01-27 05:11:20.446 - stderr> at java.lang.Class.forName0(Native Method) 2020-01-27 05:11:20.446 - stderr> at java.lang.Class.forName(Class.java:348) 2020-01-27 05:11:20.446 - stderr> at org.apache.hadoop.hive.ql.plan.TableDesc.getDeserializerClass(TableDesc.java:76) ..... ``` ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Manual test Closes #27492 from wangyum/SPARK-30755. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-02-17 09:26:56 -08:00
Yuanjian Li	ab186e3659	[SPARK-25829][SQL] Add config `spark.sql.legacy.allowDuplicatedMapKeys` and change the default behavior ### What changes were proposed in this pull request? This is a follow-up for #23124, add a new config `spark.sql.legacy.allowDuplicatedMapKeys` to control the behavior of removing duplicated map keys in build-in functions. With the default value `false`, Spark will throw a RuntimeException while duplicated keys are found. ### Why are the changes needed? Prevent silent behavior changes. ### Does this PR introduce any user-facing change? Yes, new config added and the default behavior for duplicated map keys changed to RuntimeException thrown. ### How was this patch tested? Modify existing UT. Closes #27478 from xuanyuanking/SPARK-25892-follow. Authored-by: Yuanjian Li <xyliyuanjian@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-02-17 22:06:58 +08:00
Yuanjian Li	01cc852982	[SPARK-30803][DOCS] Fix the home page link for Scala API document ### What changes were proposed in this pull request? Change the link to the Scala API document. ``` $ git grep "#org.apache.spark.package" docs/_layouts/global.html: <li><a href="api/scala/index.html#org.apache.spark.package">Scala</a></li> docs/index.md:* [Spark Scala API (Scaladoc)](api/scala/index.html#org.apache.spark.package) docs/rdd-programming-guide.md:[Scala](api/scala/#org.apache.spark.package), [Java](api/java/), [Python](api/python/) and [R](api/R/). ``` ### Why are the changes needed? The home page link for Scala API document is incorrect after upgrade to 3.0 ### Does this PR introduce any user-facing change? Document UI change only. ### How was this patch tested? Local test, attach screenshots below: Before: ![image](https://user-images.githubusercontent.com/4833765/74335713-c2385300-4dd7-11ea-95d8-f5a3639d2578.png) After: ![image](https://user-images.githubusercontent.com/4833765/74335727-cbc1bb00-4dd7-11ea-89d9-4dcc1310e679.png) Closes #27549 from xuanyuanking/scala-doc. Authored-by: Yuanjian Li <xyliyuanjian@gmail.com> Signed-off-by: Sean Owen <srowen@gmail.com>	2020-02-16 09:55:03 -06:00
iRakson	926e3a1efe	[SPARK-30790] The dataType of map() should be map<null,null> ### What changes were proposed in this pull request? `spark.sql("select map()")` returns {}. After these changes it will return map<null,null> ### Why are the changes needed? After changes introduced due to #27521, it is important to maintain consistency while using map(). ### Does this PR introduce any user-facing change? Yes. Now map() will give map<null,null> instead of {}. ### How was this patch tested? UT added. Migration guide updated as well Closes #27542 from iRakson/SPARK-30790. Authored-by: iRakson <raksonrakesh@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-02-13 12:23:40 +08:00
HyukjinKwon	0045be766b	[SPARK-29462][SQL] The data type of "array()" should be array<null> ### What changes were proposed in this pull request? This brings https://github.com/apache/spark/pull/26324 back. It was reverted basically because, firstly Hive compatibility, and the lack of investigations in other DBMSes and ANSI. - In case of PostgreSQL seems coercing NULL literal to TEXT type. - Presto seems coercing `array() + array(1)` -> array of int. - Hive seems `array() + array(1)` -> array of strings Given that, the design choices have been differently made for some reasons. If we pick one of both, seems coercing to array of int makes much more sense. Another investigation was made offline internally. Seems ANSI SQL 2011, section 6.5 "<contextually typed value specification>" states: > If ES is specified, then let ET be the element type determined by the context in which ES appears. The declared type DT of ES is Case: > > a) If ES simply contains ARRAY, then ET ARRAY[0]. > > b) If ES simply contains MULTISET, then ET MULTISET. > > ES is effectively replaced by CAST ( ES AS DT ) From reading other related context, doing it to `NullType`. Given the investigation made, choosing to `null` seems correct, and we have a reference Presto now. Therefore, this PR proposes to bring it back. ### Why are the changes needed? When empty array is created, it should be declared as array<null>. ### Does this PR introduce any user-facing change? Yes, `array()` creates `array<null>`. Now `array(1) + array()` can correctly create `array(1)` instead of `array("1")`. ### How was this patch tested? Tested manually Closes #27521 from HyukjinKwon/SPARK-29462. Lead-authored-by: HyukjinKwon <gurwls223@apache.org> Co-authored-by: Aman Omer <amanomer1996@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-02-11 17:22:08 +09:00
Liang-Chi Hsieh	acfdb46a60	[SPARK-27946][SQL][FOLLOW-UP] Change doc and error message for SHOW CREATE TABLE ### What changes were proposed in this pull request? This is a follow-up for #24938 to tweak error message and migration doc. ### Why are the changes needed? Making user know workaround if SHOW CREATE TABLE doesn't work for some Hive tables. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Existing unit tests. Closes #27505 from viirya/SPARK-27946-followup. Authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Liang-Chi Hsieh <liangchi@uber.com>	2020-02-10 10:45:00 -08:00
Yuanjian Li	3db3e39f11	[SPARK-28228][SQL] Change the default behavior for name conflict in nested WITH clause ### What changes were proposed in this pull request? This is a follow-up for #25029, in this PR we throw an AnalysisException when name conflict is detected in nested WITH clause. In this way, the config `spark.sql.legacy.ctePrecedence.enabled` should be set explicitly for the expected behavior. ### Why are the changes needed? The original change might risky to end-users, it changes behavior silently. ### Does this PR introduce any user-facing change? Yes, change the config `spark.sql.legacy.ctePrecedence.enabled` as optional. ### How was this patch tested? New UT. Closes #27454 from xuanyuanking/SPARK-28228-follow. Authored-by: Yuanjian Li <xyliyuanjian@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-02-08 14:10:28 -08:00
Maxim Gekk	459e757ed4	[SPARK-30668][SQL] Support `SimpleDateFormat` patterns in parsing timestamps/dates strings ### What changes were proposed in this pull request? In the PR, I propose to partially revert the commit `51a6ba0181`, and provide a legacy parser based on `FastDateFormat` which is compatible to `SimpleDateFormat`. To enable the legacy parser, set `spark.sql.legacy.timeParser.enabled` to `true`. ### Why are the changes needed? To allow users to restore old behavior in parsing timestamps/dates using `SimpleDateFormat` patterns. The main reason for restoring is `DateTimeFormatter`'s patterns are not fully compatible to `SimpleDateFormat` patterns, see https://issues.apache.org/jira/browse/SPARK-30668 ### Does this PR introduce any user-facing change? Yes ### How was this patch tested? - Added new test to `DateFunctionsSuite` - Restored additional test cases in `JsonInferSchemaSuite`. Closes #27441 from MaxGekk/support-simpledateformat. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-02-05 18:48:45 +08:00
Liang-Chi Hsieh	7631275f97	[SPARK-25040][SQL][FOLLOWUP] Add legacy config for allowing empty strings for certain types in json parser ### What changes were proposed in this pull request? This is a follow-up for #22787. In #22787 we disallowed empty strings for json parser except for string and binary types. This follow-up adds a legacy config for restoring previous behavior of allowing empty string. ### Why are the changes needed? Adding a legacy config to make migration easy for Spark users. ### Does this PR introduce any user-facing change? Yes. If set this legacy config to true, the users can restore previous behavior prior to Spark 3.0.0. ### How was this patch tested? Unit test. Closes #27456 from viirya/SPARK-25040-followup. Lead-authored-by: Liang-Chi Hsieh <liangchi@uber.com> Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-02-04 17:22:23 -08:00
Maxim Gekk	0202b675af	[SPARK-26618][SQL][FOLLOWUP] Describe the behavior change of typed `TIMESTAMP`/`DATE` literals ### What changes were proposed in this pull request? In the PR, I propose to update the SQL migration guide, and clarify behavior change of typed `TIMESTAMP` and `DATE` literals for input strings without time zone information - local timestamp and date strings. ### Why are the changes needed? To inform users that the typed literals may change their behavior in Spark 3.0 because of different sources of the default time zone - JVM system time zone in Spark 2.4 and earlier, and `spark.sql.session.timeZone` in Spark 3.0. ### Does this PR introduce any user-facing change? No ### How was this patch tested? N/A Closes #27435 from MaxGekk/timestamp-lit-migration-guide. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-02-04 16:33:34 +09:00
Yuming Wang	cd5f03a3ba	[SPARK-27686][DOC][SQL] Update migration guide for make Hive 2.3 dependency by default ### What changes were proposed in this pull request? We have upgraded the built-in Hive from 1.2 to 2.3. This may need to set `spark.sql.hive.metastore.version` and `spark.sql.hive.metastore.jars` according to the version of your Hive metastore. Example: ``` --conf spark.sql.hive.metastore.version=1.2.1 --conf spark.sql.hive.metastore.jars=/root/hive-1.2.1-lib/* ``` Otherwise: ``` org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to fetch table spark_27686. Invalid method name: 'get_table_req'; at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:110) at org.apache.spark.sql.hive.HiveExternalCatalog.tableExists(HiveExternalCatalog.scala:841) at org.apache.spark.sql.catalyst.catalog.ExternalCatalogWithListener.tableExists(ExternalCatalogWithListener.scala:146) at org.apache.spark.sql.catalyst.catalog.SessionCatalog.tableExists(SessionCatalog.scala:431) at org.apache.spark.sql.execution.command.CreateDataSourceTableCommand.run(createDataSourceTables.scala:52) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult$lzycompute(commands.scala:70) at org.apache.spark.sql.execution.command.ExecutedCommandExec.sideEffectResult(commands.scala:68) at org.apache.spark.sql.execution.command.ExecutedCommandExec.executeCollect(commands.scala:79) at org.apache.spark.sql.Dataset.$anonfun$logicalPlan$1(Dataset.scala:226) at org.apache.spark.sql.Dataset.$anonfun$withAction$1(Dataset.scala:3487) at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$4(SQLExecution.scala:100) at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:160) at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:87) at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3485) at org.apache.spark.sql.Dataset.<init>(Dataset.scala:226) at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:96) at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:607) ... 47 elided Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Unable to fetch table spark_27686. Invalid method name: 'get_table_req' at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:1282) at org.apache.spark.sql.hive.client.HiveClientImpl.getRawTableOption(HiveClientImpl.scala:422) at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$tableExists$1(HiveClientImpl.scala:436) at scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.java:23) at org.apache.spark.sql.hive.client.HiveClientImpl.$anonfun$withHiveState$1(HiveClientImpl.scala:322) at org.apache.spark.sql.hive.client.HiveClientImpl.liftedTree1$1(HiveClientImpl.scala:256) at org.apache.spark.sql.hive.client.HiveClientImpl.retryLocked(HiveClientImpl.scala:255) at org.apache.spark.sql.hive.client.HiveClientImpl.withHiveState(HiveClientImpl.scala:305) at org.apache.spark.sql.hive.client.HiveClientImpl.tableExists(HiveClientImpl.scala:436) at org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$tableExists$1(HiveExternalCatalog.scala:841) at scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.java:23) at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:100) ... 63 more Caused by: org.apache.thrift.TApplicationException: Invalid method name: 'get_table_req' at org.apache.thrift.TServiceClient.receiveBase(TServiceClient.java:79) at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.recv_get_table_req(ThriftHiveMetastore.java:1567) at org.apache.hadoop.hive.metastore.api.ThriftHiveMetastore$Client.get_table_req(ThriftHiveMetastore.java:1554) at org.apache.hadoop.hive.metastore.HiveMetaStoreClient.getTable(HiveMetaStoreClient.java:1350) at org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient.getTable(SessionHiveMetaStoreClient.java:127) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.hadoop.hive.metastore.RetryingMetaStoreClient.invoke(RetryingMetaStoreClient.java:173) at com.sun.proxy.$Proxy38.getTable(Unknown Source) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.apache.hadoop.hive.metastore.HiveMetaStoreClient$SynchronizedHandler.invoke(HiveMetaStoreClient.java:2336) at com.sun.proxy.$Proxy38.getTable(Unknown Source) at org.apache.hadoop.hive.ql.metadata.Hive.getTable(Hive.java:1274) ... 74 more ``` ### Why are the changes needed? Improve documentation. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? ```SKIP_API=1 jekyll build```: ![image](https://user-images.githubusercontent.com/5399861/73531432-67a50b80-4455-11ea-9401-5cad12fd3d14.png) Closes #27161 from wangyum/SPARK-27686. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-02-01 20:50:47 -08:00
Liang-Chi Hsieh	8eecc20b11	[SPARK-27946][SQL] Hive DDL to Spark DDL conversion USING "show create table" ## What changes were proposed in this pull request? This patch adds a DDL command `SHOW CREATE TABLE AS SERDE`. It is used to generate Hive DDL for a Hive table. For original `SHOW CREATE TABLE`, it now shows Spark DDL always. If given a Hive table, it tries to generate Spark DDL. For Hive serde to data source conversion, this uses the existing mapping inside `HiveSerDe`. If can't find a mapping there, throws an analysis exception on unsupported serde configuration. It is arguably that some Hive fileformat + row serde might be mapped to Spark data source, e.g., CSV. It is not included in this PR. To be conservative, it may not be supported. For Hive serde properties, for now this doesn't save it to Spark DDL because it may not useful to keep Hive serde properties in Spark table. ## How was this patch tested? Added test. Closes #24938 from viirya/SPARK-27946. Lead-authored-by: Liang-Chi Hsieh <viirya@gmail.com> Co-authored-by: Liang-Chi Hsieh <liangchi@uber.com> Signed-off-by: Xiao Li <gatorsmile@gmail.com>	2020-01-31 19:55:25 -08:00
angerszhu	246c398d59	[SPARK-30435][DOC] Update doc of Supported Hive Features ### What changes were proposed in this pull request? add supported hive features ### Why are the changes needed? update doc ### Does this PR introduce any user-facing change? Before change UI info: ![image](https://user-images.githubusercontent.com/46485123/72592726-29302c80-393e-11ea-8f4d-76432d4cb658.png) After this pr: ![image](https://user-images.githubusercontent.com/46485123/72593569-42d27380-3940-11ea-91c7-f2998d476364.png) ![image](https://user-images.githubusercontent.com/46485123/72962218-afd98380-3dee-11ea-82a1-0bf533ebfd9f.png) ### How was this patch tested? For PR about Spark Doc Web UI, we need to show UI format before and after pr. We can build our local web server about spark docs with reference `$SPARK_PROJECT/docs/README.md` You should install python and ruby in your env and also install plugin like below ```sh $ sudo gem install jekyll jekyll-redirect-from rouge # Following is needed only for generating API docs $ sudo pip install sphinx pypandoc mkdocs $ sudo Rscript -e 'install.packages(c("knitr", "devtools", "rmarkdown"), repos="https://cloud.r-project.org/")' $ sudo Rscript -e 'devtools::install_version("roxygen2", version = "5.0.1", repos="https://cloud.r-project.org/")' $ sudo Rscript -e 'devtools::install_version("testthat", version = "1.0.2", repos="https://cloud.r-project.org/")' ``` Then we call `jekyll serve --watch` after build we see below message ``` ~/Documents/project/AngersZhu/spark/sql Moving back into docs dir. Making directory api/sql cp -r ../sql/site/. api/sql Source: /Users/angerszhu/Documents/project/AngersZhu/spark/docs Destination: /Users/angerszhu/Documents/project/AngersZhu/spark/docs/_site Incremental build: disabled. Enable with --incremental Generating... done in 24.717 seconds. Auto-regeneration: enabled for '/Users/angerszhu/Documents/project/AngersZhu/spark/docs' Server address: http://127.0.0.1:4000 Server running... press ctrl-c to stop. ``` Visit http://127.0.0.1:4000 to get your newest change in doc web. Closes #27106 from AngersZhuuuu/SPARK-30435. Authored-by: angerszhu <angers.zhu@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-01-29 20:55:29 -08:00
Takeshi Yamamuro	ec1fb6b4e1	[SPARK-30234][SQL][FOLLOWUP] Add `.enabled` in the suffix of the ADD FILE legacy option ### What changes were proposed in this pull request? This pr intends to rename `spark.sql.legacy.addDirectory.recursive` into `spark.sql.legacy.addDirectory.recursive.enabled`. ### Why are the changes needed? For consistent option names. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? N/A Closes #27372 from maropu/SPARK-30234-FOLLOWUP. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2020-01-29 12:23:59 +09:00
Kent Yao	f2d71f5838	[SPARK-30591][SQL] Remove the nonstandard SET OWNER syntax for namespaces ### What changes were proposed in this pull request? This pr removes the nonstandard `SET OWNER` syntax for namespaces and changes the owner reserved properties from `ownerName` and `ownerType` to `owner`. ### Why are the changes needed? the `SET OWNER` syntax for namespaces is hive-specific and non-sql standard, we need a more future-proofing design before we implement user-facing changes for SQL security issues ### Does this PR introduce any user-facing change? no, just revert an unpublic syntax ### How was this patch tested? modified uts Closes #27300 from yaooqinn/SPARK-30591. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-01-22 16:00:05 +08:00
yi.wu	ff39c9271c	[SPARK-30252][SQL] Disallow negative scale of Decimal ### What changes were proposed in this pull request? This PR propose to disallow negative `scale` of `Decimal` in Spark. And this PR brings two behavior changes: 1) for literals like `1.23E4BD` or `1.23E4`(with `spark.sql.legacy.exponentLiteralAsDecimal.enabled`=true, see [SPARK-29956](https://issues.apache.org/jira/browse/SPARK-29956)), we set its `(precision, scale)` to (5, 0) rather than (3, -2); 2) add negative `scale` check inside the decimal method if it exposes to set `scale` explicitly. If check fails, `AnalysisException` throws. And user could still use `spark.sql.legacy.allowNegativeScaleOfDecimal.enabled` to restore the previous behavior. ### Why are the changes needed? According to SQL standard, > 4.4.2 Characteristics of numbers An exact numeric type has a precision P and a scale S. P is a positive integer that determines the number of significant digits in a particular radix R, where R is either 2 or 10. S is a non-negative integer. scale of Decimal should always be non-negative. And other mainstream databases, like Presto, PostgreSQL, also don't allow negative scale. Presto: ``` presto:default> create table t (i decimal(2, -1)); Query 20191213_081238_00017_i448h failed: line 1:30: mismatched input '-'. Expecting: <integer>, <type> create table t (i decimal(2, -1)) ``` PostgrelSQL: ``` postgres=# create table t(i decimal(2, -1)); ERROR: NUMERIC scale -1 must be between 0 and precision 2 LINE 1: create table t(i decimal(2, -1)); ^ ``` And, actually, Spark itself already doesn't allow to create table with negative decimal types using SQL: ``` scala> spark.sql("create table t(i decimal(2, -1))"); org.apache.spark.sql.catalyst.parser.ParseException: no viable alternative at input 'create table t(i decimal(2, -'(line 1, pos 28) == SQL == create table t(i decimal(2, -1)) ----------------------------^^^ at org.apache.spark.sql.catalyst.parser.ParseException.withCommand(ParseDriver.scala:263) at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:130) at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:48) at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:76) at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:605) at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111) at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:605) ... 35 elided ``` However, it is still possible to create such table or `DatFrame` using Spark SQL programming API: ``` scala> val tb = CatalogTable( TableIdentifier("test", None), CatalogTableType.MANAGED, CatalogStorageFormat.empty, StructType(StructField("i", DecimalType(2, -1) ) :: Nil)) ``` ``` scala> spark.sql("SELECT 1.23E4BD") res2: org.apache.spark.sql.DataFrame = [1.23E+4: decimal(3,-2)] ``` while, these two different behavior could make user confused. On the other side, even if user creates such table or `DataFrame` with negative scale decimal type, it can't write data out if using format, like `parquet` or `orc`. Because these formats have their own check for negative scale and fail on it. ``` scala> spark.sql("SELECT 1.23E4BD").write.saveAsTable("parquet") 19/12/13 17:37:04 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0) java.lang.IllegalArgumentException: Invalid DECIMAL scale: -2 at org.apache.parquet.Preconditions.checkArgument(Preconditions.java:53) at org.apache.parquet.schema.Types$BasePrimitiveBuilder.decimalMetadata(Types.java:495) at org.apache.parquet.schema.Types$BasePrimitiveBuilder.build(Types.java:403) at org.apache.parquet.schema.Types$BasePrimitiveBuilder.build(Types.java:309) at org.apache.parquet.schema.Types$Builder.named(Types.java:290) at org.apache.spark.sql.execution.datasources.parquet.SparkToParquetSchemaConverter.convertField(ParquetSchemaConverter.scala:428) at org.apache.spark.sql.execution.datasources.parquet.SparkToParquetSchemaConverter.convertField(ParquetSchemaConverter.scala:334) at org.apache.spark.sql.execution.datasources.parquet.SparkToParquetSchemaConverter.$anonfun$convert$2(ParquetSchemaConverter.scala:326) at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238) at scala.collection.Iterator.foreach(Iterator.scala:941) at scala.collection.Iterator.foreach$(Iterator.scala:941) at scala.collection.AbstractIterator.foreach(Iterator.scala:1429) at scala.collection.IterableLike.foreach(IterableLike.scala:74) at scala.collection.IterableLike.foreach$(IterableLike.scala:73) at org.apache.spark.sql.types.StructType.foreach(StructType.scala:99) at scala.collection.TraversableLike.map(TraversableLike.scala:238) at scala.collection.TraversableLike.map$(TraversableLike.scala:231) at org.apache.spark.sql.types.StructType.map(StructType.scala:99) at org.apache.spark.sql.execution.datasources.parquet.SparkToParquetSchemaConverter.convert(ParquetSchemaConverter.scala:326) at org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport.init(ParquetWriteSupport.scala:97) at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:388) at org.apache.parquet.hadoop.ParquetOutputFormat.getRecordWriter(ParquetOutputFormat.java:349) at org.apache.spark.sql.execution.datasources.parquet.ParquetOutputWriter.<init>(ParquetOutputWriter.scala:37) at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anon$1.newInstance(ParquetFileFormat.scala:150) at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.newOutputWriter(FileFormatDataWriter.scala:124) at org.apache.spark.sql.execution.datasources.SingleDirectoryDataWriter.<init>(FileFormatDataWriter.scala:109) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:264) at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$15(FileFormatWriter.scala:205) at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90) at org.apache.spark.scheduler.Task.run(Task.scala:127) at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:441) at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377) at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:444) at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) at java.lang.Thread.run(Thread.java:748) ``` So, I think it would be better to disallow negative scale totally and make behaviors above be consistent. ### Does this PR introduce any user-facing change? Yes, if `spark.sql.legacy.allowNegativeScaleOfDecimal.enabled=false`, user couldn't create Decimal value with negative scale anymore. ### How was this patch tested? Added new tests in `ExpressionParserSuite` and `DecimalSuite`; Updated `SQLQueryTestSuite`. Closes #26881 from Ngone51/nonnegative-scale. Authored-by: yi.wu <yi.wu@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-01-21 21:09:48 +08:00
Kent Yao	24efa43826	[SPARK-30019][SQL] Add the owner property to v2 table ### What changes were proposed in this pull request? Add `owner` property to v2 table, it is reversed by `TableCatalog`, indicates the table's owner. ### Why are the changes needed? enhance ownership management of catalog API ### Does this PR introduce any user-facing change? yes, add 1 reserved property - `owner` , and it is not allowed to use in OPTIONS/TBLPROPERTIES anymore, only if legacy on ### How was this patch tested? add uts Closes #27249 from yaooqinn/SPARK-30019. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2020-01-21 10:37:49 +08:00
Terry Kim	19a10597a8	[SPARK-30282][DOCS][FOLLOWUP] Update SQL migration guide for SHOW TBLPROPERTIES ### What changes were proposed in this pull request? This PR adds a migration guide for `SHOW TBLPROPERTIES` for Apache Spark 3.0.0. ### Why are the changes needed? The behavior of `SHOW TBLPROPERTIES` changed when the table does not exist. The migration guide reflects this user facing change. ### Does this PR introduce any user-facing change? Yes. This is a documentation change. ### How was this patch tested? No tests were added because this is a doc change. Closes #27276 from imback82/SPARK-30282-followup. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-01-19 14:44:12 -08:00
Dongjoon Hyun	505693c282	[SPARK-28152][DOCS][FOLLOWUP] Add a migration guide for MsSQLServer JDBC dialect ### What changes were proposed in this pull request? This PR adds a migration guide for MsSQLServer JDBC dialect for Apache Spark 2.4.4 and 2.4.5. ### Why are the changes needed? Apache Spark 2.4.4 updates the type mapping correctly according to MS SQL Server, but missed to mention that in the migration guide. In addition, 2.4.4 adds a configuration for the legacy behavior. ### Does this PR introduce any user-facing change? Yes. This is a documentation change. ![screenshot](https://user-images.githubusercontent.com/9700541/72649944-d6517780-3933-11ea-92be-9d4bf38e2eda.png) ### How was this patch tested? Manually generate and see the doc. Closes #27270 from dongjoon-hyun/SPARK-28152-DOC. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-01-17 17:20:15 -08:00
Dongjoon Hyun	fdbded3f71	[SPARK-30312][DOCS][FOLLOWUP] Add a migration guide ### What changes were proposed in this pull request? This is a followup of https://github.com/apache/spark/pull/26956 to add a migration document for 2.4.5. ### Why are the changes needed? New legacy configuration will restore the previous behavior safely. ### Does this PR introduce any user-facing change? This PR updates the doc. <img width="763" alt="screenshot" src="https://user-images.githubusercontent.com/9700541/72639939-9da5a400-391b-11ea-87b1-14bca15db5a6.png"> ### How was this patch tested? Build the document and see the change manually. Closes #27269 from dongjoon-hyun/SPARK-30312. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2020-01-17 13:40:50 -08:00

1 2

87 commits