ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Ali Afroozeh	68034a8056	[SPARK-30072][SQL] Create dedicated planner for subqueries ### What changes were proposed in this pull request? This PR changes subquery planning by calling the planner and plan preparation rules on the subquery plan directly. Previously, we created a `QueryExecution` instance for each subquery to get the executedPlan, which re-ran analysis and optimization on the subquery's plan. Running the analysis again on an optimized query plan can have unwanted consequences, as some rules, for example `DecimalPrecision`, are not idempotent. As an example, consider the expression `1.7 * avg(a)` which after applying the `DecimalPrecision` rule becomes: ``` promote_precision(1.7) * promote_precision(avg(a)) ``` After the optimization, more specifically the constant folding rule, this expression becomes: ``` 1.7 * promote_precision(avg(a)) ``` Now if we run the analyzer on this optimized query again, we will get: ``` promote_precision(1.7) * promote_precision(promote_precision(avg(a))) ``` Which will later be optimized as: ``` 1.7 * promote_precision(promote_precision(avg(a))) ``` As can be seen, re-running the analysis and optimization on this expression results in an expression with extra nested promote_precision nodes. Adding unneeded nodes to the plan is problematic because it can eliminate situations where we can reuse the plan. We opted to introduce dedicated planners for subqueries, instead of making the DecimalPrecision rule idempotent, because this eliminates this entire category of problems. Another benefit is that planning time for subqueries is reduced. ### How was this patch tested? Unit tests Closes #26705 from dbaliafroozeh/CreateDedicatedPlannerForSubqueries. Authored-by: Ali Afroozeh <ali.afroozeh@databricks.com> Signed-off-by: herman <herman@databricks.com>	2019-12-02 20:56:40 +01:00
Wenchen Fan	e271664a01	[MINOR][SQL] Rename config name to spark.sql.analyzer.failAmbiguousSelfJoin.enabled ### What changes were proposed in this pull request? add `.enabled` postfix to `spark.sql.analyzer.failAmbiguousSelfJoin`. ### Why are the changes needed? to follow the existing naming style ### Does this PR introduce any user-facing change? no ### How was this patch tested? not needed Closes #26694 from cloud-fan/conf. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-12-02 21:05:06 +08:00
Kent Yao	4e073f3c50	[SPARK-30047][SQL] Support interval types in UnsafeRow ### What changes were proposed in this pull request? Optimize aggregates on interval values from sort-based to hash-based, so that we can use the `org.apache.spark.sql.catalyst.expressions.RowBasedKeyValueBatch` for better performance. ### Why are the changes needed? improve aggregates ### Does this PR introduce any user-facing change? no ### How was this patch tested? add ut and existing ones Closes #26680 from yaooqinn/SPARK-30047. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-12-02 20:47:23 +08:00
LantaoJin	04a5b8f5f8	[SPARK-29839][SQL] Supporting STORED AS in CREATE TABLE LIKE ### What changes were proposed in this pull request? Since SPARK-29421 (#26097), we can specify a different table provider for `CREATE TABLE LIKE` via `USING provider`. Hive supports the `STORED AS` syntax for choosing a new file format: ```sql CREATE TABLE tbl(a int) STORED AS TEXTFILE; CREATE TABLE tbl2 LIKE tbl STORED AS PARQUET; ``` For Hive compatibility, we should also support `STORED AS` in `CREATE TABLE LIKE`. ### Why are the changes needed? See https://github.com/apache/spark/pull/26097#issue-327424759 ### Does this PR introduce any user-facing change? Add a new syntax based on current CTL: CREATE TABLE tbl2 LIKE tbl [STORED AS hiveFormat]; ### How was this patch tested? Add UTs. Closes #26466 from LantaoJin/SPARK-29839. Authored-by: LantaoJin <jinlantao@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-12-02 16:11:58 +08:00
Yuanjian Li	169415ffac	[SPARK-30025][CORE] Continuous shuffle block fetching should be disabled by default when the old fetch protocol is used ### What changes were proposed in this pull request? Disable continuous shuffle block fetching when the old fetch protocol is in use. ### Why are the changes needed? The new feature of continuous shuffle block fetching depends on the latest version of the shuffle fetch protocol. We should keep this constraint in `BlockStoreShuffleReader.fetchContinuousBlocksInBatch`. ### Does this PR introduce any user-facing change? Users will not get the exception related to continuous shuffle block fetching when an old version of the external shuffle service is used. ### How was this patch tested? Existing UT. Closes #26663 from xuanyuanking/SPARK-30025. Authored-by: Yuanjian Li <xyliyuanjian@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-12-02 15:59:12 +08:00
HyukjinKwon	51e69feb49	[SPARK-29851][SQL][FOLLOW-UP] Use foreach instead of misusing map ### What changes were proposed in this pull request? This PR proposes to use foreach instead of misusing map as a small followup of #26476. This could cause some weird errors potentially and it's not a good practice anyway. See also SPARK-16694 ### Why are the changes needed? To avoid potential issues like SPARK-16694 ### Does this PR introduce any user-facing change? No ### How was this patch tested? Existing tests should cover. Closes #26729 from HyukjinKwon/SPARK-29851. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-12-02 13:40:00 +09:00
Yuanjian Li	d1465a1b0d	[SPARK-30074][SQL] The maxNumPostShufflePartitions config should obey reducePostShufflePartitions enabled ### What changes were proposed in this pull request? 1. Make the maxNumPostShufflePartitions config obey the reducePostShufflePartitions config. 2. Update the description for all the SQLConf entries affected by `spark.sql.adaptive.enabled`. ### Why are the changes needed? Make the relation between these confs clearer. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Existing UT. Closes #26664 from xuanyuanking/SPARK-9853-follow. Authored-by: Yuanjian Li <xyliyuanjian@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-12-02 12:37:06 +08:00
wuyi	87ebfaf003	[SPARK-29956][SQL] A literal number with an exponent should be parsed to Double ### What changes were proposed in this pull request? For a literal number with an exponent (e.g. 1e-45, 1E2), we now parse it to Double by default rather than Decimal. Users can still set `spark.sql.legacy.exponentLiteralToDecimal.enabled=true` to fall back to the previous behavior. ### Why are the changes needed? According to the ANSI SQL standard, the (partial) definition of `literal` is: ``` <approximate numeric literal> ::= <mantissa> E <exponent> ``` which indicates that a literal number with an exponent should be approximate numeric (e.g. Double) rather than exact numeric (e.g. Decimal). When we tested Presto, we found that Presto also conforms to this standard: ``` presto:default> select typeof(1E2); _col0 -------- double (1 row) ``` ``` presto:default> select typeof(1.2); _col0 -------------- decimal(2,1) (1 row) ``` We also found that literals like `1E2` were parsed as Double before Spark 2.1, but changed to Decimal after #14828, on the grounds stated there that the difference between the two confuses most users. We also see support (from a DB2 test) for the original behavior at #14828 (comment). PostgreSQL, however, has its own implementation: ``` postgres=# select pg_typeof(1E2); pg_typeof ----------- numeric (1 row) postgres=# select pg_typeof(1.2); pg_typeof ----------- numeric (1 row) ``` We still think that Spark should conform to this standard, considering the SQL standard, Spark's own history, the majority of DBMSs, and user experience. ### Does this PR introduce any user-facing change? Yes. 
For `1E2`, before this PR: ``` scala> spark.sql("select 1E2") res0: org.apache.spark.sql.DataFrame = [1E+2: decimal(1,-2)] ``` After this PR: ``` scala> spark.sql("select 1E2") res0: org.apache.spark.sql.DataFrame = [100.0: double] ``` And for `1E-45`, before this PR: ``` org.apache.spark.sql.catalyst.parser.ParseException: decimal can only support precision up to 38 == SQL == select 1E-45 at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:131) at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:48) at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:76) at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:605) at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111) at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:605) ... 47 elided ``` After this PR: ``` scala> spark.sql("select 1E-45"); res1: org.apache.spark.sql.DataFrame = [1.0E-45: double] ``` Before this PR, users could be surprised to see that `select 1e40` works but `select 1e-40` fails. Now both of them work. ### How was this patch tested? updated `literals.sql.out` and `ansi/literals.sql.out` Closes #26595 from Ngone51/SPARK-29956. Authored-by: wuyi <ngone_5451@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-12-02 11:34:56 +08:00
Yuming Wang	708ab57f37	[SPARK-28461][SQL] Pad Decimal numbers with trailing zeros to the scale of the column ## What changes were proposed in this pull request? [HIVE-12063](https://issues.apache.org/jira/browse/HIVE-12063) improved pad decimal numbers with trailing zeros to the scale of the column. The following description is copied from the description of HIVE-12063. > HIVE-7373 was to address the problems of trimming tailing zeros by Hive, which caused many problems including treating 0.0, 0.00 and so on as 0, which has different precision/scale. Please refer to HIVE-7373 description. However, HIVE-7373 was reverted by HIVE-8745 while the underlying problems remained. HIVE-11835 was resolved recently to address one of the problems, where 0.0, 0.00, and so on cannot be read into decimal(1,1). However, HIVE-11835 didn't address the problem of showing as 0 in query result for any decimal values such as 0.0, 0.00, etc. This causes confusion as 0 and 0.0 have different precision/scale than 0. The proposal here is to pad zeros for query result to the type's scale. This not only removes the confusion described above, but also aligns with many other DBs. Internal decimal number representation doesn't change, however. 
Spark SQL: ```sql // bin/spark-sql spark-sql> select cast(1 as decimal(38, 18)); 1 spark-sql> // bin/beeline 0: jdbc:hive2://localhost:10000/default> select cast(1 as decimal(38, 18)); +----------------------------+--+ | CAST(1 AS DECIMAL(38,18)) | +----------------------------+--+ | 1.000000000000000000 | +----------------------------+--+ // bin/spark-shell scala> spark.sql("select cast(1 as decimal(38, 18))").show(false) +-------------------------+ |CAST(1 AS DECIMAL(38,18))| +-------------------------+ |1.000000000000000000 | +-------------------------+ // bin/pyspark >>> spark.sql("select cast(1 as decimal(38, 18))").show() +-------------------------+ |CAST(1 AS DECIMAL(38,18))| +-------------------------+ | 1.000000000000000000| +-------------------------+ // bin/sparkR > showDF(sql("SELECT cast(1 as decimal(38, 18))")) +-------------------------+ |CAST(1 AS DECIMAL(38,18))| +-------------------------+ | 1.000000000000000000| +-------------------------+ ``` PostgreSQL: ```sql postgres=# select cast(1 as decimal(38, 18)); numeric ---------------------- 1.000000000000000000 (1 row) ``` Presto: ```sql presto> select cast(1 as decimal(38, 18)); _col0 ---------------------- 1.000000000000000000 (1 row) ``` ## How was this patch tested? unit tests and manual test: ```sql spark-sql> select cast(1 as decimal(38, 18)); 1.000000000000000000 ``` Spark SQL Upgrading Guide: ![image](https://user-images.githubusercontent.com/5399861/69649620-4405c380-10a8-11ea-84b1-6ee675663b98.png) Closes #26697 from wangyum/SPARK-28461. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-12-02 09:02:39 +09:00
Dongjoon Hyun	9cd174a7c9	Revert "[SPARK-28461][SQL] Pad Decimal numbers with trailing zeros to the scale of the column" This reverts commit `19af1fe3a2`.	2019-11-27 11:07:08 -08:00
fuwhu	16da714ea5	[SPARK-29979][SQL][FOLLOW-UP] improve the output of DescribeTableExec ### What changes were proposed in this pull request? Refine the output of the "DESC TABLE" command. After this PR, the output of the "DESC TABLE" command looks like this: ``` id bigint data string # Partitioning Part 0 id # Detailed Table Information Name testca.table_name Comment this is a test table Location /tmp/testcat/table_name Provider foo Table Properties [bar=baz] ``` ### Why are the changes needed? Currently, "DESC TABLE" shows reserved properties (e.g. location, comment) in the "Table Property" section. Since reserved properties are different from common properties, it is reasonable to display reserved properties together with the other detailed table information and to display the remaining properties in a single field. This is also consistent with Hive and with the behavior of DescribeTableCommand. ### Does this PR introduce any user-facing change? yes, the output of "DESC TABLE" command is refined as above. ### How was this patch tested? Update existing unit tests. Closes #26677 from fuwhu/SPARK-29979-FOLLOWUP-1. Authored-by: fuwhu <bestwwg@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-11-27 23:16:53 +08:00
Yuming Wang	19af1fe3a2	[SPARK-28461][SQL] Pad Decimal numbers with trailing zeros to the scale of the column ## What changes were proposed in this pull request? [HIVE-12063](https://issues.apache.org/jira/browse/HIVE-12063) improved pad decimal numbers with trailing zeros to the scale of the column. The following description is copied from the description of HIVE-12063. > HIVE-7373 was to address the problems of trimming tailing zeros by Hive, which caused many problems including treating 0.0, 0.00 and so on as 0, which has different precision/scale. Please refer to HIVE-7373 description. However, HIVE-7373 was reverted by HIVE-8745 while the underlying problems remained. HIVE-11835 was resolved recently to address one of the problems, where 0.0, 0.00, and so on cannot be read into decimal(1,1). However, HIVE-11835 didn't address the problem of showing as 0 in query result for any decimal values such as 0.0, 0.00, etc. This causes confusion as 0 and 0.0 have different precision/scale than 0. The proposal here is to pad zeros for query result to the type's scale. This not only removes the confusion described above, but also aligns with many other DBs. Internal decimal number representation doesn't change, however. 
Spark SQL: ```sql // bin/spark-sql spark-sql> select cast(1 as decimal(38, 18)); 1 spark-sql> // bin/beeline 0: jdbc:hive2://localhost:10000/default> select cast(1 as decimal(38, 18)); +----------------------------+--+ | CAST(1 AS DECIMAL(38,18)) | +----------------------------+--+ | 1.000000000000000000 | +----------------------------+--+ // bin/spark-shell scala> spark.sql("select cast(1 as decimal(38, 18))").show(false) +-------------------------+ |CAST(1 AS DECIMAL(38,18))| +-------------------------+ |1.000000000000000000 | +-------------------------+ // bin/pyspark >>> spark.sql("select cast(1 as decimal(38, 18))").show() +-------------------------+ |CAST(1 AS DECIMAL(38,18))| +-------------------------+ | 1.000000000000000000| +-------------------------+ // bin/sparkR > showDF(sql("SELECT cast(1 as decimal(38, 18))")) +-------------------------+ |CAST(1 AS DECIMAL(38,18))| +-------------------------+ | 1.000000000000000000| +-------------------------+ ``` PostgreSQL: ```sql postgres=# select cast(1 as decimal(38, 18)); numeric ---------------------- 1.000000000000000000 (1 row) ``` Presto: ```sql presto> select cast(1 as decimal(38, 18)); _col0 ---------------------- 1.000000000000000000 (1 row) ``` ## How was this patch tested? unit tests and manual test: ```sql spark-sql> select cast(1 as decimal(38, 18)); 1.000000000000000000 ``` Spark SQL Upgrading Guide: ![image](https://user-images.githubusercontent.com/5399861/69649620-4405c380-10a8-11ea-84b1-6ee675663b98.png) Closes #25214 from wangyum/SPARK-28461. Authored-by: Yuming Wang <yumwang@ebay.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-11-27 18:13:33 +09:00
wuyi	a58d91b159	[SPARK-29768][SQL] Column pruning through nondeterministic expressions ### What changes were proposed in this pull request? Support column pruning through non-deterministic expressions. ### Why are the changes needed? In some cases, columns can still be pruned even though non-deterministic expressions appear. E.g. for the plan `Filter('a = 1, Project(Seq('a, rand() as 'r), LogicalRelation('a, 'b)))`, we should still prune column b even though a non-deterministic expression appears. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Added a new test file: `ScanOperationSuite`. Added a test in `FileSourceStrategySuite` to verify the correct pruning behavior for both DS v1 and v2. Closes #26629 from Ngone51/SPARK-29768. Authored-by: wuyi <ngone_5451@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-11-27 15:37:01 +08:00
Kent Yao	4fd585d2c5	[SPARK-30008][SQL] The dataType of collect_list/collect_set aggs should be ArrayType(_, false) ### What changes were proposed in this pull request? ```scala // Do not allow null values. We follow the semantics of Hive's collect_list/collect_set here. // See: org.apache.hadoop.hive.ql.udf.generic.GenericUDAFMkCollectionEvaluator ``` These two functions do not allow null values as they are defined, so their elements should not contain null. ### Why are the changes needed? Casting collect_list(a) to ArrayType(_, false) fails before this fix. ### Does this PR introduce any user-facing change? no ### How was this patch tested? add ut Closes #26651 from yaooqinn/SPARK-30008. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-11-26 20:40:21 -08:00
Kent Yao	ed0c33fdd4	[SPARK-30026][SQL] Whitespaces can be identified as delimiters in interval string ### What changes were proposed in this pull request? We are now able to handle whitespace for integral and fractional types, and leading or trailing whitespace for intervals, dates, and timestamps. But the current interval parser is not able to identify whitespace characters as separators the way PostgreSQL does. This PR makes whitespace handling consistent for interval values. Typed interval literals, multi-unit representation, and casting from strings are all supported. ```sql postgres=# select interval E'1 \t day'; interval ---------- 1 day (1 row) postgres=# select interval E'1\t' day; interval ---------- 1 day (1 row) ``` ### Why are the changes needed? Whitespace handling should be consistent for interval values, and across different types in Spark. PostgreSQL feature parity. ### Does this PR introduce any user-facing change? Yes, interval strings with multi-unit values that are separated by whitespace are now valid. ### How was this patch tested? add ut. Closes #26662 from yaooqinn/SPARK-30026. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-11-27 01:20:38 +08:00
Sean Owen	29018025ba	[SPARK-30009][CORE][SQL] Support different floating-point Ordering for Scala 2.12 / 2.13 ### What changes were proposed in this pull request? Make separate source trees for Scala 2.12/2.13 in order to accommodate mutually-incompatible support for Ordering of double, float. Note: This isn't the last change that will need a split source tree for 2.13. But this particular change could go several ways: - (Split source tree) - Inline the Scala 2.12 implementation - Reflection For this change alone any are possible, and splitting the source tree is a bit overkill. But if it will be necessary for other JIRAs (see umbrella SPARK-25075), then it might be the easiest way to implement this. ### Why are the changes needed? Scala 2.13 split Ordering.Double into Ordering.Double.TotalOrdering and Ordering.Double.IeeeOrdering. Neither can be used in a single build that supports 2.12 and 2.13. TotalOrdering works like java.lang.Double.compare. IeeeOrdering works like Scala 2.12 Ordering.Double. They differ in how NaN is handled - compares always above other values? or always compares as 'false'? In theory they have different uses: TotalOrdering is important if floating-point values are sorted. IeeeOrdering behaves like 2.12 and JVM comparison operators. I chose TotalOrdering as I think we care more about stable sorting, and because elsewhere we rely on java.lang comparisons. It is also possible to support with two methods. ### Does this PR introduce any user-facing change? Pending tests, will see if it obviously affects any sort order. We need to see if it changes NaN sort order. ### How was this patch tested? Existing tests so far. Closes #26654 from srowen/SPARK-30009. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-11-26 08:25:53 -08:00
Huaxin Gao	373c2c3f44	[SPARK-29862][SQL] CREATE (OR REPLACE) ... VIEW should look up catalog/table like v2 commands ### What changes were proposed in this pull request? Add CreateViewStatement and make CREATE VIEW go through the same catalog/table resolution framework as v2 commands. ### Why are the changes needed? It's important to make all the commands have the same table resolution behavior, to avoid confusing end-users. e.g. ``` USE my_catalog DESC v // success and describe the view v from my_catalog CREATE VIEW v AS SELECT 1 // report view not found as there is no view v in the session catalog ``` ### Does this PR introduce any user-facing change? Yes. When running CREATE VIEW ... Spark fails the command if the current catalog is set to a v2 catalog, or the view name specifies a v2 catalog. ### How was this patch tested? unit tests Closes #26649 from huaxingao/spark-29862. Authored-by: Huaxin Gao <huaxing@us.ibm.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-11-26 14:10:46 +08:00
Kent Yao	8b0121bea8	[MINOR][DOC] Fix the CalendarIntervalType description ### What changes were proposed in this pull request? fix the outdated and incorrect description of CalendarIntervalType ### Why are the changes needed? api doc correctness ### Does this PR introduce any user-facing change? no ### How was this patch tested? no Closes #26659 from yaooqinn/intervaldoc. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-11-26 12:49:56 +08:00
fuwhu	29ebd9336c	[SPARK-29979][SQL] Add basic/reserved property key constants in TableCatalog and SupportsNamespaces ### What changes were proposed in this pull request? Add "comment" and "location" property key constants in TableCatalog and SupportsNamespaces, and update the implementation classes to use these constants instead of hard-coded strings. ### Why are the changes needed? Currently, some basic/reserved keys (e.g. "location", "comment") of table and namespace properties are hard-coded or defined in a specific logical plan implementation class. These keys can be centralized in the TableCatalog and SupportsNamespaces interfaces and shared across different implementation classes. ### Does this PR introduce any user-facing change? no ### How was this patch tested? Existing unit test Closes #26617 from fuwhu/SPARK-29979. Authored-by: fuwhu <bestwwg@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-11-26 01:24:43 +08:00
Kent Yao	de21f28f8a	[SPARK-29986][SQL] casting string to date/timestamp/interval should trim all whitespaces ### What changes were proposed in this pull request? A Java-like string trim method trims all whitespace characters that are less than or equal to 0x20. Currently, our UTF8String handles ONLY the space character (0x20). This is not suitable for many cases in Spark, such as trimming interval strings, dates, and timestamps, or the PostgreSQL-like cast from string to boolean. ### Why are the changes needed? improve the whitespace handling in UTF8String, also with some bugs fixed ### Does this PR introduce any user-facing change? yes, a string with a `control character` at either end can now be converted to date/timestamp and interval ### How was this patch tested? add ut Closes #26626 from yaooqinn/SPARK-29986. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-11-25 14:37:04 +08:00
Kent Yao	5cf475d288	[SPARK-30000][SQL] Trim the string when cast string type to decimals ### What changes were proposed in this pull request? https://bugs.openjdk.java.net/browse/JDK-8170259 https://bugs.openjdk.java.net/browse/JDK-8170563 When we cast string type to decimal type, we rely on java.math.BigDecimal. It can't accept leading and trailing spaces, as you can see in the above links. This behavior is not consistent with other numeric types now. We need to fix it and keep consistency. ### Why are the changes needed? make casts from string to numeric types consistent ### Does this PR introduce any user-facing change? yes, a string with leading or trailing white spaces can now be converted to decimal if the trimmed string is valid ### How was this patch tested? 1. modify ut #### Benchmark ```scala /* * Licensed to the Apache Software Foundation (ASF) under one or more * contributor license agreements. See the NOTICE file distributed with * this work for additional information regarding copyright ownership. * The ASF licenses this file to You under the Apache License, Version 2.0 * (the "License"); you may not use this file except in compliance with * the License. You may obtain a copy of the License at * * http://www.apache.org/licenses/LICENSE-2.0 * * Unless required by applicable law or agreed to in writing, software * distributed under the License is distributed on an "AS IS" BASIS, * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. * See the License for the specific language governing permissions and * limitations under the License. */ package org.apache.spark.sql.execution.benchmark import org.apache.spark.benchmark.Benchmark /* * Benchmark trim the string when casting string type to Boolean/Numeric types. * To run this benchmark: * {{{ * 1. without sbt: * bin/spark-submit --class <this class> --jars <spark core test jar> <spark sql test jar> * 2. build/sbt "sql/test:runMain <this class>" * 3. 
generate result: SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain <this class>" * Results will be written to "benchmarks/CastBenchmark-results.txt". * }}} */ object CastBenchmark extends SqlBasedBenchmark { override def runBenchmarkSuite(mainArgs: Array[String]): Unit = { val title = "Cast String to Integral" runBenchmark(title) { withTempPath { dir => val N = 500L << 14 val df = spark.range(N) val types = Seq("decimal") (1 to 5).by(2).foreach { i => df.selectExpr(s"concat(id, '${" " * i}') as str") .write.mode("overwrite").parquet(dir + i.toString) } val benchmark = new Benchmark(title, N, minNumIters = 5, output = output) Seq(true, false).foreach { trim => types.foreach { t => val str = if (trim) "trim(str)" else "str" val expr = s"cast($str as $t) as c_$t" (1 to 5).by(2).foreach { i => benchmark.addCase(expr + s" - with $i spaces") { _ => spark.read.parquet(dir + i.toString).selectExpr(expr).collect() } } } } benchmark.run() } } } } ``` #### string trim vs not trim ```java [info] Java HotSpot(TM) 64-Bit Server VM 1.8.0_231-b11 on Mac OS X 10.15.1 [info] Intel(R) Core(TM) i9-9980HK CPU 2.40GHz [info] Cast String to Integral: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] cast(trim(str) as decimal) as c_decimal - with 1 spaces 3362 5486 NaN 2.4 410.4 1.0X [info] cast(trim(str) as decimal) as c_decimal - with 3 spaces 3251 5655 NaN 2.5 396.8 1.0X [info] cast(trim(str) as decimal) as c_decimal - with 5 spaces 3208 5725 NaN 2.6 391.7 1.0X [info] cast(str as decimal) as c_decimal - with 1 spaces 13962 16233 1354 0.6 1704.3 0.2X [info] cast(str as decimal) as c_decimal - with 3 spaces 14273 14444 179 0.6 1742.4 0.2X [info] cast(str as decimal) as c_decimal - with 5 spaces 14318 14535 125 0.6 1747.8 0.2X ``` #### string trim vs this fix ```java [info] Java HotSpot(TM) 64-Bit Server VM 1.8.0_231-b11 on Mac OS 
X 10.15.1 [info] Intel(R) Core(TM) i9-9980HK CPU 2.40GHz [info] Cast String to Integral: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative [info] ------------------------------------------------------------------------------------------------------------------------ [info] cast(trim(str) as decimal) as c_decimal - with 1 spaces 3265 6299 NaN 2.5 398.6 1.0X [info] cast(trim(str) as decimal) as c_decimal - with 3 spaces 3183 6241 693 2.6 388.5 1.0X [info] cast(trim(str) as decimal) as c_decimal - with 5 spaces 3167 5923 1151 2.6 386.7 1.0X [info] cast(str as decimal) as c_decimal - with 1 spaces 3161 5838 1126 2.6 385.9 1.0X [info] cast(str as decimal) as c_decimal - with 3 spaces 3046 3457 837 2.7 371.8 1.1X [info] cast(str as decimal) as c_decimal - with 5 spaces 3053 4445 NaN 2.7 372.7 1.1X [info] ``` Closes #26640 from yaooqinn/SPARK-30000. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-11-25 12:47:07 +08:00
Sean Owen	13896e4eae	[SPARK-30013][SQL] For scala 2.13, omit parens in various BigDecimal value() methods ### What changes were proposed in this pull request? Omit parens on calls like BigDecimal.longValue() ### Why are the changes needed? For some reason, this won't compile in Scala 2.13. The calls are otherwise equivalent in 2.12. ### Does this PR introduce any user-facing change? No ### How was this patch tested? Existing tests Closes #26653 from srowen/SPARK-30013. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-11-24 18:23:34 -08:00
Takeshi Yamamuro	3f3a18fff1	[SPARK-24690][SQL] Add a config to control plan stats computation in LogicalRelation ### What changes were proposed in this pull request? This PR proposes a new independent config so that `LogicalRelation` can use `rowCount` to compute data statistics in logical plans even if CBO is disabled. In the master, we currently cannot enable `StarSchemaDetection.reorderStarJoins` because we need to turn off CBO to enable it, but `StarSchemaDetection` internally references the `rowCount` that is not used in LogicalRelation if CBO is disabled. ### Why are the changes needed? Plan stats are useful beyond CBO, e.g., for the star-schema detector and dynamic partition pruning. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Added tests in `DataFrameJoinSuite`. Closes #21668 from maropu/PlanStatsConf. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-11-24 08:30:24 -08:00
Dongjoon Hyun	13338eaa95	[SPARK-29554][SQL][FOLLOWUP] Rename Version to SparkVersion ### What changes were proposed in this pull request? This is a follow-up of https://github.com/apache/spark/pull/26209 . This renames class `Version` to class `SparkVersion`. ### Why are the changes needed? According to the review comment, this uses more specific class name. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Pass the Jenkins with the existing tests. Closes #26647 from dongjoon-hyun/SPARK-29554. Authored-by: Dongjoon Hyun <dhyun@apple.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-11-23 19:53:52 -08:00
HyukjinKwon	fc7a37b147	[SPARK-30003][SQL] Do not throw stack overflow exception in non-root unknown hint resolution ### What changes were proposed in this pull request? This is a followup of https://github.com/apache/spark/pull/25464 (see https://github.com/apache/spark/pull/25464/files#r349543286). The current code causes an infinite recursion via mapping children: we should return the hint rather than its parent plan in unknown hint resolution. ### Why are the changes needed? Prevent stack overflow during hint resolution. ### Does this PR introduce any user-facing change? Yes, it avoids the stack overflow exception. The exception was caused by https://github.com/apache/spark/pull/25464 and exists only in the master, so there are no behaviour changes to end users. ### How was this patch tested? Unittest was added. Closes #26642 from HyukjinKwon/SPARK-30003. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-11-23 17:24:56 +09:00
Wenchen Fan	6e581cf164	[SPARK-29893][SQL][FOLLOWUP] code cleanup for local shuffle reader ### What changes were proposed in this pull request? A few cleanups for https://github.com/apache/spark/pull/26516: 1. move the calculating of partition start indices from the RDD to the rule. We can reuse code from "shrink number of reducers" in the future if we split partitions by size. 2. only check extra shuffles when adding local readers to the probe side. 3. add comments. 4. simplify the config name: `optimizedLocalShuffleReader` -> `localShuffleReader` ### Why are the changes needed? make code more maintainable. ### Does this PR introduce any user-facing change? no ### How was this patch tested? existing tests Closes #26625 from cloud-fan/aqe. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Xiao Li <gatorsmile@gmail.com>	2019-11-22 10:26:54 -08:00
Wenchen Fan	6b4b6a87cd	[SPARK-29558][SQL] ResolveTables and ResolveRelations should be order-insensitive ### What changes were proposed in this pull request? Make `ResolveRelations` call `ResolveTables` at the beginning, and make `ResolveTables` call `ResolveTempViews`(newly added) at the beginning, to ensure the relation resolution priority. ### Why are the changes needed? To resolve an `UnresolvedRelation`, the general process is: 1. try to resolve to (global) temp view first. If it's not a temp view, move on 2. if the table name specifies a catalog, lookup the table from the specified catalog. Otherwise, lookup table from the current catalog. 3. when looking up table from session catalog, return a v1 relation if the table provider is v1. Currently, this process is done by 2 rules: `ResolveTables` and `ResolveRelations`. To avoid rule conflicts, we add a lot of checks: 1. `ResolveTables` only resolves `UnresolvedRelation` if it's not a temp view and the resolved table is not v1. 2. `ResolveRelations` only resolves `UnresolvedRelation` if the table name has less than 2 parts. This requires to run `ResolveTables` before `ResolveRelations`, otherwise we may resolve a v2 table to a v1 relation. To clearly guarantee the resolution priority, and avoid massive changes, this PR proposes to call one rule in another rule to ensure the rule execution order. Now the process is simple: 1. first run `ResolveTempViews`, see if we can resolve relation to temp view 2. then run `ResolveTables`, see if we can resolve relation to v2 tables. 3. finally run `ResolveRelations`, see if we can resolve relation to v1 tables. ### Does this PR introduce any user-facing change? no ### How was this patch tested? existing tests Closes #26214 from cloud-fan/resolve. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Ryan Blue <blue@apache.org>	2019-11-21 09:47:42 -08:00
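The resolution order described in the SPARK-29558 entry above (temp views first, then v2 tables, then v1 relations, with each rule calling the previous one) can be sketched in plain Python. This is an illustrative model only, not Spark's analyzer code: the three dictionaries and all function names are hypothetical stand-ins for the catalogs and rules named in the commit message.

```python
# Illustrative model only; every name here is hypothetical, not a Spark API.
from typing import Optional

temp_views = {"v": "TempView(v)"}
v2_tables = {"cat.ns.t": "V2Relation(cat.ns.t)"}
# "v" also exists as a v1 table to show that the temp view wins.
v1_tables = {"t": "V1Relation(t)", "v": "V1Relation(v)"}


def resolve_temp_views(name: str) -> Optional[str]:
    # Step 1: try to resolve the name to a temp view.
    return temp_views.get(name)


def resolve_tables(name: str) -> Optional[str]:
    # Step 2: calls step 1 first, then tries v2 tables.
    return resolve_temp_views(name) or v2_tables.get(name)


def resolve_relations(name: str) -> Optional[str]:
    # Step 3: calls step 2 first, then falls back to v1 tables.
    return resolve_tables(name) or v1_tables.get(name)
```

Because each function invokes the earlier step itself, the priority holds no matter which entry point a caller uses, which is the order-insensitivity the commit title refers to.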
Ximo Guanter	54c5087a3a	[SPARK-29248][SQL] provider number of partitions when creating v2 data writer factory ### What changes were proposed in this pull request? When implementing a ScanBuilder, we require the implementor to provide the schema of the data and the number of partitions. However, when someone is implementing WriteBuilder we only pass them the schema, but not the number of partitions. This is an asymmetrical developer experience. This PR adds a PhysicalWriteInfo interface that is passed to createBatchWriterFactory and createStreamingWriterFactory and that carries the number of partitions of the data that is going to be written. ### Why are the changes needed? Passing in the number of partitions on the WriteBuilder would enable data sources to provision their write targets before starting to write. For example: it could be used to provision a Kafka topic with a specific number of partitions; it could be used to scale a microservice prior to sending the data to it; it could be used to create a DsV2 that sends the data to another Spark cluster (currently not possible since the reader wouldn't be able to know the number of partitions). ### Does this PR introduce any user-facing change? No ### How was this patch tested? Tests passed Closes #26591 from edrevo/temp. Authored-by: Ximo Guanter <joaquin.guantergonzalbez@telefonica.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-11-22 00:19:25 +08:00
Takeshi Yamamuro	cdcd43cbf2	[SPARK-29977][SQL] Remove newMutableProjection/newOrdering/newNaturalAscendingOrdering from SparkPlan ### What changes were proposed in this pull request? This is to refactor `SparkPlan` code; it mainly removed `newMutableProjection`/`newOrdering`/`newNaturalAscendingOrdering` from `SparkPlan`. The other modifications are listed below; - Move `BaseOrdering` from `o.a.s.sqlcatalyst.expressions.codegen.GenerateOrdering.scala` to `o.a.s.sqlcatalyst.expressions.ordering.scala` - `RowOrdering` extends `CodeGeneratorWithInterpretedFallback ` for `BaseOrdering` - Remove the unused variables (`subexpressionEliminationEnabled` and `codeGenFallBack`) from `SparkPlan` ### Why are the changes needed? For better code/test coverage. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Existing. Closes #26615 from maropu/RefactorOrdering. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-11-21 23:51:12 +08:00
Kent Yao	7a70670345	[SPARK-29961][SQL] Implement builtin function - typeof ### What changes were proposed in this pull request? Add typeof function for Spark to get the underlying type of value. ```sql -- !query 0 select typeof(1) -- !query 0 schema struct<typeof(1):string> -- !query 0 output int -- !query 1 select typeof(1.2) -- !query 1 schema struct<typeof(1.2):string> -- !query 1 output decimal(2,1) -- !query 2 select typeof(array(1, 2)) -- !query 2 schema struct<typeof(array(1, 2)):string> -- !query 2 output array<int> -- !query 3 select typeof(a) from (values (1), (2), (3.1)) t(a) -- !query 3 schema struct<typeof(a):string> -- !query 3 output decimal(11,1) decimal(11,1) decimal(11,1) ``` ##### presto ```sql presto> select typeof(array[1]); _col0 ---------------- array(integer) (1 row) ``` ##### PostgreSQL ```sql postgres=# select pg_typeof(a) from (values (1), (2), (3.0)) t(a); pg_typeof ----------- numeric numeric numeric (3 rows) ``` ##### impala https://issues.apache.org/jira/browse/IMPALA-1597 ### Why are the changes needed? a function which is better we have to help us debug, test, develop ... ### Does this PR introduce any user-facing change? add a new function ### How was this patch tested? add ut and example Closes #26599 from yaooqinn/SPARK-29961. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-11-21 10:28:32 +09:00
LantaoJin	06e203b856	[SPARK-29911][SQL] Uncache cached tables when session closed ### What changes were proposed in this pull request? The local temporary view is session-scoped. Its lifetime is the lifetime of the session that created it. But cached data is currently cross-session. Its lifetime is the lifetime of the Spark application. This causes a memory leak if a local temporary view is cached in memory when the session is closed. In this PR, we uncache the cached data of a local temporary view when the session is closed. This PR doesn't impact the cached data of global temp views and persisted views. How to reproduce: 1. create a local temporary view v1 2. cache it in memory 3. close the session without dropping table v1. The application will hold the memory forever. In a long-running thrift server scenario, it is worse. ```shell 0: jdbc:hive2://localhost:10000> CACHE TABLE testCacheTable AS SELECT 1; CACHE TABLE testCacheTable AS SELECT 1; +---------+--+ \| Result \| +---------+--+ +---------+--+ No rows selected (1.498 seconds) 0: jdbc:hive2://localhost:10000> !close !close Closing: 0: jdbc:hive2://localhost:10000 0: jdbc:hive2://localhost:10000 (closed)> !connect 'jdbc:hive2://localhost:10000' !connect 'jdbc:hive2://localhost:10000' Connecting to jdbc:hive2://localhost:10000 Enter username for jdbc:hive2://localhost:10000: lajin Enter password for jdbc:hive2://localhost:10000: *** Connected to: Spark SQL (version 3.0.0-SNAPSHOT) Driver: Hive JDBC (version 1.2.1.spark2) Transaction isolation: TRANSACTION_REPEATABLE_READ 1: jdbc:hive2://localhost:10000> select * from testCacheTable; select * from testCacheTable; Error: Error running query: org.apache.spark.sql.AnalysisException: Table or view not found: testCacheTable; line 1 pos 14; 'Project [*] +- 'UnresolvedRelation [testCacheTable] (state=,code=0) ``` ### Why are the changes needed? Resolve memory leak for thrift server ### Does this PR introduce any user-facing change? No ### How was this patch tested? Manual test in the UI storage tab, plus a new UT. Closes #26543 from LantaoJin/SPARK-29911. Authored-by: LantaoJin <jinlantao@gmail.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-11-20 18:19:30 -06:00
Sean Owen	1febd373ea	[MINOR][TESTS] Replace JVM assert with JUnit Assert in tests ### What changes were proposed in this pull request? Use JUnit assertions in tests uniformly, not JVM assert() statements. ### Why are the changes needed? assert() statements do not produce as useful errors when they fail, and, if they were somehow disabled, would fail to test anything. ### Does this PR introduce any user-facing change? No. The assertion logic should be identical. ### How was this patch tested? Existing tests. Closes #26581 from srowen/assertToJUnit. Authored-by: Sean Owen <sean.owen@databricks.com> Signed-off-by: Sean Owen <sean.owen@databricks.com>	2019-11-20 14:04:15 -06:00
Yuanjian Li	23b3c4fafd	[SPARK-29951][SQL] Make the behavior of Postgre dialect independent of ansi mode config ### What changes were proposed in this pull request? Fix the inconsistent behavior of the built-in SQL functions LEFT/RIGHT. ### Why are the changes needed? As the comment in https://github.com/apache/spark/pull/26497#discussion_r345708065 says, the Postgre dialect should not be affected by the ANSI mode config. When rerunning the existing tests, only the LEFT/RIGHT built-in SQL functions broke the assumption. We fix this by following https://www.postgresql.org/docs/12/sql-keywords-appendix.html: `LEFT/RIGHT reserved (can be function or type)` ### Does this PR introduce any user-facing change? Yes, the Postgre dialect will not be affected by the ANSI mode config. ### How was this patch tested? Existing UT. Closes #26584 from xuanyuanking/SPARK-29951. Authored-by: Yuanjian Li <xyliyuanjian@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-11-21 00:56:48 +08:00
Takeshi Yamamuro	0032d85153	[SPARK-29968][SQL] Remove the Predicate code from SparkPlan ### What changes were proposed in this pull request? This is to refactor Predicate code; it mainly removed `newPredicate` from `SparkPlan`. Modifications are listed below; - Move `Predicate` from `o.a.s.sqlcatalyst.expressions.codegen.GeneratePredicate.scala` to `o.a.s.sqlcatalyst.expressions.predicates.scala` - To resolve the name conflict, rename `o.a.s.sqlcatalyst.expressions.codegen.Predicate` to `o.a.s.sqlcatalyst.expressions.BasePredicate` - Extend `CodeGeneratorWithInterpretedFallback ` for `BasePredicate` This comes from the cloud-fan suggestion: https://github.com/apache/spark/pull/26420#discussion_r348005497 ### Why are the changes needed? For better code/test coverage. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Existing tests. Closes #26604 from maropu/RefactorPredicate. Authored-by: Takeshi Yamamuro <yamamuro@apache.org> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-11-20 21:13:51 +08:00
Nikita Konda	5a70af7a6c	[SPARK-29029][SQL] Use AttributeMap in PhysicalOperation.collectProjectsAndFilters ### What changes were proposed in this pull request? This PR fixes the issue of substituting aliases while collecting filters in `PhysicalOperation.collectProjectsAndFilters`. When the `AttributeReference` in alias map differs from the `AttributeReference` in filter condition only in qualifier, it does not substitute alias and throws exception saying `key videoid#47L not found` in the following scenario. ``` [1] Project [userid#0] +- [2] Filter (isnotnull(videoid#47L) && NOT (videoid#47L = 30)) +- [3] Project [factorial(videoid#1) AS videoid#47L, userid#0] +- [4] Filter (isnotnull(avebitrate#2) && (avebitrate#2 < 10)) +- [5] Relation[userid#0,videoid#1,avebitrate#2] ``` ### Why are the changes needed? We need to use `AttributeMap` where the key is `AttributeReference`'s `ExprId` instead of `Map[Attribute, Expression]` while collecting and substituting aliases in `PhysicalOperation.collectProjectsAndFilters`. ### Does this PR introduce any user-facing change? No ### How was this patch tested? New unit tests were added in `TestPhysicalOperation` which reproduces the bug Closes #25761 from nikitagkonda/SPARK-29029-use-attributemap-for-aliasmap-in-physicaloperation. Authored-by: Nikita Konda <nikita.konda@workday.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-11-19 20:01:42 -08:00
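The lookup failure described in the SPARK-29029 entry above can be illustrated with a small Python model. It is a sketch under stated assumptions, not Catalyst code: the `Attr` class, its `qualifier` field, and both helper functions are hypothetical, and a plain `dict` stands in for `Map[Attribute, Expression]` and `AttributeMap`.

```python
# Illustrative model only; Attr and both helpers are hypothetical names.
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Attr:
    name: str
    expr_id: int
    qualifier: Tuple[str, ...] = ()


# The alias was recorded without a qualifier...
alias_attr = Attr("videoid", 47)
# ...but the filter refers to the same expression id with a qualifier.
filter_attr = Attr("videoid", 47, qualifier=("t",))

# Keyed by the whole attribute: equality includes the qualifier.
by_attribute = {alias_attr: "factorial(videoid#1)"}
# Keyed by expression id only, which is the idea behind the fix.
by_expr_id = {alias_attr.expr_id: "factorial(videoid#1)"}


def substitute_by_attribute(attr: Attr) -> Optional[str]:
    # Returns None here; the commit reports a "key not found" error instead.
    return by_attribute.get(attr)


def substitute_by_expr_id(attr: Attr) -> Optional[str]:
    return by_expr_id.get(attr.expr_id)
```

With these definitions, `substitute_by_attribute(filter_attr)` misses because the two attributes differ in qualifier, while `substitute_by_expr_id(filter_attr)` finds the aliased expression.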
Wenchen Fan	9e58b10c8e	[SPARK-29945][SQL] do not handle negative sign specially in the parser ### What changes were proposed in this pull request? Remove the special handling of the negative sign in the parser (interval literal and type constructor) ### Why are the changes needed? The negative sign is an operator (UnaryMinus). We don't need to handle it specially, which is kind of doing constant folding at parser side. ### Does this PR introduce any user-facing change? The error message becomes a little different. Now it reports type mismatch for the `-` operator. ### How was this patch tested? existing tests Closes #26578 from cloud-fan/interval. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>	2019-11-20 11:08:04 +09:00
Maxim Gekk	40b8a08b8b	[SPARK-29963][SQL][TESTS] Check formatting timestamps up to microsecond precision by JSON/CSV datasource ### What changes were proposed in this pull request? In the PR, I propose to add tests from the commit `47cb1f359a` for Spark 2.4 that check formatting of timestamp strings for various seconds fractions. ### Why are the changes needed? To make sure that current behavior is the same as in Spark 2.4 ### Does this PR introduce any user-facing change? No ### How was this patch tested? By running `CSVSuite`, `JsonFunctionsSuite` and `TimestampFormatterSuite`. Closes #26601 from MaxGekk/format-timestamp-micros-tests. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-11-20 10:34:25 +09:00
Kent Yao	79ed4ae2db	[SPARK-29926][SQL] Fix weird interval string whose value is only a dangling decimal point ### What changes were proposed in this pull request? Currently, we parse '1. second' as 1s and even '. second' as 0s. ```sql -- !query 118 select interval '1. seconds' -- !query 118 schema struct<1 seconds:interval> -- !query 118 output 1 seconds -- !query 119 select interval '. seconds' -- !query 119 schema struct<0 seconds:interval> -- !query 119 output 0 seconds ``` ```sql postgres=# select interval '1. second'; ERROR: invalid input syntax for type interval: "1. second" LINE 1: select interval '1. second'; postgres=# select interval '. second'; ERROR: invalid input syntax for type interval: ". second" LINE 1: select interval '. second'; ``` We fix this by fixing the new interval parser's VALUE_FRACTIONAL_PART state. With further digging, we found that 1. is valid in Python, R, Scala, Presto and so on, so this PR only forbids the invalid interval value in the form of '. seconds'. ### Why are the changes needed? bug fix ### Does this PR introduce any user-facing change? yes, now we treat '. second' .... as invalid intervals ### How was this patch tested? add ut Closes #26573 from yaooqinn/SPARK-29926. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-11-19 21:01:26 +08:00
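The rule stated in the SPARK-29926 entry above (a trailing decimal point such as `1.` stays valid, a lone `.` is rejected) can be sketched with a regular expression. This is a hypothetical validator, not the state machine in Spark's interval parser; the function name is invented, and accepting a leading-dot value such as `.5` is an assumption that the commit message does not confirm.

```python
# Hypothetical sketch; Spark's parser is a hand-written state machine.
import re

# Either digits with an optional fractional part, or a dot followed by digits.
# A bare "." matches neither alternative.
_NUMBER = re.compile(r"[+-]?(\d+\.?\d*|\.\d+)")


def is_valid_interval_number(token: str) -> bool:
    """Return True if the numeric token contains at least one digit."""
    return _NUMBER.fullmatch(token) is not None
```

Requiring at least one digit on either side of the dot is what separates `1.` from `.` in this sketch.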
Wenchen Fan	16134d6d0f	[SPARK-29948][SQL] make the default alias consistent between date, timestamp and interval ### What changes were proposed in this pull request? Update `Literal.sql` to make date, timestamp and interval consistent. They should all use the `TYPE 'value'` format. ### Why are the changes needed? Make the default alias consistent. For example, without this patch we will see ``` scala> sql("select interval '1 day', date '2000-10-10'").show +------+-----------------+ \|1 days\|DATE '2000-10-10'\| +------+-----------------+ \|1 days\| 2000-10-10\| +------+-----------------+ ``` ### Does this PR introduce any user-facing change? no ### How was this patch tested? existing tests Closes #26579 from cloud-fan/sql. Authored-by: Wenchen Fan <wenchen@databricks.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-11-19 15:37:35 +08:00
Terry Kim	3d45779b68	[SPARK-29728][SQL] Datasource V2: Support ALTER TABLE RENAME TO ### What changes were proposed in this pull request? This PR adds `ALTER TABLE a.b.c RENAME TO x.y.x` support for V2 catalogs. ### Why are the changes needed? The current implementation doesn't support this command V2 catalogs. ### Does this PR introduce any user-facing change? Yes, now the renaming table works for v2 catalogs: ``` scala> spark.sql("SHOW TABLES IN testcat.ns1.ns2").show +---------+---------+ \|namespace\|tableName\| +---------+---------+ \| ns1.ns2\| old\| +---------+---------+ scala> spark.sql("ALTER TABLE testcat.ns1.ns2.old RENAME TO testcat.ns1.ns2.new").show scala> spark.sql("SHOW TABLES IN testcat.ns1.ns2").show +---------+---------+ \|namespace\|tableName\| +---------+---------+ \| ns1.ns2\| new\| +---------+---------+ ``` ### How was this patch tested? Added unit tests. Closes #26539 from imback82/rename_table. Authored-by: Terry Kim <yuminkim@gmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-11-19 12:03:29 +08:00
HyukjinKwon	882f54b0a3	[SPARK-29870][SQL][FOLLOW-UP] Keep CalendarInterval's toString ### What changes were proposed in this pull request? This is a followup of https://github.com/apache/spark/pull/26418. This PR removed `CalendarInterval`'s `toString` with an unfinished changes. ### Why are the changes needed? 1. Ideally we should make each PR isolated and separate targeting one issue without touching unrelated codes. 2. There are some other places where the string formats were exposed to users. For example: ```scala scala> sql("select interval 1 days as a").selectExpr("to_csv(struct(a))").show() ``` ``` +--------------------------+ \|to_csv(named_struct(a, a))\| +--------------------------+ \| "CalendarInterval...\| +--------------------------+ ``` 3. Such fixes: ```diff private def writeMapData( map: MapData, mapType: MapType, fieldWriter: ValueWriter): Unit = { val keyArray = map.keyArray() + val keyString = mapType.keyType match { + case CalendarIntervalType => + (i: Int) => IntervalUtils.toMultiUnitsString(keyArray.getInterval(i)) + case _ => (i: Int) => keyArray.get(i, mapType.keyType).toString + } ``` can cause performance regression due to type dispatch for each map. ### Does this PR introduce any user-facing change? Yes, see 2. case above. ### How was this patch tested? Manually tested. Closes #26572 from HyukjinKwon/SPARK-29783. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-11-19 09:11:41 +09:00
HyukjinKwon	8469614c05	[SPARK-25694][SQL][FOLLOW-UP] Move 'spark.sql.defaultUrlStreamHandlerFactory.enabled' into StaticSQLConf.scala ### What changes were proposed in this pull request? This PR is a followup of https://github.com/apache/spark/pull/26530 and proposes to move the configuration `spark.sql.defaultUrlStreamHandlerFactory.enabled` to `StaticSQLConf.scala` for consistency. ### Why are the changes needed? To put the similar configurations together and for readability. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? Manually tested as described in https://github.com/apache/spark/pull/26530. Closes #26570 from HyukjinKwon/SPARK-25694. Authored-by: HyukjinKwon <gurwls223@apache.org> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-11-19 09:08:20 +09:00
Kent Yao	ae6b711b26	[SPARK-29941][SQL] Add ansi type aliases for char and decimal ### What changes were proposed in this pull request? Checked with SQL Standard and PostgreSQL > CHAR is equivalent to CHARACTER. DEC is equivalent to DECIMAL. INT is equivalent to INTEGER. VARCHAR is equivalent to CHARACTER VARYING. ... ```sql postgres=# select dec '1.0'; numeric --------- 1.0 (1 row) postgres=# select CHARACTER '. second'; bpchar ---------- . second (1 row) postgres=# select CHAR '. second'; bpchar ---------- . second (1 row) ``` ### Why are the changes needed? For better ansi support ### Does this PR introduce any user-facing change? yes, we add character as char and dec as decimal ### How was this patch tested? add ut Closes #26574 from yaooqinn/SPARK-29941. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-11-18 23:30:31 +08:00
fuwhu	c32e228689	[SPARK-29859][SQL] ALTER DATABASE (SET LOCATION) should look up catalog like v2 commands ### What changes were proposed in this pull request? Add AlterNamespaceSetLocationStatement, AlterNamespaceSetLocation, AlterNamespaceSetLocationExec to make ALTER DATABASE (SET LOCATION) look up catalog like v2 commands. And also refine the code of AlterNamespaceSetProperties, AlterNamespaceSetPropertiesExec, DescribeNamespace, DescribeNamespaceExec to use SupportsNamespaces instead of CatalogPlugin for catalog parameter. ### Why are the changes needed? It's important to make all the commands have the same catalog/namespace resolution behavior, to avoid confusing end-users. ### Does this PR introduce any user-facing change? Yes, add "ALTER NAMESPACE ... SET LOCATION" whose function is same as "ALTER DATABASE ... SET LOCATION" and "ALTER SCHEMA ... SET LOCATION". ### How was this patch tested? New unit tests Closes #26562 from fuwhu/SPARK-29859. Authored-by: fuwhu <bestwwg@163.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-11-18 20:40:23 +08:00
Kent Yao	50f6d930da	[SPARK-29870][SQL] Unify the logic of multi-units interval string to CalendarInterval ### What changes were proposed in this pull request? We now have two different implementations for converting multi-units interval strings to CalendarInterval type values. One is used to convert interval string literals to CalendarInterval. This approach re-delegates the interval string to the Spark parser, which handles the string as a `singleInterval` -> `multiUnitsInterval` -> eventually calls `IntervalUtils.fromUnitStrings`. The other is used in `Cast`, which eventually calls `IntervalUtils.stringToInterval`. This approach is ~10 times faster than the other. We should unify these two for better performance and simpler logic. This PR uses the 2nd approach. ### Why are the changes needed? We should unify these two for better performance and simpler logic. ### Does this PR introduce any user-facing change? no ### How was this patch tested? Existing UTs should not fail. Closes #26491 from yaooqinn/SPARK-29870. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-11-18 15:50:06 +08:00
Kent Yao	5cebe587c7	[SPARK-29783][SQL] Support SQL Standard/ISO_8601 output style for interval type ### What changes were proposed in this pull request? Add 3 interval output styles named `SQL_STANDARD`, `ISO_8601` and `MULTI_UNITS`, and add a new conf `spark.sql.dialect.intervalOutputStyle` to select among them. The `MULTI_UNITS` style displays interval values as before and is the default. The newly added `SQL_STANDARD` and `ISO_8601` styles are shown in the following table. Style \| conf \| Year-Month Interval \| Day-Time Interval \| Mixed Interval -- \| -- \| -- \| -- \| -- Format With Time Unit Designators \| MULTI_UNITS \| 1 year 2 mons \| 1 days 2 hours 3 minutes 4.123456 seconds \| interval 1 days 2 hours 3 minutes 4.123456 seconds SQL STANDARD \| SQL_STANDARD \| 1-2 \| 3 4:05:06 \| -1-2 3 -4:05:06 ISO8601 Basic Format\| ISO_8601\| P1Y2M\| P3DT4H5M6S\|P-1Y-2M3D-4H-5M-6S ### Why are the changes needed? for ANSI SQL support ### Does this PR introduce any user-facing change? yes, interval output now has 3 output styles ### How was this patch tested? add new unit tests cc cloud-fan maropu MaxGekk HyukjinKwon thanks. Closes #26418 from yaooqinn/SPARK-29783. Authored-by: Kent Yao <yaooqinn@hotmail.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-11-18 15:42:22 +08:00
gschiavon	73912379d0	[SPARK-29020][SQL] Improving array_sort behaviour ### What changes were proposed in this pull request? I've noticed that there are two functions to sort arrays: sort_array and array_sort. sort_array is from 1.5.0 and can order both ascending and descending; array_sort is from 2.4.0 and can only order ascending. Basically I just added the possibility of ordering either ascending or descending using array_sort. I think it would be good to have unified behaviours and not have to use sort_array when you want to order in descending order. Imagine that you are new to Spark; I'd like to be able to sort arrays using the newest Spark functions. ### Why are the changes needed? Basically to be able to sort the array in descending order using array_sort instead of using sort_array from 1.5.0 ### Does this PR introduce any user-facing change? Yes, now you are able to sort the array in descending order. Note that it has the same behaviour with nulls as sort_array ### How was this patch tested? Tests added. This is the link to the [jira](https://issues.apache.org/jira/browse/SPARK-29020) Closes #25728 from Gschiavon/improving-array-sort. Lead-authored-by: gschiavon <german.schiavon@lifullconnect.com> Co-authored-by: Takuya UESHIN <ueshin@databricks.com> Co-authored-by: gschiavon <Gschiavon@users.noreply.github.com> Signed-off-by: HyukjinKwon <gurwls223@apache.org>	2019-11-18 16:07:05 +09:00
xy_xin	d83cacfcf5	[SPARK-29907][SQL] Move DELETE/UPDATE/MERGE relative rules to dmlStatementNoWith to support cte ### What changes were proposed in this pull request? SPARK-27444 introduced `dmlStatementNoWith` so that any DML that needs CTE support can leverage it. It would be better to move the DELETE/UPDATE/MERGE rules to `dmlStatementNoWith`. ### Why are the changes needed? With this change, we can support syntax like "With t AS (SELECT) DELETE FROM xxx", and likewise for UPDATE/MERGE. ### Does this PR introduce any user-facing change? No. ### How was this patch tested? New cases added. Closes #26536 from xianyinxin/SPARK-29907. Authored-by: xy_xin <xianyin.xxy@alibaba-inc.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com>	2019-11-18 11:48:56 +08:00
Maxim Gekk	5eb8973f87	[SPARK-29930][SQL] Remove SQL configs declared to be removed in Spark 3.0 ### What changes were proposed in this pull request? In the PR, I propose to remove the following SQL configs: 1. `spark.sql.fromJsonForceNullableSchema` 2. `spark.sql.legacy.compareDateTimestampInTimestamp` 3. `spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation` that are declared to be removed in Spark 3.0 ### Why are the changes needed? To make code cleaner and improve maintainability. ### Does this PR introduce any user-facing change? Yes ### How was this patch tested? By `TypeCoercionSuite`, `JsonExpressionsSuite` and `DDLSuite`. Closes #26559 from MaxGekk/remove-sql-configs. Authored-by: Maxim Gekk <max.gekk@gmail.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-11-17 10:14:04 -08:00
fuwhu	388a737b98	[SPARK-29858][SQL] ALTER DATABASE (SET DBPROPERTIES) should look up catalog like v2 commands ### What changes were proposed in this pull request? Add AlterNamespaceSetPropertiesStatement, AlterNamespaceSetProperties and AlterNamespaceSetPropertiesExec to make ALTER DATABASE (SET DBPROPERTIES) command look up catalog like v2 commands. ### Why are the changes needed? It's important to make all the commands have the same catalog/namespace resolution behavior, to avoid confusing end-users. ### Does this PR introduce any user-facing change? Yes, add "ALTER NAMESPACE ... SET (DBPROPERTIES \| PROPERTIES) ..." whose function is same as "ALTER DATABASE ... SET DBPROPERTIES ..." and "ALTER SCHEMA ... SET DBPROPERTIES ...". ### How was this patch tested? New unit test Closes #26551 from fuwhu/SPARK-29858. Authored-by: fuwhu <bestwwg@163.com> Signed-off-by: Dongjoon Hyun <dhyun@apple.com>	2019-11-16 19:50:02 -08:00


4012 commits