### What changes were proposed in this pull request?
I execute some query as` select get_json_object(ytag, '$.y1') AS y1 from t4`; SparkSQL return null but Hive return correct results.
In my production environment, ytag is a json wrapped by single quotes,as follows
```
{'y1': 'shuma', 'y2': 'shuma:shouji'}
{'y1': 'jiaoyu', 'y2': 'jiaoyu:gaokao'}
{'y1': 'yule', 'y2': 'yule:mingxing'}
```
Then l realized some functions including get_json_object and json_tuple does not support single quotes json parsing.
So l provide this PR to resolve the question.
### Why are the changes needed?
Enabled for Hive compatibility
### Does this PR introduce any user-facing change?
NO
### How was this patch tested?
NEW TESTS
Closes#26965 from wenfang6/enableSingleQuotesJsonForSparkSQL.
Authored-by: wenfang <wenfang@360.cn>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Rename `spark.sql.streaming.forceDeleteTempCheckpointLocation` to `spark.sql.streaming.forceDeleteTempCheckpointLocation.enabled`.
### Why are the changes needed?
To follow the other boolean conf naming convention.
### Does this PR introduce any user-facing change?
No, as this config is newly added in 3.0.
### How was this patch tested?
Pass Jenkins.
Closes#26981 from Ngone51/SPARK-26389-FOLLOWUP.
Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
accuracyExpression can accept Long which may cause overflow error.
accuracyExpression can accept fractions which are implicitly floored.
accuracyExpression can accept null which is implicitly changed to 0.
percentageExpression can accept null but cause MatchError.
percentageExpression can accept ArrayType(_, nullable=true) in which the nulls are implicitly changed to zeros.
##### cases
```sql
select percentile_approx(10.0, 0.5, 2147483648); -- overflow and fail
select percentile_approx(10.0, 0.5, 4294967297); -- overflow but success
select percentile_approx(10.0, 0.5, null); -- null cast to 0
select percentile_approx(10.0, 0.5, 1.2); -- 1.2 cast to 1
select percentile_approx(10.0, null, 1); -- scala.MatchError
select percentile_approx(10.0, array(0.2, 0.4, null), 1); -- null cast to zero.
```
##### behavior before
```sql
+select percentile_approx(10.0, 0.5, 2147483648)
+org.apache.spark.sql.AnalysisException
+cannot resolve 'percentile_approx(10.0BD, CAST(0.5BD AS DOUBLE), CAST(2147483648L AS INT))' due to data type mismatch: The accuracy provided must be a positive integer literal (current value = -2147483648); line 1 pos 7
+
+select percentile_approx(10.0, 0.5, 4294967297)
+10.0
+
+select percentile_approx(10.0, 0.5, null)
+org.apache.spark.sql.AnalysisException
+cannot resolve 'percentile_approx(10.0BD, CAST(0.5BD AS DOUBLE), CAST(NULL AS INT))' due to data type mismatch: The accuracy provided must be a positive integer literal (current value = 0); line 1 pos 7
+
+select percentile_approx(10.0, 0.5, 1.2)
+10.0
+
+select percentile_approx(10.0, null, 1)
+scala.MatchError
+null
+
+
+select percentile_approx(10.0, array(0.2, 0.4, null), 1)
+[10.0,10.0,10.0]
```
##### behavior after
```sql
+select percentile_approx(10.0, 0.5, 2147483648)
+10.0
+
+select percentile_approx(10.0, 0.5, 4294967297)
+10.0
+
+select percentile_approx(10.0, 0.5, null)
+org.apache.spark.sql.AnalysisException
+cannot resolve 'percentile_approx(10.0BD, 0.5BD, NULL)' due to data type mismatch: argument 3 requires integral type, however, 'NULL' is of null type.; line 1 pos 7
+
+select percentile_approx(10.0, 0.5, 1.2)
+org.apache.spark.sql.AnalysisException
+cannot resolve 'percentile_approx(10.0BD, 0.5BD, 1.2BD)' due to data type mismatch: argument 3 requires integral type, however, '1.2BD' is of decimal(2,1) type.; line 1 pos 7
+
+select percentile_approx(10.0, null, 1)
+java.lang.IllegalArgumentException
+The value of percentage must be be between 0.0 and 1.0, but got null
+
+select percentile_approx(10.0, array(0.2, 0.4, null), 1)
+java.lang.IllegalArgumentException
+Each value of the percentage array must be be between 0.0 and 1.0, but got [0.2,0.4,null]
```
### Why are the changes needed?
bug fix
### Does this PR introduce any user-facing change?
yes, fix some improper usages of percentile_approx as cases list above
### How was this patch tested?
add ut
Closes#26905 from yaooqinn/SPARK-30266.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Set `isFinalPlan=true` before posting the final AdaptiveSparkPlan event (`SparkListenerSQLAdaptiveExecutionUpdate`)
### Why are the changes needed?
Otherwise, any attempt to listen on the final event by pattern matching `isFinalPlan=true` would fail
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
All tests in `AdaptiveQueryExecSuite` are exteneded with a verification that a `SparkListenerSQLAdaptiveExecutionUpdate` event with `isFinalPlan=True` exists
Closes#26983 from manuzhang/spark-30331.
Authored-by: manuzhang <owenzhang1990@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
While querying using **desc** , if column name is not entered exactly as per the column name given during the table creation, the colstats are wrong. fetching of col stats has been made case insensitive.
### Why are the changes needed?
functions like **analyze**, etc support case insensitive retrieval of column data.
### Does this PR introduce any user-facing change?
NO
### How was this patch tested?
<!--
Unit test has been rewritten and tested.
Closes#26927 from PavithraRamachandran/desc_caseinsensitive.
Authored-by: Pavithra Ramachandran <pavi.rams@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
This is an alternative solution to https://github.com/apache/spark/pull/25095.
SQLMetrics use -1 as init value as a work around for [SPARK-11013](https://issues.apache.org/jira/browse/SPARK-11013.) However, it may bring out some badcases as https://github.com/apache/spark/pull/26726 reporting. In fact, we only need to reserve -1 when doing min max statistics in `SQLMetrics.stringValue` so that we can filter out those not initialized accumulators.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Existing UTs
Closes#26899 from WangGuangxin/sqlmetrics.
Authored-by: wangguangxin.cn <wangguangxin.cn@bytedance.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR adds and exposes the options, 'recursiveFileLookup' and 'pathGlobFilter' in file sources 'mergeSchema' in ORC, into documentation.
- `recursiveFileLookup` at file sources: https://github.com/apache/spark/pull/24830 ([SPARK-27627](https://issues.apache.org/jira/browse/SPARK-27627))
- `pathGlobFilter` at file sources: https://github.com/apache/spark/pull/24518 ([SPARK-27990](https://issues.apache.org/jira/browse/SPARK-27990))
- `mergeSchema` at ORC: https://github.com/apache/spark/pull/24043 ([SPARK-11412](https://issues.apache.org/jira/browse/SPARK-11412))
**Note that** `timeZone` option was not moved from `DataFrameReader.options` as I assume it will likely affect other datasources as well once DSv2 is complete.
### Why are the changes needed?
To document available options in sources properly.
### Does this PR introduce any user-facing change?
In PySpark, `pathGlobFilter` can be set via `DataFrameReader.(text|orc|parquet|json|csv)` and `DataStreamReader.(text|orc|parquet|json|csv)`.
### How was this patch tested?
Manually built the doc and checked the output. Option setting in PySpark is rather a logical change. I manually tested one only:
```bash
$ ls -al tmp
...
-rw-r--r-- 1 hyukjin.kwon staff 3 Dec 20 12:19 aa
-rw-r--r-- 1 hyukjin.kwon staff 3 Dec 20 12:19 ab
-rw-r--r-- 1 hyukjin.kwon staff 3 Dec 20 12:19 ac
-rw-r--r-- 1 hyukjin.kwon staff 3 Dec 20 12:19 cc
```
```python
>>> spark.read.text("tmp", pathGlobFilter="*c").show()
```
```
+-----+
|value|
+-----+
| ac|
| cc|
+-----+
```
Closes#26958 from HyukjinKwon/doc-followup.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR is a follow-up to https://github.com/apache/spark/pull/25001
### Why are the changes needed?
No
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Pass the Jenkins with the newly update test files.
Closes#26949 from beliefer/uncomment-like-escape-tests.
Authored-by: gengjiaan <gengjiaan@360.cn>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Fixed typo in `docs` directory and in other directories
1. Find typo in `docs` and apply fixes to files in all directories
2. Fix `the the` -> `the`
### Why are the changes needed?
Better readability of documents
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
No test needed
Closes#26976 from kiszk/typo_20191221.
Authored-by: Kazuaki Ishizaki <ishizaki@jp.ibm.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
a followup of https://github.com/apache/spark/pull/26576, which mistakenly removes the UI update of the final plan.
### Why are the changes needed?
fix mistake.
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
existing tests
Closes#26968 from cloud-fan/fix.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
When querying a partitioned table with format `org.apache.hive.hcatalog.data.JsonSerDe` and more than one task runs in each executor concurrently, the following exception is encountered:
`java.lang.ClassCastException: java.util.ArrayList cannot be cast to org.apache.hive.hcatalog.data.HCatRecord`
The exception occurs in `HadoopTableReader.fillObject`.
`org.apache.hive.hcatalog.data.JsonSerDe#initialize` populates a `cachedObjectInspector` field by calling `HCatRecordObjectInspectorFactory.getHCatRecordObjectInspector`, which is not thread-safe; this `cachedObjectInspector` is returned by `JsonSerDe#getObjectInspector`.
We protect against this Hive bug by synchronizing on an object when we need to call `initialize` on `org.apache.hadoop.hive.serde2.Deserializer` instances (which may be `JsonSerDe` instances). By doing so, the `ObjectInspector` for the `Deserializer` of the partitions of the JSON table and that of the table `SerDe` are the same cached `ObjectInspector` and `HadoopTableReader.fillObject` then works correctly. (If the `ObjectInspector`s are different, then a bug in `HCatRecordObjectInspector` causes an `ArrayList` to be created instead of an `HCatRecord`, resulting in the `ClassCastException` that is seen.)
### Why are the changes needed?
To avoid HIVE-15773 / HIVE-21752.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Tested manually on a cluster with a partitioned JSON table and running a query using more than one core per executor. Before this change, the ClassCastException happens consistently. With this change it does not happen.
Closes#26895 from wypoon/SPARK-17398.
Authored-by: Wing Yew Poon <wypoon@cloudera.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
### What changes were proposed in this pull request?
Remove usages of Guava that no longer work in Guava 27, and replace with workalikes. I'll comment on key types of changes below.
### Why are the changes needed?
Hadoop 3.2.1 uses Guava 27, so this helps us avoid problems running on Hadoop 3.2.1+ and generally lowers our exposure to Guava.
### Does this PR introduce any user-facing change?
Should not be, but see notes below on hash codes and toString.
### How was this patch tested?
Existing tests will verify whether these changes break anything for Guava 14.
I manually built with an updated version and it compiles with Guava 27; tests running manually locally now.
Closes#26911 from srowen/SPARK-30272.
Authored-by: Sean Owen <srowen@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Add batching support in Alter table add partition flow. Also calculate new partition sizes faster by doing listing in parallel.
### Why are the changes needed?
This PR split the the single createPartitions() call AlterTableAddPartition flow into smaller batches, which could prevent
- SocketTimeoutException: Adding thousand of partitions in Hive metastore itself takes lot of time. Because of this hive client fails with SocketTimeoutException.
- Hive metastore from OOM (caused by millions of partitions).
It will also try to gather stats (total size of all files in all new partitions) faster by parallely listing the new partition paths.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Added UT.
Also tested on a cluster in HDI with 15000 partitions with remote metastore server. Without batching - operation fails with SocketTimeoutException, With batching it finishes in 25 mins.
Closes#26569 from prakharjain09/add_partition_batching_r1.
Authored-by: Prakhar Jain <prakharjain09@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
In this PR, For a given metrics id we are checking if the driver side accumulator's value is greater than max of all stages value. If it's true, then we are removing that entry from the Hashmap. By doing this, for this metrics, "driver" would be displayed on the UI(As the driver would have the maximum value)
### Why are the changes needed?
This PR fixes https://issues.apache.org/jira/browse/SPARK-30300. Currently driver's metric value is not compared while caluculating the max.
### Does this PR introduce any user-facing change?
For the metrics where driver's value is greater than max of all stages, this is the change.
Previous : (min, median, max (stageId 0( attemptId 1): taskId 2))
Now: (min, median, max (driver))
### How was this patch tested?
Ran unit tests.
Closes#26941 from nartal1/SPARK-30300.
Authored-by: Niranjan Artal <nartal@nvidia.com>
Signed-off-by: Thomas Graves <tgraves@apache.org>
### What changes were proposed in this pull request?
When date and timestamp values are fields of arrays, maps, etc, we convert them to hive string using `toString`. This makes the result wrong before the default transition ’1582-10-15‘.
https://bugs.openjdk.java.net/browse/JDK-8061577?focusedCommentId=13566712&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-13566712
cases to reproduce:
```sql
+-- !query 47
+select array(cast('1582-10-13' as date), date '1582-10-14', date '1582-10-15', null)
+-- !query 47 schema
+struct<array(CAST(1582-10-13 AS DATE), DATE '1582-10-14', DATE '1582-10-15', CAST(NULL AS DATE)):array<date>>
+-- !query 47 output
+[1582-10-03,1582-10-04,1582-10-15,null]
+
+
+-- !query 48
+select cast('1582-10-13' as date), date '1582-10-14', date '1582-10-15'
+-- !query 48 schema
+struct<CAST(1582-10-13 AS DATE):date,DATE '1582-10-14':date,DATE '1582-10-15':date>
+-- !query 48 output
+1582-10-13 1582-10-14 1582-10-15
```
other refencences
https://github.com/h2database/h2database/issues/831
### Why are the changes needed?
bug fix
### Does this PR introduce any user-facing change?
yes, complex types containing datetimes in `spark-sql `script and thrift server can result same as self-contained spark app or `spark-shell` script
### How was this patch tested?
add uts
Closes#26942 from yaooqinn/SPARK-30301.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
AQE need catch the exception when doing materialize. And then user can get more information about the exception when enable AQE.
### Why are the changes needed?
provide more cause about the exception when doing materialize
### Does this PR introduce any user-facing change?
Before this PR, the error in the added unit test is
java.lang.RuntimeException: Invalid bucket file file:///${SPARK_HOME}/assembly/spark-warehouse/org.apache.spark.sql.execution.adaptive.AdaptiveQueryExecSuite/bucketed_table/part-00000-3551343c-d003-4ada-82c8-45c712a72efe-c000.snappy.parquet
After this PR, the error in the added unit test is:
org.apache.spark.SparkException: Adaptive execution failed due to stage materialization failures.
### How was this patch tested?
Add a new ut
Closes#26931 from JkSelf/catchMoreException.
Authored-by: jiake <ke.a.jia@intel.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
### What changes were proposed in this pull request?
When we reuse exchanges in AQE, what we produce is `ReuseQueryStage(QueryStage(Exchange))`. This PR changes it to `QueryStage(ReusedExchange(Exchange))`.
This PR also fixes an issue in `LocalShuffleReaderExec.outputPartitioning`. We can only preserve the partitioning if we read one mapper per task.
### Why are the changes needed?
`QueryStage` is light-weighted and we don't need to reuse its instance. What we really care is to reuse the exchange instance, which has heavy states (e.g. broadcasted valued, submitted map stage).
To simplify the framework, we should use the existing `ReusedExchange` node to do the reuse work, instead of creating a new node.
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
existing tests
Closes#26952 from cloud-fan/aqe.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
### What changes were proposed in this pull request?
This PR is to use `expressionWithAlias` for remaining functions for which alias name can be used. Remaining functions are:
`Average, First, Last, ApproximatePercentile, StddevSamp, VarianceSamp`
PR https://github.com/apache/spark/pull/26712 introduced `expressionWithAlias`
### Why are the changes needed?
Error message is wrong when alias name is used for above mentioned functions.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Manually
Closes#26808 from amanomer/fncAlias.
Lead-authored-by: Aman Omer <amanomer1996@gmail.com>
Co-authored-by: Aman Omer <40591404+amanomer@users.noreply.github.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Added the `sealed` keyword to the `Filter` class
### Why are the changes needed?
To do not miss handling of new filters in a datasource in the future. For example, `AlwaysTrue` and `AlwaysFalse` were added recently by https://github.com/apache/spark/pull/23606
### Does this PR introduce any user-facing change?
Should not.
### How was this patch tested?
By existing tests.
Closes#26950 from MaxGekk/sealed-filter.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This patch addresses missing metric, the number of output rows for streaming aggregation with append mode. Other modes are correctly measuring it.
### Why are the changes needed?
Without the patch, the value for such metric is always 0.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Unit test added. Also manually tested with below query:
> query
```
import spark.implicits._
spark.conf.set("spark.sql.shuffle.partitions", "5")
val df = spark.readStream
.format("rate")
.option("rowsPerSecond", 1000)
.load()
.withWatermark("timestamp", "5 seconds")
.selectExpr("timestamp", "mod(value, 100) as mod", "value")
.groupBy(window($"timestamp", "10 seconds"), $"mod")
.agg(max("value").as("max_value"), min("value").as("min_value"), avg("value").as("avg_value"))
val query = df
.writeStream
.format("memory")
.option("queryName", "test")
.outputMode("append")
.start()
query.awaitTermination()
```
> before the patch
![screenshot-before-SPARK-29450](https://user-images.githubusercontent.com/1317309/69023217-58d7bc80-0a01-11ea-8cac-40f1cced6d16.png)
> after the patch
![screenshot-after-SPARK-29450](https://user-images.githubusercontent.com/1317309/69023221-5c6b4380-0a01-11ea-8a66-7bf1b7d09fc7.png)
Closes#26104 from HeartSaVioR/SPARK-29450.
Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
As mentioned in https://github.com/apache/spark/pull/26548#pullrequestreview-334345333, some test cases in `RecordBinaryComparatorSuite` use a fixed arrayOffset when writing to long arrays, this could lead to weird stuff including crashing with a SIGSEGV.
This PR fix the problem by computing the arrayOffset based on `Platform.LONG_ARRAY_OFFSET`.
### How was this patch tested?
Tested locally. Previously, when we try to add `System.gc()` between write into long array and compare by RecordBinaryComparator, there is a chance to hit JVM crash with SIGSEGV like:
```
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x00007efc66970bcb, pid=11831, tid=0x00007efc0f9f9700
#
# JRE version: OpenJDK Runtime Environment (8.0_222-b10) (build 1.8.0_222-8u222-b10-1ubuntu1~16.04.1-b10)
# Java VM: OpenJDK 64-Bit Server VM (25.222-b10 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# V [libjvm.so+0x5fbbcb]
#
# Core dump written. Default location: /home/jenkins/workspace/sql/core/core or core.11831
#
# An error report file with more information is saved as:
# /home/jenkins/workspace/sql/core/hs_err_pid11831.log
#
# If you would like to submit a bug report, please visit:
# http://bugreport.java.com/bugreport/crash.jsp
#
```
After the fix those test cases didn't crash the JVM anymore.
Closes#26939 from jiangxb1987/rbc.
Authored-by: Xingbo Jiang <xingbo.jiang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
This reverts commit 8e667db5d8.
Closes#26940 from gengliangwang/revert_Spark_29629.
Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
Since [25001](https://github.com/apache/spark/pull/25001), spark support like escape syntax.
We should also sync the escape used by `LikeSimplification`.
Avoid optimize failed.
No.
Add UT.
Closes#26880 from ulysses-you/SPARK-30254.
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
We could get the Partitions Statistics Info.But in some specail case, The Info like totalSize,rawDataSize,rowCount maybe empty. When we do some ddls like
`desc formatted partition` ,the NumberFormatException is showed as below:
```
spark-sql> desc formatted table1 partition(year='2019', month='10', day='17', hour='23');
19/10/19 00:02:40 ERROR SparkSQLDriver: Failed in [desc formatted table1 partition(year='2019', month='10', day='17', hour='23')]
java.lang.NumberFormatException: Zero length BigInteger
at java.math.BigInteger.(BigInteger.java:411)
at java.math.BigInteger.(BigInteger.java:597)
at scala.math.BigInt$.apply(BigInt.scala:77)
at org.apache.spark.sql.hive.client.HiveClientImpl$$anonfun$31.apply(HiveClientImpl.scala:1056)
```
Although we can use 'Analyze table partition ' to update the totalSize,rawDataSize or rowCount, it's unresonable for normal SQL to throw NumberFormatException for Empty totalSize.We should fix the empty case when readHiveStats.
### Why are the changes needed?
This is a related to the robustness of the code and may lead to unexpected exception in some unpredictable situation.Here is the case:
<img width="981" alt="image" src="https://user-images.githubusercontent.com/20614350/70845771-7b88b400-1e8d-11ea-95b0-df5c58097d7d.png">
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
manual
Closes#26892 from southernriver/SPARK-30262.
Authored-by: chenliang <southernriver@163.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This pull request add support for Arrow MapType into Spark SQL.
### Why are the changes needed?
Without this change User's of spark are not able to query data in spark if one of columns is stored as map and Apache Arrow execution mode is preferred by user.
More info: https://issues.apache.org/jira/projects/SPARK/issues/SPARK-29493
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Introduced few unit tests around map type in existing arrow test suit
Closes#26512 from jalpan-randeri/feature-arrow-java-map-type.
Authored-by: Jalpan Randeri <randerij@amazon.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
fix listing up format issues in Dataset API Doc (scala & java)
### Why are the changes needed?
improve doc
### Does this PR introduce any user-facing change?
yes, API doc changing
### How was this patch tested?
no
Closes#26922 from yaooqinn/datasetdoc.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Use `TypeCoercion.findWiderTypeForTwo()` instead of `TypeCoercion.findTightestCommonType()` while preprocessing `inputTypes` in `ArrayContains`.
### Why are the changes needed?
`TypeCoercion.findWiderTypeForTwo()` also handles cases for DecimalType.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Test cases to be added.
Closes#26811 from amanomer/29600.
Authored-by: Aman Omer <amanomer1996@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
It's an obvious bug: currently when analyzing partition stats, we use old table stats to compare with newly computed stats to decide whether it should update stats or not.
### Why are the changes needed?
bug fix
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
add new tests
Closes#26908 from wzhfy/failto_update_part_stats.
Authored-by: Zhenhua Wang <wzh_zju@163.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
A followup for #26699, clear the size field for interval column cache, which is needless and can reduce the memory cost.
### Why are the changes needed?
followup
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
existing ut.
Closes#26906 from yaooqinn/SPARK-30066-f.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
Now spark use `ObjectInspectorCopyOption.JAVA` as oi option which will convert any string to UTF-8 string. When write non UTF-8 code data, then `EFBFBD` will appear.
We should use `ObjectInspectorCopyOption.DEFAULT` to support pass the bytes.
### Why are the changes needed?
Here is the way to reproduce:
1. make a file contains 16 radix 'AABBCC' which is not the UTF-8 code.
2. create table test1 (c string) location '$file_path';
3. select hex(c) from test1; // AABBCC
4. craete table test2 (c string) as select c from test1;
5. select hex(c) from test2; // EFBFBDEFBFBDEFBFBD
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Closes#26831 from ulysses-you/SPARK-30201.
Authored-by: ulysses <youxiduo@weidian.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR applies the current namespace for the single-part table name if the current catalog is a non-session catalog.
Note that the reason the current namespace is not applied for the session catalog is that the single-part name could be referencing a temp view which doesn't belong to any namespaces. The empty namespace for a table inside the session catalog is resolved by the session catalog implementation.
### Why are the changes needed?
It's fixing the following bug where the current namespace is not respected:
```
sql("CREATE TABLE testcat.ns.t USING foo AS SELECT 1 AS id")
sql("USE testcat.ns")
sql("SHOW CURRENT NAMESPACE").show
+-------+---------+
|catalog|namespace|
+-------+---------+
|testcat| ns|
+-------+---------+
// `t` is not resolved since the current namespace `ns` is not used.
sql("DESCRIBE t").show
Failed to analyze query: org.apache.spark.sql.AnalysisException: Table not found: t;;
```
### Does this PR introduce any user-facing change?
Yes, the above `DESCRIBE` command will succeed.
### How was this patch tested?
Added tests.
Closes#26894 from imback82/current_namespace.
Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
1. Revert "Preparing development version 3.0.1-SNAPSHOT": 56dcd79
2. Revert "Preparing Spark release v3.0.0-preview2-rc2": c216ef1
### Why are the changes needed?
Shouldn't change master.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
manual test:
https://github.com/apache/spark/compare/5de5e46..wangyum:revert-masterCloses#26915 from wangyum/revert-master.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Yuming Wang <wgyumg@gmail.com>
### What changes were proposed in this pull request?
In the PR, I propose to move all tests that use deprecated Spark APIs to separate test classes, and add the annotation:
```scala
deprecated("This test suite will be removed.", "3.0.0")
```
The annotation suppress warnings from already deprecated methods and classes.
### Why are the changes needed?
The warnings about deprecated Spark APIs in tests does not indicate any issues because the tests use such APIs intentionally. Eliminating the warnings allows to highlight other warnings that could show real problems.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
By existing test suites and by
- DeprecatedAvroFunctionsSuite
- DeprecatedDateFunctionsSuite
- DeprecatedDatasetAggregatorSuite
- DeprecatedStreamingAggregationSuite
- DeprecatedWholeStageCodegenSuite
Closes#26885 from MaxGekk/eliminate-deprecate-warnings.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
SPARK-30209 discusses about adding additional metrics such as stageId, attempId and taskId for max metrics. We have the data required to display in LiveStageMetrics. Need to capture and pass these metrics to display on the UI. To minimize memory used for variables, we are saving maximum of each metric id per stage. So per stage additional memory usage is (#metrics * 4 * sizeof(Long)).
Then max is calculated for each metric id among all stages which is passed in the stringValue method. Memory used is minimal. Ran the benchmark for runtime. Stage.Proc time has increased to around 1.5-2.5x but the Aggregate time has decreased.
### Why are the changes needed?
These additional metrics stageId, attemptId and taskId could help in debugging the jobs quicker. For a given operator, it will be easy to identify the task which is taking maximum time to complete from the SQL tab itself.
### Does this PR introduce any user-facing change?
Yes. stageId, attemptId and taskId is shown only for executor side metrics. For driver metrics, "(driver)" is displayed on UI.
![image (3)](https://user-images.githubusercontent.com/50492963/70763041-929d9980-1d07-11ea-940f-88ac6bdce9b5.png)
"Driver"
![image (4)](https://user-images.githubusercontent.com/50492963/70763043-94675d00-1d07-11ea-95ab-3478728cb435.png)
### How was this patch tested?
Manually tested, ran benchmark script for runtime.
Closes#26843 from nartal1/SPARK-30209.
Authored-by: Niranjan Artal <nartal@nvidia.com>
Signed-off-by: Thomas Graves <tgraves@apache.org>
### What changes were proposed in this pull request?
This PR adds the documentation of the new `mode` added to `Dataset.explain`.
### Why are the changes needed?
To let users know the new modes.
### Does this PR introduce any user-facing change?
No (doc-only change).
### How was this patch tested?
Manually built the doc:
![Screen Shot 2019-12-16 at 3 34 28 PM](https://user-images.githubusercontent.com/6477701/70884617-d64f1680-2019-11ea-9336-247ade7f8768.png)
Closes#26903 from HyukjinKwon/SPARK-30200-doc.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
update DS v2 API to support add/alter column with column position
### Why are the changes needed?
We have a parser rule for column position, but we fail the query if it's specified, because the builtin catalog can't support add/alter column with column position.
Since we have the catalog plugin API now, we should let the catalog implementation to decide if it supports column position or not.
### Does this PR introduce any user-facing change?
not yet
### How was this patch tested?
new tests
Closes#26817 from cloud-fan/parser.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
As discussed in https://github.com/apache/spark/pull/26741#discussion_r357504518, `LookupCatalog.AsTemporaryViewIdentifier` is no longer used and can be removed.
### Why are the changes needed?
Code clean up
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Removed tests that were testing solely `AsTemporaryViewIdentifier` extractor.
Closes#26897 from imback82/30104-followup.
Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR proposes to fix documentation for slide function. Fixed the spacing issue and added some parameter related info.
### Why are the changes needed?
Documentation improvement
### Does this PR introduce any user-facing change?
No (doc-only change).
### How was this patch tested?
Manually tested by documentation build.
Closes#26896 from bboutkov/pyspark_doc_fix.
Authored-by: Boris Boutkov <boris.boutkov@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR mainly targets:
1. Expose only explain(mode: String) in Scala side
2. Clean up related codes
- Hide `ExplainMode` under private `execution` package. No particular reason but just because `ExplainUtils` exists there
- Use `case object` + `trait` pattern in `ExplainMode` to look after `ParseMode`.
- Move `Dataset.toExplainString` to `QueryExecution.explainString` to look after `QueryExecution.simpleString`, and deduplicate the codes at `ExplainCommand`.
- Use `ExplainMode` in `ExplainCommand` too.
- Add `explainString` to `PythonSQLUtils` to avoid unexpected test failure of PySpark during refactoring Scala codes side.
### Why are the changes needed?
To minimised exposed APIs, deduplicate, and clean up.
### Does this PR introduce any user-facing change?
`Dataset.explain(mode: ExplainMode)` will be removed (which only exists in master).
### How was this patch tested?
Manually tested and existing tests should cover.
Closes#26898 from HyukjinKwon/SPARK-30200-followup.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
In the PR, I propose to replace `setJacksonOptions()` in `JSONOptions` by `buildJsonFactory()` which builds `JsonFactory` using `JsonFactoryBuilder`. This allows to avoid using **deprecated** feature configurations from `JsonParser.Feature`.
### Why are the changes needed?
- The changes eliminate the following compilation warnings in `sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JSONOptions.scala`:
```
Warning:Warning:line (137)Java enum ALLOW_NUMERIC_LEADING_ZEROS in Java enum Feature is deprecated: see corresponding Javadoc for more information.
factory.configure(JsonParser.Feature.ALLOW_NUMERIC_LEADING_ZEROS, allowNumericLeadingZeros)
Warning:Warning:line (138)Java enum ALLOW_NON_NUMERIC_NUMBERS in Java enum Feature is deprecated: see corresponding Javadoc for more information.
factory.configure(JsonParser.Feature.ALLOW_NON_NUMERIC_NUMBERS, allowNonNumericNumbers)
Warning:Warning:line (139)Java enum ALLOW_BACKSLASH_ESCAPING_ANY_CHARACTER in Java enum Feature is deprecated: see corresponding Javadoc for more information.
factory.configure(JsonParser.Feature.ALLOW_BACKSLASH_ESCAPING_ANY_CHARACTER,
Warning:Warning:line (141)Java enum ALLOW_UNQUOTED_CONTROL_CHARS in Java enum Feature is deprecated: see corresponding Javadoc for more information.
factory.configure(JsonParser.Feature.ALLOW_UNQUOTED_CONTROL_CHARS, allowUnquotedControlChars)
```
- This put together building JsonFactory and set options from JSONOptions. So, we will not forget to call `setJacksonOptions` in the future.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
By `JsonSuite`, `JsonFunctionsSuite`, `JsonExpressionsSuite`.
Closes#26797 from MaxGekk/eliminate-warning.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Fix bug : CREATE TABLE throw error when session catalog specified explicitly.
### Why are the changes needed?
Currently, Spark throw error when the session catalog is specified explicitly in "CREATE TABLE" and "CREATE TABLE AS SELECT" command, eg.
> CREATE TABLE spark_catalog.tbl USING json AS SELECT 1 AS i;
the error message is like below:
> 19/12/14 10:56:08 INFO HiveMetaStore: 0: get_table : db=spark_catalog tbl=tbl
> 19/12/14 10:56:08 INFO audit: ugi=fuwhu ip=unknown-ip-addr cmd=get_table : db=spark_catalog tbl=tbl
> 19/12/14 10:56:08 INFO HiveMetaStore: 0: get_database: spark_catalog
> 19/12/14 10:56:08 INFO audit: ugi=fuwhu ip=unknown-ip-addr cmd=get_database: spark_catalog
> 19/12/14 10:56:08 WARN ObjectStore: Failed to get database spark_catalog, returning NoSuchObjectException
> Error in query: Database 'spark_catalog' not found;
### Does this PR introduce any user-facing change?
Yes, after this PR, "CREATE TALBE" and "CREATE TABLE AS SELECT" can complete successfully when session catalog "spark_catalog" specified explicitly.
### How was this patch tested?
New unit tests added.
Closes#26887 from fuwhu/SPARK-30259.
Authored-by: fuwhu <bestwwg@163.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This pr is a followup of #26861 to address minor comments from viirya.
### Why are the changes needed?
For better error messages.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Manually tested.
Closes#26886 from maropu/SPARK-30231-FOLLOWUP.
Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Columnar execution support for interval types
### Why are the changes needed?
support cache tables with interval columns
improve performance too
### Does this PR introduce any user-facing change?
Yes cache table with accept interval columns
### How was this patch tested?
add ut
Closes#26699 from yaooqinn/SPARK-30066.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Add a timeout configuration for StreamingQuery.stop()
### Why are the changes needed?
The stop() method on a Streaming Query awaits the termination of the stream execution thread. However, the stream execution thread may block forever depending on the streaming source implementation (like in Kafka, which runs UninterruptibleThreads).
This causes control flow applications to hang indefinitely as well. We'd like to introduce a timeout to stop the execution thread, so that the control flow thread can decide to do an action if a timeout is hit.
### Does this PR introduce any user-facing change?
By default, no. If the timeout configuration is set, then a TimeoutException will be thrown if a stream cannot be stopped within the given timeout.
### How was this patch tested?
Unit tests
Closes#26771 from brkyvz/stopTimeout.
Lead-authored-by: Burak Yavuz <brkyvz@gmail.com>
Co-authored-by: Burak Yavuz <burak@databricks.com>
Signed-off-by: Burak Yavuz <brkyvz@gmail.com>
This reverts commit cada5beef7.
Closes#26883 from gengliangwang/revert.
Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
If a table name is qualified with session catalog name `spark_catalog`, the `DROP TABLE` command fails.
For example, the following
```
sql("CREATE TABLE tbl USING json AS SELECT 1 AS i")
sql("DROP TABLE spark_catalog.tbl")
```
fails with:
```
org.apache.spark.sql.catalyst.analysis.NoSuchDatabaseException: Database 'spark_catalog' not found;
at org.apache.spark.sql.catalyst.catalog.ExternalCatalog.requireDbExists(ExternalCatalog.scala:42)
at org.apache.spark.sql.catalyst.catalog.ExternalCatalog.requireDbExists$(ExternalCatalog.scala:40)
at org.apache.spark.sql.catalyst.catalog.InMemoryCatalog.requireDbExists(InMemoryCatalog.scala:45)
at org.apache.spark.sql.catalyst.catalog.InMemoryCatalog.tableExists(InMemoryCatalog.scala:336)
```
This PR correctly resolves `spark_catalog` as a catalog.
### Why are the changes needed?
It's fixing a bug.
### Does this PR introduce any user-facing change?
Yes, now, the `spark_catalog.tbl` in the above example is dropped as expected.
### How was this patch tested?
Added a test.
Closes#26878 from imback82/fix_drop_table.
Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This pr intends to support explain modes implemented in #26829 for PySpark.
### Why are the changes needed?
For better debugging info. in PySpark dataframes.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Added UTs.
Closes#26861 from maropu/ExplainModeInPython.
Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This patch adds close() method to the DataWriter interface, which will become the place to cleanup the resource.
### Why are the changes needed?
The lifecycle of DataWriter instance ends at either commit() or abort(). That makes datasource implementors to feel they can place resource cleanup in both sides, but abort() can be called when commit() fails; so they have to ensure they don't do double-cleanup if cleanup is not idempotent.
### Does this PR introduce any user-facing change?
Depends on the definition of user; if they're developers of custom DSv2 source, they have to add close() in their DataWriter implementations. It's OK to just add close() with empty content as they should have already dealt with resource cleanup in commit/abort, but they would love to migrate the resource cleanup logic to close() as it avoids double cleanup. If they're just end users using the provided DSv2 source (regardless of built-in/3rd party), no change.
### How was this patch tested?
Existing tests.
Closes#26855 from HeartSaVioR/SPARK-30227.
Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Add DropFunctionStatement and make DROP FUNCTION go through the same catalog/table resolution framework of v2 commands.
### Why are the changes needed?
It's important to make all the commands have the same table resolution behavior, to avoid confusing
DROP FUNCTION namespace.function
### Does this PR introduce any user-facing change?
Yes. When running DROP FUNCTION namespace.function Spark fails the command if the current catalog is set to a v2 catalog.
### How was this patch tested?
Unit tests.
Closes#26854 from planga82/feature/SPARK-30040_DropFunctionV2Catalog.
Authored-by: Pablo Langa <soypab@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR exposes the existing logic for nested schema pruning to all sources, which is in line with the description of `SupportsPushDownRequiredColumns` .
Right now, `SchemaPruning` (rule, not helper utility) is applied in the optimizer directly on certain instances of `Table` ignoring `SupportsPushDownRequiredColumns` that is part of `ScanBuilder`. I think it would be cleaner to perform schema pruning and filter push-down in one place. Therefore, this PR moves all the logic into `V2ScanRelationPushDown`.
### Why are the changes needed?
This change allows all V2 data sources to benefit from nested column pruning (if they support it).
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
This PR mostly relies on existing tests. On top, it adds one test to verify that top-level schema pruning works as well as one test for predicates with subqueries.
Closes#26751 from aokolnychyi/nested-schema-pruning-ds-v2.
Authored-by: Anton Okolnychyi <aokolnychyi@apple.com>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
### What changes were proposed in this pull request?
Check the partition column data type and only allow string and integral types in hive partition pruning.
### Why are the changes needed?
Currently we only support string and integral types in hive partition pruning, but the check is done for literals. If the predicate is `InSet`, then there is no literal and we may pass an unsupported partition predicate to Hive and cause problems.
### Does this PR introduce any user-facing change?
yes. fix a bug. A query fails before and can run now.
### How was this patch tested?
a new test
Closes#26871 from cloud-fan/bug.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Since [25001](https://github.com/apache/spark/pull/25001), spark support like escape syntax.
But '%' and '_' is the reserve char in `Like` expression. We can not use them as escape char.
### Why are the changes needed?
Avoid some unexpect problem when using like escape syntax.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Add UT.
Closes#26860 from ulysses-you/SPARK-30230.
Authored-by: ulysses <youxiduo@weidian.com>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
### What changes were proposed in this pull request?
This PR proposes to add `PushedFilters` into metadata to show the pushed filters in Parquet DSv2 implementation. In case of ORC, it is already added at https://github.com/apache/spark/pull/24719/files#diff-0fc82694b20da3cd2cbb07206920eef7R62-R64
### Why are the changes needed?
In order for users to be able to debug, and to match with ORC.
### Does this PR introduce any user-facing change?
```scala
spark.range(10).write.mode("overwrite").parquet("/tmp/foo")
spark.read.parquet("/tmp/foo").filter("5 > id").explain()
```
**Before:**
```
== Physical Plan ==
*(1) Project [id#20L]
+- *(1) Filter (isnotnull(id#20L) AND (5 > id#20L))
+- *(1) ColumnarToRow
+- BatchScan[id#20L] ParquetScan Location: InMemoryFileIndex[file:/tmp/foo], ReadSchema: struct<id:bigint>
```
**After:**
```
== Physical Plan ==
*(1) Project [id#13L]
+- *(1) Filter (isnotnull(id#13L) AND (5 > id#13L))
+- *(1) ColumnarToRow
+- BatchScan[id#13L] ParquetScan Location: InMemoryFileIndex[file:/tmp/foo], ReadSchema: struct<id:bigint>, PushedFilters: [IsNotNull(id), LessThan(id,5)]
```
### How was this patch tested?
Unittest were added and manually tested.
Closes#26857 from HyukjinKwon/SPARK-30162.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Fixed typo in exception message of HashedRelations
### Why are the changes needed?
Better exception messages
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
No tests needed
Closes#26822 from aaron-lau/master.
Authored-by: Aaron Lau <aaron.lau@datadoghq.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
- Replace `Seq[String]` by `Seq[_]` in `StopWordsRemoverSuite` because `String` type is unchecked due erasure.
- Throw an exception for default case in `MLTest.checkNominalOnDF` because we don't expect other attribute types currently.
- Explicitly cast float to double in `BigDecimal(y)`. This is what the `apply()` method does for `float`s.
- Replace deprecated `verifyZeroInteractions` by `verifyNoInteractions`.
- Equivalent replacement of `\0` by `\u0000` in `CSVExprUtilsSuite`
- Import `scala.language.implicitConversions` in `CollectionExpressionsSuite`, `HashExpressionsSuite` and in `ExpressionParserSuite`.
### Why are the changes needed?
The changes fix compiler warnings showed in the JIRA ticket https://issues.apache.org/jira/browse/SPARK-30170 . Eliminating the warning highlights other warnings which could take more attention to real problems.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
By existing test suites `StopWordsRemoverSuite`, `AnalysisExternalCatalogSuite`, `CSVExprUtilsSuite`, `CollectionExpressionsSuite`, `HashExpressionsSuite`, `ExpressionParserSuite` and sub-tests of `MLTest`.
Closes#26799 from MaxGekk/eliminate-warning-2.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
`add file "abc.txt"` and `add file 'abc.txt'` are not supported.
For these two spark sql gives `FileNotFoundException`.
Only `add file abc.txt` is supported currently.
After these changes path can be given as quoted text for ADD FILE, ADD JAR, LIST FILE, LIST JAR commands in spark-sql
### Why are the changes needed?
In many of the spark-sql commands (like create table ,etc )we write path in quoted format only. To maintain this consistency we should support quoted format with this command as well.
### Does this PR introduce any user-facing change?
Yes. Now users can write path with quotes.
### How was this patch tested?
Manually tested.
Closes#26779 from iRakson/SPARK-30150.
Authored-by: root1 <raksonrakesh@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This is a follow up to #26741 to address the following:
1. V2 catalog named `global_temp` should always be masked.
2. #26741 introduces `CatalogAndIdentifer` that supersedes `CatalogObjectIdentfier`. This PR removes `CatalogObjectIdentfier` and its usages and replace them with `CatalogAndIdentifer`.
3. `CatalogObjectIdentifier(catalog, ident) if !isSessionCatalog(catalog)` and `CatalogObjectIdentifier(catalog, ident) if isSessionCatalog(catalog)` are replaced with `NonSessionCatalogAndIdentifier` and `SessionCatalogAndIdentifier` respectively.
### Why are the changes needed?
To fix an existing with handling v2 catalog named `global_temp` and to simplify the code base.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Added new tests.
Closes#26853 from imback82/lookup_table.
Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Currently `ShuffleQueryStageExec `contain the mutable status, eg `mapOutputStatisticsFuture `variable. So It is not easy to pass when we copy `ShuffleQueryStageExec`. This PR will put the `mapOutputStatisticsFuture ` variable from `ShuffleQueryStageExec` to `ShuffleExchangeExec`. And then we can pass the value of `mapOutputStatisticsFuture ` when copying.
### Why are the changes needed?
In order to remove the mutable status in `ShuffleQueryStageExec`
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Existing uts
Closes#26846 from JkSelf/removeMutableVariable.
Authored-by: jiake <ke.a.jia@intel.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Add DescribeFunctionsStatement and make DESCRIBE FUNCTIONS go through the same catalog/table resolution framework of v2 commands.
### Why are the changes needed?
It's important to make all the commands have the same table resolution behavior, to avoid confusing
DESCRIBE FUNCTIONS namespace.function
### Does this PR introduce any user-facing change?
Yes. When running DESCRIBE FUNCTIONS namespace.function Spark fails the command if the current catalog is set to a v2 catalog.
### How was this patch tested?
Unit tests.
Closes#26840 from planga82/feature/SPARK-30038_DescribeFunction_V2Catalog.
Authored-by: Pablo Langa <soypab@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
See https://issues.apache.org/jira/browse/SPARK-30195 for the background; I won't repeat it here. This is sort of a grab-bag of related issues.
### Why are the changes needed?
To cross-compile with Scala 2.13 later.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Existing tests for 2.12. I've been manually checking that this actually resolves the compile problems in 2.13 separately.
Closes#26826 from srowen/SPARK-30195.
Authored-by: Sean Owen <srowen@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
In the PR, I propose new implementation of `fromDayTimeString` which strictly parses strings in day-time formats to intervals. New implementation accepts only strings that match to a pattern defined by the `from` and `to`. Here is the mapping of user's bounds and patterns:
- `[+|-]D+ H[H]:m[m]:s[s][.SSSSSSSSS]` for **DAY TO SECOND**
- `[+|-]D+ H[H]:m[m]` for **DAY TO MINUTE**
- `[+|-]D+ H[H]` for **DAY TO HOUR**
- `[+|-]H[H]:m[m]s[s][.SSSSSSSSS]` for **HOUR TO SECOND**
- `[+|-]H[H]:m[m]` for **HOUR TO MINUTE**
- `[+|-]m[m]:s[s][.SSSSSSSSS]` for **MINUTE TO SECOND**
Closes#26327Closes#26358
### Why are the changes needed?
- Improve user experience with Spark SQL, and respect to the bound specified by users.
- Behave the same as other broadly used DBMS - Oracle and MySQL.
### Does this PR introduce any user-facing change?
Yes, before:
```sql
spark-sql> SELECT INTERVAL '10 11:12:13.123' HOUR TO MINUTE;
interval 1 weeks 3 days 11 hours 12 minutes
```
After:
```sql
spark-sql> SELECT INTERVAL '10 11:12:13.123' HOUR TO MINUTE;
Error in query:
requirement failed: Interval string must match day-time format of '^(?<sign>[+|-])?(?<hour>\d{1,2}):(?<minute>\d{1,2})$': 10 11:12:13.123(line 1, pos 16)
== SQL ==
SELECT INTERVAL '10 11:12:13.123' HOUR TO MINUTE
----------------^^^
```
### How was this patch tested?
- Added tests to `IntervalUtilsSuite`
- By `ExpressionParserSuite`
- Updated `literals.sql`
Closes#26473 from MaxGekk/strict-from-daytime-string.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This pr is a follow-up of #26829 to fix typos in ExplainMode.
### Why are the changes needed?
For better docs.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
N/A
Closes#26851 from maropu/SPARK-30200-FOLLOWUP.
Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
`global_temp` is used as a database name to access global temp views. The current catalog lookup logic considers only the first element of multi-part name when it resolves a catalog. This results in using the session catalog even `global_temp` is used as a table name under v2 catalog. This PR addresses this by making sure multi-part name has two elements before using the session catalog.
### Why are the changes needed?
Currently, 'global_temp' can be used as a table name in certain commands (CREATE) but not in others (DESCRIBE):
```
// Assume "spark.sql.globalTempDatabase" is set to "global_temp".
sql(s"CREATE TABLE testcat.t (id bigint, data string) USING foo")
sql(s"CREATE TABLE testcat.global_temp (id bigint, data string) USING foo")
sql("USE testcat")
sql(s"DESCRIBE TABLE t").show
+---------------+---------+-------+
| col_name|data_type|comment|
+---------------+---------+-------+
| id| bigint| |
| data| string| |
| | | |
| # Partitioning| | |
|Not partitioned| | |
+---------------+---------+-------+
sql(s"DESCRIBE TABLE global_temp").show
org.apache.spark.sql.AnalysisException: Table not found: global_temp;;
'DescribeTable 'UnresolvedV2Relation [global_temp], org.apache.spark.sql.connector.InMemoryTableSessionCatalog2f1af64f, `global_temp`, false
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis(CheckAnalysis.scala:47)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis.failAnalysis$(CheckAnalysis.scala:46)
at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:122)
```
### Does this PR introduce any user-facing change?
Yes, `sql(s"DESCRIBE TABLE global_temp").show` in the above example now displays:
```
+---------------+---------+-------+
| col_name|data_type|comment|
+---------------+---------+-------+
| id| bigint| |
| data| string| |
| | | |
| # Partitioning| | |
|Not partitioned| | |
+---------------+---------+-------+
```
instead of throwing an exception.
### How was this patch tested?
Added new tests.
Closes#26741 from imback82/global_temp.
Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Another continuation of https://github.com/apache/spark/pull/26748
### Why are the changes needed?
To cleanly cross compile with Scala 2.13.
### Does this PR introduce any user-facing change?
None.
### How was this patch tested?
Existing tests
Closes#26842 from srowen/SPARK-29392.4.
Authored-by: Sean Owen <srowen@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
The types decimal and numeric are equivalent. Both types are part of the SQL standard.
the real type is 4 bytes, variable-precision, inexact, 6 decimal digits precision, same as our float, part of the SQL standard.
### Why are the changes needed?
improve sql standard support
other dbs
https://www.postgresql.org/docs/9.3/datatype-numeric.htmlhttps://prestodb.io/docs/current/language/types.html#floating-pointhttp://www.sqlservertutorial.net/sql-server-basics/sql-server-data-types/
MySQL treats REAL as a synonym for DOUBLE PRECISION (a nonstandard variation), unless the REAL_AS_FLOAT SQL mode is enabled.
In MySQL, NUMERIC is implemented as DECIMAL, so the following remarks about DECIMAL apply equally to NUMERIC.
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
add ut
Closes#26537 from yaooqinn/SPARK-29587.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Minor change, rm `DELETE FROM` from unsupported hive native operation, because it is supported in parser.
### Why are the changes needed?
clear ambiguous ambiguous
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
no
Closes#26836 from yaooqinn/SPARK-28351.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This pr intends to add `ExplainMode` for explaining `Dataset/DataFrame` with a given format mode (`ExplainMode`). `ExplainMode` has four types along with the SQL EXPLAIN command: `Simple`, `Extended`, `Codegen`, `Cost`, and `Formatted`.
For example, this pr enables users to explain DataFrame/Dataset with the `FORMATTED` format implemented in #24759;
```
scala> spark.range(10).groupBy("id").count().explain(ExplainMode.Formatted)
== Physical Plan ==
* HashAggregate (3)
+- * HashAggregate (2)
+- * Range (1)
(1) Range [codegen id : 1]
Output: [id#0L]
(2) HashAggregate [codegen id : 1]
Input: [id#0L]
(3) HashAggregate [codegen id : 1]
Input: [id#0L, count#8L]
```
This comes from [the cloud-fan suggestion.](https://github.com/apache/spark/pull/24759#issuecomment-560211270)
### Why are the changes needed?
To follow the SQL EXPLAIN command.
### Does this PR introduce any user-facing change?
No, this is just for a new API in Dataset.
### How was this patch tested?
Add tests in `ExplainSuite`.
Closes#26829 from maropu/DatasetExplain.
Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Reprocess all PostgreSQL dialect related PRs, listing in order:
- #25158: PostgreSQL integral division support [revert]
- #25170: UT changes for the integral division support [revert]
- #25458: Accept "true", "yes", "1", "false", "no", "0", and unique prefixes as input and trim input for the boolean data type. [revert]
- #25697: Combine below 2 feature tags into "spark.sql.dialect" [revert]
- #26112: Date substraction support [keep the ANSI-compliant part]
- #26444: Rename config "spark.sql.ansi.enabled" to "spark.sql.dialect.spark.ansi.enabled" [revert]
- #26463: Cast to boolean support for PostgreSQL dialect [revert]
- #26584: Make the behavior of Postgre dialect independent of ansi mode config [keep the ANSI-compliant part]
### Why are the changes needed?
As the discussion in http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-PostgreSQL-dialect-td28417.html, we need to remove PostgreSQL dialect form code base for several reasons:
1. The current approach makes the codebase complicated and hard to maintain.
2. Fully migrating PostgreSQL workloads to Spark SQL is not our focus for now.
### Does this PR introduce any user-facing change?
Yes, the config `spark.sql.dialect` will be removed.
### How was this patch tested?
Existing UT.
Closes#26763 from xuanyuanking/SPARK-30125.
Lead-authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Co-authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR renames `normalizeFilters` in `DataSourceStrategy` to be more generic as the logic is not specific to filters.
### Why are the changes needed?
These changes are needed to support PR #26751.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes#26830 from aokolnychyi/rename-normalize-exprs.
Authored-by: Anton Okolnychyi <aokolnychyi@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Issue better error message when user-specified schema and not match relation schema
### Why are the changes needed?
Inspired by https://github.com/apache/spark/pull/25248#issuecomment-559594305, user could get a weird error message when type mapping behavior change between Spark schema and datasource schema(e.g. JDBC). Instead of saying "SomeProvider does not allow user-specified schemas.", we'd better tell user what is really happening here to make user be more clearly about the error.
### Does this PR introduce any user-facing change?
Yes, user will see error message changes.
### How was this patch tested?
Updated existed tests.
Closes#26781 from Ngone51/dev-mismatch-schema.
Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
improve the temporary functions test in SingleSessionSuite by verifying the result in a query
### Why are the changes needed?
### Does this PR introduce any user-facing change?
### How was this patch tested?
Closes#26812 from leoluan2009/SPARK-30179.
Authored-by: Luan <xuluan@ebay.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Use Seq instead of Array in sc.parallelize, with reference types.
Remove usage of WrappedArray.
### Why are the changes needed?
These both enable building on Scala 2.13.
### Does this PR introduce any user-facing change?
None
### How was this patch tested?
Existing tests
Closes#26787 from srowen/SPARK-30158.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
This patch fixes the Java code style violations in SPARK-30159 (#26788) which are caught by lint-java (Github Action caught it and I can reproduce it locally). Looks like Jenkins build may have different policy on checking Java style check or less accurate.
### Why are the changes needed?
Java linter starts complaining.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
lint-java passed locally
This closes#26819Closes#26818 from HeartSaVioR/SPARK-30159-FOLLOWUP.
Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Before this PR, the method `checkAnswer` in Object `QueryTest` returns an optional string. It doesn't throw exceptions when errors happen.
The actual exceptions are thrown in the trait `QueryTest`.
However, there are some test suites(`StreamSuite`, `SessionStateSuite`, `BinaryFileFormatSuite`, etc.) that use the no-op method `QueryTest.checkAnswer` and expect it to fail test cases when the execution results don't match the expected answers.
After this PR:
1. the method `checkAnswer` in Object `QueryTest` will fail tests on errors or unexpected results.
2. add a new method `getErrorMessageInCheckAnswer`, which is exactly the same as the previous version of `checkAnswer`. There are some test suites use this one to customize the test failure message.
3. for the test suites that extend the trait `QueryTest`, we should use the method `checkAnswer` directly, instead of calling the method from Object `QueryTest`.
### Why are the changes needed?
We should fix these method calls to perform actual validations in test suites.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Existing unit tests.
Closes#26788 from gengliangwang/fixCheckAnswer.
Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
separate the configuration keys "spark.sql.optimizer.maxIterations" and "spark.sql.analyzer.maxIterations".
### Why are the changes needed?
Currently, both Analyzer and Optimizer use conf "spark.sql.optimizer.maxIterations" to set the max iterations to run, which is a little confusing.
It is clearer to add a new conf "spark.sql.analyzer.maxIterations" for analyzer max iterations.
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
Existing unit tests.
Closes#26766 from fuwhu/SPARK-30138.
Authored-by: fuwhu <bestwwg@163.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
This PR introduces a method `expressionWithAlias` in class `FunctionRegistry` which is used to register function's constructor. Currently, `expressionWithAlias` is used to register `BoolAnd` & `BoolOr`.
### Why are the changes needed?
Error message is wrong when alias name is used for `BoolAnd` & `BoolOr`.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Tested manually.
For query,
`select every('true');`
Output before this PR,
> Error in query: cannot resolve 'bool_and('true')' due to data type mismatch: Input to function 'bool_and' should have been boolean, but it's [string].; line 1 pos 7;
After this PR,
> Error in query: cannot resolve 'every('true')' due to data type mismatch: Input to function 'every' should have been boolean, but it's [string].; line 1 pos 7;
Closes#26712 from amanomer/29883.
Authored-by: Aman Omer <amanomer1996@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Add ShowFunctionsStatement and make SHOW FUNCTIONS go through the same catalog/table resolution framework of v2 commands.
We don’t have this methods in the catalog to implement an V2 command
* catalog.listFunctions
### Why are the changes needed?
It's important to make all the commands have the same table resolution behavior, to avoid confusing
`SHOW FUNCTIONS LIKE namespace.function`
### Does this PR introduce any user-facing change?
Yes. When running SHOW FUNCTIONS LIKE namespace.function Spark fails the command if the current catalog is set to a v2 catalog.
### How was this patch tested?
Unit tests.
Closes#26667 from planga82/feature/SPARK-29922_ShowFunctions_V2Catalog.
Authored-by: Pablo Langa <soypab@gmail.com>
Signed-off-by: Liang-Chi Hsieh <liangchi@uber.com>
### What changes were proposed in this pull request?
Now, we trim the string when casting string value to those `canCast` types values, e.g. int, double, decimal, interval, date, timestamps, except for boolean.
This behavior makes type cast and coercion inconsistency in Spark.
Not fitting ANSI SQL standard either.
```
If TD is boolean, then
Case:
a) If SD is character string, then SV is replaced by
TRIM ( BOTH ' ' FROM VE )
Case:
i) If the rules for literal in Subclause 5.3, “literal”, can be applied to SV to determine a valid
value of the data type TD, then let TV be that value.
ii) Otherwise, an exception condition is raised: data exception — invalid character value for cast.
b) If SD is boolean, then TV is SV
```
In this pull request, we trim all the whitespaces from both ends of the string before converting it to a bool value. This behavior is as same as others, but a bit different from sql standard, which trim only spaces.
### Why are the changes needed?
Type cast/coercion consistency
### Does this PR introduce any user-facing change?
yes, string with whitespaces in both ends will be trimmed before converted to booleans.
e.g. `select cast('\t true' as boolean)` results `true` now, before this pr it's `null`
### How was this patch tested?
add unit tests
Closes#26776 from yaooqinn/SPARK-30147.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
Optimized QueryExecution.scala#writePlans().
### Why are the changes needed?
If any query fails in Analysis phase and gets AnalysisException, there is no need to execute further phases since those will return a same result i.e, AnalysisException.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Manually
Closes#26778 from amanomer/optExplain.
Authored-by: Aman Omer <amanomer1996@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Rename internal method LegacyTypeStringParser.parse() to parseString().
### Why are the changes needed?
In Scala 2.13, the parse() definition clashes with supertype declarations.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Existing tests.
Closes#26784 from srowen/SPARK-30155.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
In this PR, we propose to use the value of `spark.sql.source.default` as the provider for `CREATE TABLE` syntax instead of `hive` in Spark 3.0.
And to help the migration, we introduce a legacy conf `spark.sql.legacy.respectHiveDefaultProvider.enabled` and set its default to `false`.
### Why are the changes needed?
1. Currently, `CREATE TABLE` syntax use hive provider to create table while `DataFrameWriter.saveAsTable` API using the value of `spark.sql.source.default` as a provider to create table. It would be better to make them consistent.
2. User may gets confused in some cases. For example:
```
CREATE TABLE t1 (c1 INT) USING PARQUET;
CREATE TABLE t2 (c1 INT);
```
In these two DDLs, use may think that `t2` should also use parquet as default provider since Spark always advertise parquet as the default format. However, it's hive in this case.
On the other hand, if we omit the USING clause in a CTAS statement, we do pick parquet by default if `spark.sql.hive.convertCATS=true`:
```
CREATE TABLE t3 USING PARQUET AS SELECT 1 AS VALUE;
CREATE TABLE t4 AS SELECT 1 AS VALUE;
```
And these two cases together can be really confusing.
3. Now, Spark SQL is very independent and popular. We do not need to be fully consistent with Hive's behavior.
### Does this PR introduce any user-facing change?
Yes, before this PR, using `CREATE TABLE` syntax will use hive provider. But now, it use the value of `spark.sql.source.default` as its provider.
### How was this patch tested?
Added tests in `DDLParserSuite` and `HiveDDlSuite`.
Closes#26736 from Ngone51/dev-create-table-using-parquet-by-default.
Lead-authored-by: wuyi <yi.wu@databricks.com>
Co-authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This patch proposes to allow insert overwrite same table if using dynamic partition overwrite.
### Why are the changes needed?
Currently, Insert overwrite cannot overwrite to same table even it is dynamic partition overwrite. But for dynamic partition overwrite, we do not delete partition directories ahead. We write to staging directories and move data to final partition directories. We should be able to insert overwrite to same table under dynamic partition overwrite.
This enables users to read data from a table and insert overwrite to same table by using dynamic partition overwrite. Because this is not allowed for now, users need to write to other temporary location and move it back to the table.
### Does this PR introduce any user-facing change?
Yes. Users can insert overwrite same table if using dynamic partition overwrite.
### How was this patch tested?
Unit test.
Closes#26752 from viirya/dynamic-overwrite-same-table.
Lead-authored-by: Liang-Chi Hsieh <liangchi@uber.com>
Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Classes like DoubleAsIfIntegral are not found in Scala 2.13, but used in the current build. This change 'inlines' the 2.12 implementation and makes it work with both 2.12 and 2.13.
### Why are the changes needed?
To cross-compile with 2.13.
### Does this PR introduce any user-facing change?
It should not as it copies in 2.12's existing behavior.
### How was this patch tested?
Existing tests.
Closes#26769 from srowen/SPARK-30011.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
## What changes were proposed in this pull request?
The syntax 'LIKE predicate: ESCAPE clause' is a ANSI SQL.
For example:
```
select 'abcSpark_13sd' LIKE '%Spark\\_%'; //true
select 'abcSpark_13sd' LIKE '%Spark/_%'; //false
select 'abcSpark_13sd' LIKE '%Spark"_%'; //false
select 'abcSpark_13sd' LIKE '%Spark/_%' ESCAPE '/'; //true
select 'abcSpark_13sd' LIKE '%Spark"_%' ESCAPE '"'; //true
select 'abcSpark%13sd' LIKE '%Spark\\%%'; //true
select 'abcSpark%13sd' LIKE '%Spark/%%'; //false
select 'abcSpark%13sd' LIKE '%Spark"%%'; //false
select 'abcSpark%13sd' LIKE '%Spark/%%' ESCAPE '/'; //true
select 'abcSpark%13sd' LIKE '%Spark"%%' ESCAPE '"'; //true
select 'abcSpark\\13sd' LIKE '%Spark\\\\_%'; //true
select 'abcSpark/13sd' LIKE '%Spark//_%'; //false
select 'abcSpark"13sd' LIKE '%Spark""_%'; //false
select 'abcSpark/13sd' LIKE '%Spark//_%' ESCAPE '/'; //true
select 'abcSpark"13sd' LIKE '%Spark""_%' ESCAPE '"'; //true
```
But Spark SQL only supports 'LIKE predicate'.
Note: If the input string or pattern string is null, then the result is null too.
There are some mainstream database support the syntax.
**PostgreSQL:**
https://www.postgresql.org/docs/11/functions-matching.html
**Vertica:**
https://www.vertica.com/docs/9.2.x/HTML/Content/Authoring/SQLReferenceManual/LanguageElements/Predicates/LIKE-predicate.htm?zoom_highlight=like%20escape
**MySQL:**
https://dev.mysql.com/doc/refman/5.6/en/string-comparison-functions.html
**Oracle:**
https://docs.oracle.com/en/database/oracle/oracle-database/19/jjdbc/JDBC-reference-information.html#GUID-5D371A5B-D7F6-42EB-8C0D-D317F3C53708https://docs.oracle.com/en/database/oracle/oracle-database/19/sqlrf/Pattern-matching-Conditions.html#GUID-0779657B-06A8-441F-90C5-044B47862A0A
## How was this patch tested?
Exists UT and new UT.
This PR merged to my production environment and runs above sql:
```
spark-sql> select 'abcSpark_13sd' LIKE '%Spark\\_%';
true
Time taken: 0.119 seconds, Fetched 1 row(s)
spark-sql> select 'abcSpark_13sd' LIKE '%Spark/_%';
false
Time taken: 0.103 seconds, Fetched 1 row(s)
spark-sql> select 'abcSpark_13sd' LIKE '%Spark"_%';
false
Time taken: 0.096 seconds, Fetched 1 row(s)
spark-sql> select 'abcSpark_13sd' LIKE '%Spark/_%' ESCAPE '/';
true
Time taken: 0.096 seconds, Fetched 1 row(s)
spark-sql> select 'abcSpark_13sd' LIKE '%Spark"_%' ESCAPE '"';
true
Time taken: 0.092 seconds, Fetched 1 row(s)
spark-sql> select 'abcSpark%13sd' LIKE '%Spark\\%%';
true
Time taken: 0.109 seconds, Fetched 1 row(s)
spark-sql> select 'abcSpark%13sd' LIKE '%Spark/%%';
false
Time taken: 0.1 seconds, Fetched 1 row(s)
spark-sql> select 'abcSpark%13sd' LIKE '%Spark"%%';
false
Time taken: 0.081 seconds, Fetched 1 row(s)
spark-sql> select 'abcSpark%13sd' LIKE '%Spark/%%' ESCAPE '/';
true
Time taken: 0.095 seconds, Fetched 1 row(s)
spark-sql> select 'abcSpark%13sd' LIKE '%Spark"%%' ESCAPE '"';
true
Time taken: 0.113 seconds, Fetched 1 row(s)
spark-sql> select 'abcSpark\\13sd' LIKE '%Spark\\\\_%';
true
Time taken: 0.078 seconds, Fetched 1 row(s)
spark-sql> select 'abcSpark/13sd' LIKE '%Spark//_%';
false
Time taken: 0.067 seconds, Fetched 1 row(s)
spark-sql> select 'abcSpark"13sd' LIKE '%Spark""_%';
false
Time taken: 0.084 seconds, Fetched 1 row(s)
spark-sql> select 'abcSpark/13sd' LIKE '%Spark//_%' ESCAPE '/';
true
Time taken: 0.091 seconds, Fetched 1 row(s)
spark-sql> select 'abcSpark"13sd' LIKE '%Spark""_%' ESCAPE '"';
true
Time taken: 0.091 seconds, Fetched 1 row(s)
```
I create a table and its schema is:
```
spark-sql> desc formatted gja_test;
key string NULL
value string NULL
other string NULL
# Detailed Table Information
Database test
Table gja_test
Owner test
Created Time Wed Apr 10 11:06:15 CST 2019
Last Access Thu Jan 01 08:00:00 CST 1970
Created By Spark 2.4.1-SNAPSHOT
Type MANAGED
Provider hive
Table Properties [transient_lastDdlTime=1563443838]
Statistics 26 bytes
Location hdfs://namenode.xxx:9000/home/test/hive/warehouse/test.db/gja_test
Serde Library org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
InputFormat org.apache.hadoop.mapred.TextInputFormat
OutputFormat org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
Storage Properties [field.delim= , serialization.format= ]
Partition Provider Catalog
Time taken: 0.642 seconds, Fetched 21 row(s)
```
Table `gja_test` exists three rows of data.
```
spark-sql> select * from gja_test;
a A ao
b B bo
"__ """__ "
Time taken: 0.665 seconds, Fetched 3 row(s)
```
At finally, I test this function:
```
spark-sql> select * from gja_test where key like value escape '"';
"__ """__ "
Time taken: 0.687 seconds, Fetched 1 row(s)
```
Closes#25001 from beliefer/ansi-sql-like.
Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
### What changes were proposed in this pull request?
This PR makes `Analyzer.ResolveRelations` responsible for looking up both v1 and v2 tables from the session catalog and create an appropriate relation.
### Why are the changes needed?
Currently there are two issues:
1. As described in [SPARK-29966](https://issues.apache.org/jira/browse/SPARK-29966), the logic for resolving relation can load a table twice, which is a perf regression (e.g., Hive metastore can be accessed twice).
2. As described in [SPARK-30001](https://issues.apache.org/jira/browse/SPARK-30001), if a catalog name is specified for v1 tables, the query fails:
```
scala> sql("create table t using csv as select 1 as i")
res2: org.apache.spark.sql.DataFrame = []
scala> sql("select * from t").show
+---+
| i|
+---+
| 1|
+---+
scala> sql("select * from spark_catalog.t").show
org.apache.spark.sql.AnalysisException: Table or view not found: spark_catalog.t; line 1 pos 14;
'Project [*]
+- 'UnresolvedRelation [spark_catalog, t]
```
### Does this PR introduce any user-facing change?
Yes. Now the catalog name is resolved correctly:
```
scala> sql("create table t using csv as select 1 as i")
res0: org.apache.spark.sql.DataFrame = []
scala> sql("select * from t").show
+---+
| i|
+---+
| 1|
+---+
scala> sql("select * from spark_catalog.t").show
+---+
| i|
+---+
| 1|
+---+
```
### How was this patch tested?
Added new tests.
Closes#26684 from imback82/resolve_relation.
Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
A bug fixed about the code in getBlockHosts() function. In the case "The fragment ends at a position within this block", the end of fragment should be before the end of block,where the "end of block" means `b.getOffset + b.getLength`,not `b.getLength`.
### Why are the changes needed?
When comparing the fragment end and the block end,we should use fragment's `offset + length`,and then compare to the block's `b.getOffset + b.getLength`, not the block's length.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
No test.
Closes#26650 from mdianjun/fix-getBlockHosts.
Authored-by: madianjun <madianjun@jd.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This patch prevents the cleanup operation in FileStreamSource if the source files belong to the FileStreamSink. This is needed because the output of FileStreamSink can be read with multiple Spark queries and queries will read the files based on the metadata log, which won't reflect the cleanup.
To simplify the logic, the patch only takes care of the case of when the source path without glob pattern refers to the output directory of FileStreamSink, via checking FileStreamSource to see whether it leverages metadata directory or not to list the source files.
### Why are the changes needed?
Without this patch, if end users turn on cleanup option with the path which is the output of FileStreamSink, there may be out of sync between metadata and available files which may break other queries reading the path.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Added UT.
Closes#26590 from HeartSaVioR/SPARK-29953.
Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Shixiong Zhu <zsxwing@gmail.com>
TL;DR - this is more of the same change in https://github.com/apache/spark/pull/26748
I told you it'd be iterative!
Closes#26765 from srowen/SPARK-29392.3.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
# What changes were proposed in this pull request?
Add an analyzer rule to convert unresolved `Add`, `Subtract`, etc. to `TimeAdd`, `DateAdd`, etc. according to the following policy:
```scala
/**
* For [[Add]]:
* 1. if both side are interval, stays the same;
* 2. else if one side is interval, turns it to [[TimeAdd]];
* 3. else if one side is date, turns it to [[DateAdd]] ;
* 4. else stays the same.
*
* For [[Subtract]]:
* 1. if both side are interval, stays the same;
* 2. else if the right side is an interval, turns it to [[TimeSub]];
* 3. else if one side is timestamp, turns it to [[SubtractTimestamps]];
* 4. else if the right side is date, turns it to [[DateDiff]]/[[SubtractDates]];
* 5. else if the left side is date, turns it to [[DateSub]];
* 6. else turns it to stays the same.
*
* For [[Multiply]]:
* 1. If one side is interval, turns it to [[MultiplyInterval]];
* 2. otherwise, stays the same.
*
* For [[Divide]]:
* 1. If the left side is interval, turns it to [[DivideInterval]];
* 2. otherwise, stays the same.
*/
```
Besides, we change datetime functions from implicit cast types to strict ones, all available type coercions happen in `DateTimeOperations` coercion rule.
### Why are the changes needed?
Feature Parity between PostgreSQL and Spark, and make the null semantic consistent with Spark.
### Does this PR introduce any user-facing change?
1. date_add/date_sub functions only accept int/tinynit/smallint as the second arg, double/string etc, are forbidden like hive, which produce weird results.
### How was this patch tested?
add ut
Closes#26412 from yaooqinn/SPARK-29774.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Keep the owner of a database when executing alter database commands
### Why are the changes needed?
Spark will inadvertently delete the owner of a database for executing databases ddls
### Does this PR introduce any user-facing change?
NO
### How was this patch tested?
add and modify uts
Closes#26080 from yaooqinn/SPARK-29425.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
There is an issue for InSubquery expression.
For example, there are two tables `ta` and `tb` created by the below statements.
```
sql("create table ta(id Decimal(18,0)) using parquet")
sql("create table tb(id Decimal(19,0)) using parquet")
```
This statement below would thrown dataType mismatch exception.
```
sql("select * from ta where id in (select id from tb)").show()
```
However, this similar statement could execute successfully.
```
sql("select * from ta where id in ((select id from tb))").show()
```
The root cause is that, for `InSubquery` expression, it does not find a common type for two decimalType like `In` expression.
Besides that, for `InSubquery` expression, it also does not find a common type for DecimalType and double/float/bigInt.
In this PR, I fix this issue by finding widerType for `InSubquery` expression when DecimalType is involved.
### Why are the changes needed?
Some InSubquery would throw dataType mismatch exception.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Unit test.
Closes#26485 from turboFei/SPARK-29860-in-subquery.
Authored-by: turbofei <fwang12@ebay.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Improved error message while creating views.
### Why are the changes needed?
Error message should suggest user to use TEMPORARY keyword while creating permanent view referred by temporary view.
https://github.com/apache/spark/pull/26317#discussion_r352377363
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Updated test case.
Closes#26731 from amanomer/imp_err_msg.
Authored-by: Aman Omer <amanomer1996@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Follow up on https://github.com/apache/spark/pull/26654#discussion_r353826162
Instead of OrderingUtil or Utils.nanSafeCompare{Doubles,Floats}, just use java.lang.{Double,Float}.compare directly. All work identically w.r.t. NaN when used to `compare`.
### Why are the changes needed?
Simplification of the previous change, which existed to support Scala 2.13 migration.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Existing tests
Closes#26761 from srowen/SPARK-30009.2.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Where it generates a deprecation warning in Scala 2.13, replace Symbol shorthand syntax `'foo` with an equivalent.
### Why are the changes needed?
Symbol syntax `'foo` is deprecated in Scala 2.13. The lines changed below otherwise generate about 440 warnings when building for 2.13.
The previous PR directly replaced many usages with `Symbol("foo")`. But it's also used to specify Columns via implicit conversion (`.select('foo)`) or even where simple Strings are used (`.as('foo)`), as it's kind of an abstraction for interned Strings.
While I find this syntax confusing and would like to deprecate it, here I just replaced it where it generates a build warning (not sure why all occurrences don't): `$"foo"` or just `"foo"`.
### Does this PR introduce any user-facing change?
Should not change behavior.
### How was this patch tested?
Existing tests.
Closes#26748 from srowen/SPARK-29392.2.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Adding tooltip to SQL tab for better usability.
### Why are the changes needed?
There are a few common points of confusion in the UI that could be clarified with tooltips. We
should add tooltips to explain.
### Does this PR introduce any user-facing change?
yes.
![Screenshot 2019-11-23 at 9 47 41 AM](https://user-images.githubusercontent.com/8948111/69472963-aaec5980-0dd6-11ea-881a-fe6266171054.png)
### How was this patch tested?
Manual test.
Closes#26641 from 07ARB/SPARK-29453.
Authored-by: 07ARB <ankitrajboudh@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request?
Avoid duplicate error message in Analyzed Logical plan.
### Why are the changes needed?
Currently, when any query throws `AnalysisException`, same error message will be repeated because of following code segment.
04a5b8f5f8/sql/core/src/main/scala/org/apache/spark/sql/execution/QueryExecution.scala (L157-L166)
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Manually. Result of `explain extended select * from wrong;`
BEFORE
> == Parsed Logical Plan ==
> 'Project [*]
> +- 'UnresolvedRelation [wrong]
>
> == Analyzed Logical Plan ==
> org.apache.spark.sql.AnalysisException: Table or view not found: wrong; line 1 pos 31;
> 'Project [*]
> +- 'UnresolvedRelation [wrong]
>
> org.apache.spark.sql.AnalysisException: Table or view not found: wrong; line 1 pos 31;
> 'Project [*]
> +- 'UnresolvedRelation [wrong]
>
> == Optimized Logical Plan ==
> org.apache.spark.sql.AnalysisException: Table or view not found: wrong; line 1 pos 31;
> 'Project [*]
> +- 'UnresolvedRelation [wrong]
>
> == Physical Plan ==
> org.apache.spark.sql.AnalysisException: Table or view not found: wrong; line 1 pos 31;
> 'Project [*]
> +- 'UnresolvedRelation [wrong]
>
AFTER
> == Parsed Logical Plan ==
> 'Project [*]
> +- 'UnresolvedRelation [wrong]
>
> == Analyzed Logical Plan ==
> org.apache.spark.sql.AnalysisException: Table or view not found: wrong; line 1 pos 31;
> 'Project [*]
> +- 'UnresolvedRelation [wrong]
>
> == Optimized Logical Plan ==
> org.apache.spark.sql.AnalysisException: Table or view not found: wrong; line 1 pos 31;
> 'Project [*]
> +- 'UnresolvedRelation [wrong]
>
> == Physical Plan ==
> org.apache.spark.sql.AnalysisException: Table or view not found: wrong; line 1 pos 31;
> 'Project [*]
> +- 'UnresolvedRelation [wrong]
>
Closes#26734 from amanomer/cor_APlan.
Authored-by: Aman Omer <amanomer1996@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Changed the test **DPP triggers only for certain types of query** in **DynamicPartitionPruningSuite**.
### Why are the changes needed?
The sql has no partition key. The description "no predicate on the dimension table" is not right. So fix it.
```
Given("no predicate on the dimension table")
withSQLConf(SQLConf.DYNAMIC_PARTITION_PRUNING_ENABLED.key -> "true") {
val df = sql(
"""
|SELECT * FROM fact_sk f
|JOIN dim_store s
|ON f.date_id = s.store_id
""".stripMargin)
```
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Updated UT
Closes#26744 from deshanxiao/30106.
Authored-by: xiaodeshan <xiaodeshan@xiaomi.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Move some classes extending Scala collections into parallel source trees, to support 2.13; other minor collection-related modifications.
Modify some classes extending Scala collections to work with 2.13 as well as 2.12. In many cases, this means introducing parallel source trees, as the type hierarchy changed in ways that one class can't support both.
### Why are the changes needed?
To support building for Scala 2.13 in the future.
### Does this PR introduce any user-facing change?
There should be no behavior change.
### How was this patch tested?
Existing tests. Note that the 2.13 changes are not tested by the PR builder, of course. They compile in 2.13 but can't even be tested locally. Later, once the project can be compiled for 2.13, thus tested, it's possible the 2.13 implementations will need updates.
Closes#26728 from srowen/SPARK-30012.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Do not cast `NaN` to an `Integer`, `Long`, `Short` or `Byte`. This is because casting `NaN` to those types results in a `0` which erroneously replaces `0`s while only `NaN`s should be replaced.
### Why are the changes needed?
This Scala code snippet:
```
import scala.math;
println(Double.NaN.toLong)
```
returns `0` which is problematic as if you run the following Spark code, `0`s get replaced as well:
```
>>> df = spark.createDataFrame([(1.0, 0), (0.0, 3), (float('nan'), 0)], ("index", "value"))
>>> df.show()
+-----+-----+
|index|value|
+-----+-----+
| 1.0| 0|
| 0.0| 3|
| NaN| 0|
+-----+-----+
>>> df.replace(float('nan'), 2).show()
+-----+-----+
|index|value|
+-----+-----+
| 1.0| 2|
| 0.0| 3|
| 2.0| 2|
+-----+-----+
```
### Does this PR introduce any user-facing change?
Yes, after the PR, running the same above code snippet returns the correct expected results:
```
>>> df = spark.createDataFrame([(1.0, 0), (0.0, 3), (float('nan'), 0)], ("index", "value"))
>>> df.show()
+-----+-----+
|index|value|
+-----+-----+
| 1.0| 0|
| 0.0| 3|
| NaN| 0|
+-----+-----+
>>> df.replace(float('nan'), 2).show()
+-----+-----+
|index|value|
+-----+-----+
| 1.0| 0|
| 0.0| 3|
| 2.0| 0|
+-----+-----+
```
### How was this patch tested?
Added unit tests to verify replacing `NaN` only affects columns of type `Float` and `Double`
Closes#26738 from johnhany97/SPARK-30082.
Lead-authored-by: John Ayad <johnhany97@gmail.com>
Co-authored-by: John Ayad <jayad@palantir.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
`UnaryPositive` only accepts numeric and interval as we defined, but what we do for this in `AstBuider.visitArithmeticUnary` is just bypassing it.
This should not be omitted for the type checking requirement.
### Why are the changes needed?
bug fix, you can find a pre-discussion here https://github.com/apache/spark/pull/26578#discussion_r347350398
### Does this PR introduce any user-facing change?
yes, +non-numeric-or-interval is now invalid.
```
-- !query 14
select +date '1900-01-01'
-- !query 14 schema
struct<DATE '1900-01-01':date>
-- !query 14 output
1900-01-01
-- !query 15
select +timestamp '1900-01-01'
-- !query 15 schema
struct<TIMESTAMP '1900-01-01 00:00:00':timestamp>
-- !query 15 output
1900-01-01 00:00:00
-- !query 16
select +map(1, 2)
-- !query 16 schema
struct<map(1, 2):map<int,int>>
-- !query 16 output
{1:2}
-- !query 17
select +array(1,2)
-- !query 17 schema
struct<array(1, 2):array<int>>
-- !query 17 output
[1,2]
-- !query 18
select -'1'
-- !query 18 schema
struct<(- CAST(1 AS DOUBLE)):double>
-- !query 18 output
-1.0
-- !query 19
select -X'1'
-- !query 19 schema
struct<>
-- !query 19 output
org.apache.spark.sql.AnalysisException
cannot resolve '(- X'01')' due to data type mismatch: argument 1 requires (numeric or interval) type, however, 'X'01'' is of binary type.; line 1 pos 7
-- !query 20
select +X'1'
-- !query 20 schema
struct<X'01':binary>
-- !query 20 output
```
### How was this patch tested?
add ut check
Closes#26716 from yaooqinn/SPARK-30083.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Now the min/max/sum/avg are support for intervals, we should also enable it in RelationalGroupedDataset
### Why are the changes needed?
API consistency improvement
### Does this PR introduce any user-facing change?
yes, Dataset support min/max/sum/avg(mean) on intervals
### How was this patch tested?
add ut
Closes#26681 from yaooqinn/SPARK-30048.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Observable metrics are named arbitrary aggregate functions that can be defined on a query (Dataframe). As soon as the execution of a Dataframe reaches a completion point (e.g. finishes batch query or reaches streaming epoch) a named event is emitted that contains the metrics for the data processed since the last completion point.
A user can observe these metrics by attaching a listener to spark session, it depends on the execution mode which listener to attach:
- Batch: `QueryExecutionListener`. This will be called when the query completes. A user can access the metrics by using the `QueryExecution.observedMetrics` map.
- (Micro-batch) Streaming: `StreamingQueryListener`. This will be called when the streaming query completes an epoch. A user can access the metrics by using the `StreamingQueryProgress.observedMetrics` map. Please note that we currently do not support continuous execution streaming.
### Why are the changes needed?
This enabled observable metrics.
### Does this PR introduce any user-facing change?
Yes. It adds the `observe` method to `Dataset`.
### How was this patch tested?
- Added unit tests for the `CollectMetrics` logical node to the `AnalysisSuite`.
- Added unit tests for `StreamingProgress` JSON serialization to the `StreamingQueryStatusAndProgressSuite`.
- Added integration tests for streaming to the `StreamingQueryListenerSuite`.
- Added integration tests for batch to the `DataFrameCallbackSuite`.
Closes#26127 from hvanhovell/SPARK-29348.
Authored-by: herman <herman@databricks.com>
Signed-off-by: herman <herman@databricks.com>
### What changes were proposed in this pull request?
When user defined a base path which is not an ancestor directory for all the input paths,
throw exception immediately.
### Why are the changes needed?
Assuming that we have a DataFrame[c1, c2] be written out in parquet and partitioned by c1.
When using `spark.read.parquet("/path/to/data/c1=1")` to read the data, we'll have a DataFrame with column c2 only.
But if we use `spark.read.option("basePath", "/path/from").parquet("/path/to/data/c1=1")` to
read the data, we'll have a DataFrame with column c1 and c2.
This's happens because a wrong base path does not actually work in `parsePartition()`, so paring would continue until it reaches a directory without "=".
And I think the result of the second read way doesn't make sense.
### Does this PR introduce any user-facing change?
Yes, with this change, user would hit `IllegalArgumentException ` when given a wrong base path while previous behavior doesn't.
### How was this patch tested?
Added UT.
Closes#26195 from Ngone51/dev-wrong-basePath.
Lead-authored-by: wuyi <ngone_5451@163.com>
Co-authored-by: wuyi <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Improve error messages for unsupported data type.
### Why are the changes needed?
When the spark reads the hive table and encounters an unsupported field type, the exception message has only one unsupported type, and the user cannot know which field of which table.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
```create view t AS SELECT STRUCT('a' AS `$a`, 1 AS b) as q;```
current:
org.apache.spark.SparkException: Cannot recognize hive type string: struct<$a:string,b:int>
change:
org.apache.spark.SparkException: Cannot recognize hive type string: struct<$a:string,b:int>, column: q
```select * from t,t_normal_1,t_normal_2```
current:
org.apache.spark.SparkException: Cannot recognize hive type string: struct<$a:string,b:int>
change:
org.apache.spark.SparkException: Cannot recognize hive type string: struct<$a:string,b:int>, column: q, db: default, table: t
Closes#26577 from cxzl25/unsupport_data_type_msg.
Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR changes subquery planning by calling the planner and plan preparation rules on the subquery plan directly. Before we were creating a `QueryExecution` instance for subqueries to get the executedPlan. This would re-run analysis and optimization on the subqueries plan. Running the analysis again on an optimized query plan can have unwanted consequences, as some rules, for example `DecimalPrecision`, are not idempotent.
As an example, consider the expression `1.7 * avg(a)` which after applying the `DecimalPrecision` rule becomes:
```
promote_precision(1.7) * promote_precision(avg(a))
```
After the optimization, more specifically the constant folding rule, this expression becomes:
```
1.7 * promote_precision(avg(a))
```
Now if we run the analyzer on this optimized query again, we will get:
```
promote_precision(1.7) * promote_precision(promote_precision(avg(a)))
```
Which will later optimized as:
```
1.7 * promote_precision(promote_precision(avg(a)))
```
As can be seen, re-running the analysis and optimization on this expression results in an expression with extra nested promote_preceision nodes. Adding unneeded nodes to the plan is problematic because it can eliminate situations where we can reuse the plan.
We opted to introduce dedicated planners for subuqueries, instead of making the DecimalPrecision rule idempotent, because this eliminates this entire category of problems. Another benefit is that planning time for subqueries is reduced.
### How was this patch tested?
Unit tests
Closes#26705 from dbaliafroozeh/CreateDedicatedPlannerForSubqueries.
Authored-by: Ali Afroozeh <ali.afroozeh@databricks.com>
Signed-off-by: herman <herman@databricks.com>
### What changes were proposed in this pull request?
The patch adds scaladoc on `HDFSMetadataLog.serialize` and `HDFSMetadataLog.deserialize` for adding implementation note when overriding - HDFSMetadataLog calls `serialize` and `deserialize` inside try-finally and caller will do the resource (input stream, output stream) cleanup, so resource cleanup should not be performed in these methods, but there's no note on this (only code comment, not scaladoc) which is easy to be missed.
### Why are the changes needed?
Contributors who are unfamiliar with the intention seem to think it as a bug if the resource is not cleaned up in serialize/deserialize of subclass of HDFSMetadataLog, and they couldn't know about the intention without reading the code of HDFSMetadataLog. Adding the note as scaladoc would expand the visibility.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Just a doc change.
Closes#26732 from HeartSaVioR/MINOR-SS-HDFSMetadataLog-serde-scaladoc.
Lead-authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Co-authored-by: dz <953396112@qq.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request?
add `.enabled` postfix to `spark.sql.analyzer.failAmbiguousSelfJoin`.
### Why are the changes needed?
to follow the existing naming style
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
not needed
Closes#26694 from cloud-fan/conf.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Optimize aggregates on interval values from sort-based to hash-based, and we can use the `org.apache.spark.sql.catalyst.expressions.RowBasedKeyValueBatch` for better performance.
### Why are the changes needed?
improve aggerates
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
add ut and existing ones
Closes#26680 from yaooqinn/SPARK-30047.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
In SPARK-29421 (#26097) , we can specify a different table provider for `CREATE TABLE LIKE` via `USING provider`.
Hive support `STORED AS` new file format syntax:
```sql
CREATE TABLE tbl(a int) STORED AS TEXTFILE;
CREATE TABLE tbl2 LIKE tbl STORED AS PARQUET;
```
For Hive compatibility, we should also support `STORED AS` in `CREATE TABLE LIKE`.
### Why are the changes needed?
See https://github.com/apache/spark/pull/26097#issue-327424759
### Does this PR introduce any user-facing change?
Add a new syntax based on current CTL:
CREATE TABLE tbl2 LIKE tbl [STORED AS hiveFormat];
### How was this patch tested?
Add UTs.
Closes#26466 from LantaoJin/SPARK-29839.
Authored-by: LantaoJin <jinlantao@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Disable continuous shuffle block fetching when the old fetch protocol in use.
### Why are the changes needed?
The new feature of continuous shuffle block fetching depends on the latest version of the shuffle fetch protocol. We should keep this constraint in `BlockStoreShuffleReader.fetchContinuousBlocksInBatch`.
### Does this PR introduce any user-facing change?
Users will not get the exception related to continuous shuffle block fetching when old version of the external shuffle service is used.
### How was this patch tested?
Existing UT.
Closes#26663 from xuanyuanking/SPARK-30025.
Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This patch adds Hive provider into table metadata in `HiveExternalCatalog.alterTableStats`. When we call `HiveClient.alterTable`, `alterTable` will erase if it can not find hive provider in given table metadata.
Rename table also has this issue.
### Why are the changes needed?
Because running `ANALYZE TABLE` on a Hive table, if the table has bucketing info, will erase existing bucket info.
### Does this PR introduce any user-facing change?
Yes. After this PR, running `ANALYZE TABLE` on Hive table, won't erase existing bucketing info.
### How was this patch tested?
Unit test.
Closes#26685 from viirya/fix-hive-bucket.
Lead-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Co-authored-by: Liang-Chi Hsieh <liangchi@uber.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR proposes to use foreach instead of misusing map as a small followup of #26476. This could cause some weird errors potentially and it's not a good practice anyway. See also SPARK-16694
### Why are the changes needed?
To avoid potential issues like SPARK-16694
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Existing tests should cover.
Closes#26729 from HyukjinKwon/SPARK-29851.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
1. Make maxNumPostShufflePartitions config obey reducePostShfflePartitions config.
2. Update the description for all the SQLConf affected by `spark.sql.adaptive.enabled`.
### Why are the changes needed?
Make the relation between these confs clearer.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Existing UT.
Closes#26664 from xuanyuanking/SPARK-9853-follow.
Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
`DataFrameNaFunctions.drop` doesn't handle duplicate columns even when column names are not specified.
```Scala
val left = Seq(("1", null), ("3", "4")).toDF("col1", "col2")
val right = Seq(("1", "2"), ("3", null)).toDF("col1", "col2")
val df = left.join(right, Seq("col1"))
df.printSchema
df.na.drop("any").show
```
produces
```
root
|-- col1: string (nullable = true)
|-- col2: string (nullable = true)
|-- col2: string (nullable = true)
org.apache.spark.sql.AnalysisException: Reference 'col2' is ambiguous, could be: col2, col2.;
at org.apache.spark.sql.catalyst.expressions.package$AttributeSeq.resolve(package.scala:240)
```
The reason for the above failure is that columns are resolved by name and if there are multiple columns with the same name, it will fail due to ambiguity.
This PR updates `DataFrameNaFunctions.drop` such that if the columns to drop are not specified, it will resolve ambiguity gracefully by applying `drop` to all the eligible columns. (Note that if the user specifies the columns, it will still continue to fail due to ambiguity).
### Why are the changes needed?
If column names are not specified, `drop` should not fail due to ambiguity since it should still be able to apply `drop` to the eligible columns.
### Does this PR introduce any user-facing change?
Yes, now all the rows with nulls are dropped in the above example:
```
scala> df.na.drop("any").show
+----+----+----+
|col1|col2|col2|
+----+----+----+
+----+----+----+
```
### How was this patch tested?
Added new unit tests.
Closes#26700 from imback82/na_drop.
Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
For a literal number with an exponent(e.g. 1e-45, 1E2), we'd parse it to Double by default rather than Decimal. And user could still use `spark.sql.legacy.exponentLiteralToDecimal.enabled=true` to fall back to previous behavior.
### Why are the changes needed?
According to ANSI standard of SQL, we see that the (part of) definition of `literal` :
```
<approximate numeric literal> ::=
<mantissa> E <exponent>
```
which indicates that a literal number with an exponent should be approximate numeric(e.g. Double) rather than exact numeric(e.g. Decimal).
And when we test Presto, we found that Presto also conforms to this standard:
```
presto:default> select typeof(1E2);
_col0
--------
double
(1 row)
```
```
presto:default> select typeof(1.2);
_col0
--------------
decimal(2,1)
(1 row)
```
We also find that, actually, literals like `1E2` are parsed as Double before Spark2.1, but changed to Decimal after #14828 due to *The difference between the two confuses most users* as it said. But we also see support(from DB2 test) of original behavior at #14828 (comment).
Although, we also see that PostgreSQL has its own implementation:
```
postgres=# select pg_typeof(1E2);
pg_typeof
-----------
numeric
(1 row)
postgres=# select pg_typeof(1.2);
pg_typeof
-----------
numeric
(1 row)
```
We still think that Spark should also conform to this standard while considering SQL standard and Spark own history and majority DBMS and also user experience.
### Does this PR introduce any user-facing change?
Yes.
For `1E2`, before this PR:
```
scala> spark.sql("select 1E2")
res0: org.apache.spark.sql.DataFrame = [1E+2: decimal(1,-2)]
```
After this PR:
```
scala> spark.sql("select 1E2")
res0: org.apache.spark.sql.DataFrame = [100.0: double]
```
And for `1E-45`, before this PR:
```
org.apache.spark.sql.catalyst.parser.ParseException:
decimal can only support precision up to 38
== SQL ==
select 1E-45
at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parse(ParseDriver.scala:131)
at org.apache.spark.sql.execution.SparkSqlParser.parse(SparkSqlParser.scala:48)
at org.apache.spark.sql.catalyst.parser.AbstractSqlParser.parsePlan(ParseDriver.scala:76)
at org.apache.spark.sql.SparkSession.$anonfun$sql$1(SparkSession.scala:605)
at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111)
at org.apache.spark.sql.SparkSession.sql(SparkSession.scala:605)
... 47 elided
```
after this PR:
```
scala> spark.sql("select 1E-45");
res1: org.apache.spark.sql.DataFrame = [1.0E-45: double]
```
And before this PR, user may feel super weird to see that `select 1e40` works but `select 1e-40 fails`. And now, both of them work well.
### How was this patch tested?
updated `literals.sql.out` and `ansi/literals.sql.out`
Closes#26595 from Ngone51/SPARK-29956.
Authored-by: wuyi <ngone_5451@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?
[HIVE-12063](https://issues.apache.org/jira/browse/HIVE-12063) improved pad decimal numbers with trailing zeros to the scale of the column. The following description is copied from the description of HIVE-12063.
> HIVE-7373 was to address the problems of trimming tailing zeros by Hive, which caused many problems including treating 0.0, 0.00 and so on as 0, which has different precision/scale. Please refer to HIVE-7373 description. However, HIVE-7373 was reverted by HIVE-8745 while the underlying problems remained. HIVE-11835 was resolved recently to address one of the problems, where 0.0, 0.00, and so on cannot be read into decimal(1,1).
However, HIVE-11835 didn't address the problem of showing as 0 in query result for any decimal values such as 0.0, 0.00, etc. This causes confusion as 0 and 0.0 have different precision/scale than 0.
The proposal here is to pad zeros for query result to the type's scale. This not only removes the confusion described above, but also aligns with many other DBs. Internal decimal number representation doesn't change, however.
**Spark SQL**:
```sql
// bin/spark-sql
spark-sql> select cast(1 as decimal(38, 18));
1
spark-sql>
// bin/beeline
0: jdbc:hive2://localhost:10000/default> select cast(1 as decimal(38, 18));
+----------------------------+--+
| CAST(1 AS DECIMAL(38,18)) |
+----------------------------+--+
| 1.000000000000000000 |
+----------------------------+--+
// bin/spark-shell
scala> spark.sql("select cast(1 as decimal(38, 18))").show(false)
+-------------------------+
|CAST(1 AS DECIMAL(38,18))|
+-------------------------+
|1.000000000000000000 |
+-------------------------+
// bin/pyspark
>>> spark.sql("select cast(1 as decimal(38, 18))").show()
+-------------------------+
|CAST(1 AS DECIMAL(38,18))|
+-------------------------+
| 1.000000000000000000|
+-------------------------+
// bin/sparkR
> showDF(sql("SELECT cast(1 as decimal(38, 18))"))
+-------------------------+
|CAST(1 AS DECIMAL(38,18))|
+-------------------------+
| 1.000000000000000000|
+-------------------------+
```
**PostgreSQL**:
```sql
postgres=# select cast(1 as decimal(38, 18));
numeric
----------------------
1.000000000000000000
(1 row)
```
**Presto**:
```sql
presto> select cast(1 as decimal(38, 18));
_col0
----------------------
1.000000000000000000
(1 row)
```
## How was this patch tested?
unit tests and manual test:
```sql
spark-sql> select cast(1 as decimal(38, 18));
1.000000000000000000
```
Spark SQL Upgrading Guide:
![image](https://user-images.githubusercontent.com/5399861/69649620-4405c380-10a8-11ea-84b1-6ee675663b98.png)
Closes#26697 from wangyum/SPARK-28461.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Support JDBC/ODBC tab for HistoryServer WebUI. Currently from Historyserver we can't access the JDBC/ODBC tab for thrift server applications. In this PR, I am doing 2 main changes
1. Refactor existing thrift server listener to support kvstore
2. Add history server plugin for thrift server listener and tab.
### Why are the changes needed?
Users can access Thriftserver tab from History server for both running and finished applications,
### Does this PR introduce any user-facing change?
Support for JDBC/ODBC tab for the WEBUI from History server
### How was this patch tested?
Add UT and Manual tests
1. Start Thriftserver and Historyserver
```
sbin/stop-thriftserver.sh
sbin/stop-historyserver.sh
sbin/start-thriftserver.sh
sbin/start-historyserver.sh
```
2. Launch beeline
`bin/beeline -u jdbc:hive2://localhost:10000`
3. Run queries
Go to the JDBC/ODBC page of the WebUI from History server
![image](https://user-images.githubusercontent.com/23054875/68365501-cf013700-0156-11ea-84b4-fda8008c92c4.png)
Closes#26378 from shahidki31/ThriftKVStore.
Authored-by: shahid <shahidki31@gmail.com>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
### What changes were proposed in this pull request?
refine the output of "DESC TABLE" command.
After this PR, the output of "DESC TABLE" command is like below :
```
id bigint
data string
# Partitioning
Part 0 id
# Detailed Table Information
Name testca.table_name
Comment this is a test table
Location /tmp/testcat/table_name
Provider foo
Table Properties [bar=baz]
```
### Why are the changes needed?
Currently, "DESC TABLE" will show reserved properties (eg. location, comment) in the "Table Property" section.
Since reserved properties are different from common properties, displaying reserved properties together with other table detailed information and displaying other properties in single field should be reasonable, and it is consistent with hive and DescribeTableCommand action.
### Does this PR introduce any user-facing change?
yes, the output of "DESC TABLE" command is refined as above.
### How was this patch tested?
Update existing unit tests.
Closes#26677 from fuwhu/SPARK-29979-FOLLOWUP-1.
Authored-by: fuwhu <bestwwg@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?
[HIVE-12063](https://issues.apache.org/jira/browse/HIVE-12063) improved pad decimal numbers with trailing zeros to the scale of the column. The following description is copied from the description of HIVE-12063.
> HIVE-7373 was to address the problems of trimming tailing zeros by Hive, which caused many problems including treating 0.0, 0.00 and so on as 0, which has different precision/scale. Please refer to HIVE-7373 description. However, HIVE-7373 was reverted by HIVE-8745 while the underlying problems remained. HIVE-11835 was resolved recently to address one of the problems, where 0.0, 0.00, and so on cannot be read into decimal(1,1).
However, HIVE-11835 didn't address the problem of showing as 0 in query result for any decimal values such as 0.0, 0.00, etc. This causes confusion as 0 and 0.0 have different precision/scale than 0.
The proposal here is to pad zeros for query result to the type's scale. This not only removes the confusion described above, but also aligns with many other DBs. Internal decimal number representation doesn't change, however.
**Spark SQL**:
```sql
// bin/spark-sql
spark-sql> select cast(1 as decimal(38, 18));
1
spark-sql>
// bin/beeline
0: jdbc:hive2://localhost:10000/default> select cast(1 as decimal(38, 18));
+----------------------------+--+
| CAST(1 AS DECIMAL(38,18)) |
+----------------------------+--+
| 1.000000000000000000 |
+----------------------------+--+
// bin/spark-shell
scala> spark.sql("select cast(1 as decimal(38, 18))").show(false)
+-------------------------+
|CAST(1 AS DECIMAL(38,18))|
+-------------------------+
|1.000000000000000000 |
+-------------------------+
// bin/pyspark
>>> spark.sql("select cast(1 as decimal(38, 18))").show()
+-------------------------+
|CAST(1 AS DECIMAL(38,18))|
+-------------------------+
| 1.000000000000000000|
+-------------------------+
// bin/sparkR
> showDF(sql("SELECT cast(1 as decimal(38, 18))"))
+-------------------------+
|CAST(1 AS DECIMAL(38,18))|
+-------------------------+
| 1.000000000000000000|
+-------------------------+
```
**PostgreSQL**:
```sql
postgres=# select cast(1 as decimal(38, 18));
numeric
----------------------
1.000000000000000000
(1 row)
```
**Presto**:
```sql
presto> select cast(1 as decimal(38, 18));
_col0
----------------------
1.000000000000000000
(1 row)
```
## How was this patch tested?
unit tests and manual test:
```sql
spark-sql> select cast(1 as decimal(38, 18));
1.000000000000000000
```
Spark SQL Upgrading Guide:
![image](https://user-images.githubusercontent.com/5399861/69649620-4405c380-10a8-11ea-84b1-6ee675663b98.png)
Closes#25214 from wangyum/SPARK-28461.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Support columnar pruning through non-deterministic expressions.
### Why are the changes needed?
In some cases, columns can still be pruned even though nondeterministic expressions appears.
e.g. for the plan `Filter('a = 1, Project(Seq('a, rand() as 'r), LogicalRelation('a, 'b)))`, we shall still prune column b while non-deterministic expression appears.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Added a new test file: `ScanOperationSuite`.
Added test in `FileSourceStrategySuite` to verify the right prune behavior for both DS v1 and v2.
Closes#26629 from Ngone51/SPARK-29768.
Authored-by: wuyi <ngone_5451@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
```scala
// Do not allow null values. We follow the semantics of Hive's collect_list/collect_set here.
// See: org.apache.hadoop.hive.ql.udf.generic.GenericUDAFMkCollectionEvaluator
```
These two functions do not allow null values as they are defined, so their elements should not contain null.
### Why are the changes needed?
Casting collect_list(a) to ArrayType(_, false) fails before this fix.
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
add ut
Closes#26651 from yaooqinn/SPARK-30008.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This reverts commit 31c4fab (#23052) to make sure the partition calling `ManifestFileCommitProtocol.newTaskTempFile` creates actual file.
This also reverts part of commit 0d3d46d (#26639) since the commit fixes the issue raised from 31c4fab and we're reverting back. The reason of partial revert is that we found the UT be worth to keep as it is, preventing regression - given the UT can detect the issue on empty partition -> no actual file. This makes one more change to UT; moved intentionally to test both DSv1 and DSv2.
### Why are the changes needed?
After the changes in SPARK-26081 (commit 31c4fab / #23052), CSV/JSON/TEXT don't create actual file if the partition is empty. This optimization causes a problem in `ManifestFileCommitProtocol`: the API `newTaskTempFile` is called without actual file creation. Then `fs.getFileStatus` throws `FileNotFoundException` since the file is not created.
SPARK-29999 (commit 0d3d46d / #26639) fixes the problem. But it is too costly to check file existence on each task commit. We should simply restore the behavior before SPARK-26081.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Jenkins build will follow.
Closes#26671 from HeartSaVioR/revert-SPARK-26081-SPARK-29999.
Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
### What changes were proposed in this pull request?
We are now able to handle whitespaces for integral and fractional types, and the leading or trailing whitespaces for interval, date, and timestamps. But the current interval parser is not able to identify whitespaces as separates as PostgreSQL can do.
This PR makes the whitespaces handling be consistent for nterval values.
Typed interval literal, multi-unit representation, and casting from strings are all supported.
```sql
postgres=# select interval E'1 \t day';
interval
----------
1 day
(1 row)
postgres=# select interval E'1\t' day;
interval
----------
1 day
(1 row)
```
### Why are the changes needed?
Whitespace handling should be consistent for interval value, and across different types in Spark.
PostgreSQL feature parity.
### Does this PR introduce any user-facing change?
Yes, the interval string of multi-units values which separated by whitespaces can be valid now.
### How was this patch tested?
add ut.
Closes#26662 from yaooqinn/SPARK-30026.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Make separate source trees for Scala 2.12/2.13 in order to accommodate mutually-incompatible support for Ordering of double, float.
Note: This isn't the last change that will need a split source tree for 2.13. But this particular change could go several ways:
- (Split source tree)
- Inline the Scala 2.12 implementation
- Reflection
For this change alone any are possible, and splitting the source tree is a bit overkill. But if it will be necessary for other JIRAs (see umbrella SPARK-25075), then it might be the easiest way to implement this.
### Why are the changes needed?
Scala 2.13 split Ordering.Double into Ordering.Double.TotalOrdering and Ordering.Double.IeeeOrdering. Neither can be used in a single build that supports 2.12 and 2.13.
TotalOrdering works like java.lang.Double.compare. IeeeOrdering works like Scala 2.12 Ordering.Double. They differ in how NaN is handled - compares always above other values? or always compares as 'false'? In theory they have different uses: TotalOrdering is important if floating-point values are sorted. IeeeOrdering behaves like 2.12 and JVM comparison operators.
I chose TotalOrdering as I think we care more about stable sorting, and because elsewhere we rely on java.lang comparisons. It is also possible to support with two methods.
### Does this PR introduce any user-facing change?
Pending tests, will see if it obviously affects any sort order. We need to see if it changes NaN sort order.
### How was this patch tested?
Existing tests so far.
Closes#26654 from srowen/SPARK-30009.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Add CreateViewStatement and make CREARE VIEW go through the same catalog/table resolution framework of v2 commands.
### Why are the changes needed?
It's important to make all the commands have the same table resolution behavior, to avoid confusing end-users. e.g.
```
USE my_catalog
DESC v // success and describe the view v from my_catalog
CREATE VIEW v AS SELECT 1 // report view not found as there is no view v in the session catalog
```
### Does this PR introduce any user-facing change?
Yes. When running CREATE VIEW ... Spark fails the command if the current catalog is set to a v2 catalog, or the view name specified a v2 catalog.
### How was this patch tested?
unit tests
Closes#26649 from huaxingao/spark-29862.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
fix the overdue and incorrect description about CalendarIntervalType
### Why are the changes needed?
api doc correctness
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
no
Closes#26659 from yaooqinn/intervaldoc.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR aims to remove `hive-2.3` profile from `sql/hive` module.
### Why are the changes needed?
Currently, we need `-Phive-1.2` or `-Phive-2.3` additionally to build `hive` or `hive-thriftserver` module. Without specifying it, the build fails like the following. This PR will recover it.
```
$ build/mvn -DskipTests compile --pl sql/hive
...
[ERROR] [Error] /Users/dongjoon/APACHE/spark-merge/sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveExternalCatalog.scala:32: object serde is not a member of package org.apache.hadoop.hive
```
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
1. Pass GitHub Action dependency check with no manifest change.
2. Pass GitHub Action build for all combinations.
3. Pass the Jenkins UT.
Closes#26668 from dongjoon-hyun/SPARK-30031.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
# What changes were proposed in this pull request?
This PR aims to relocate the following internal dependencies to compile `sql/core` without `-Phive-2.3` profile.
1. Move the `hive-storage-api` to `sql/core` which is using `hive-storage-api` really.
**BEFORE (sql/core compilation)**
```
$ ./build/mvn -DskipTests --pl sql/core --am compile
...
[ERROR] [Error] /Users/dongjoon/APACHE/spark/sql/core/v2.3/src/main/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilters.scala:21: object hive is not a member of package org.apache.hadoop
...
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
```
**AFTER (sql/core compilation)**
```
$ ./build/mvn -DskipTests --pl sql/core --am compile
...
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 02:04 min
[INFO] Finished at: 2019-11-25T00:20:11-08:00
[INFO] ------------------------------------------------------------------------
```
2. For (1), add `commons-lang:commons-lang` test dependency to `spark-core` module to manage the dependency explicitly. Without this, `core` module fails to build the test classes.
```
$ ./build/mvn -DskipTests --pl core --am package -Phadoop-3.2
...
[INFO] --- scala-maven-plugin:4.3.0:testCompile (scala-test-compile-first) spark-core_2.12 ---
[INFO] Using incremental compilation using Mixed compile order
[INFO] Compiler bridge file: /Users/dongjoon/.sbt/1.0/zinc/org.scala-sbt/org.scala-sbt-compiler-bridge_2.12-1.3.1-bin_2.12.10__52.0-1.3.1_20191012T045515.jar
[INFO] Compiling 271 Scala sources and 26 Java sources to /spark/core/target/scala-2.12/test-classes ...
[ERROR] [Error] /spark/core/src/test/scala/org/apache/spark/util/PropertiesCloneBenchmark.scala:23: object lang is not a member of package org.apache.commons
[ERROR] [Error] /spark/core/src/test/scala/org/apache/spark/util/PropertiesCloneBenchmark.scala:49: not found: value SerializationUtils
[ERROR] two errors found
```
**BEFORE (commons-lang:commons-lang)**
The following is the previous `core` module's `commons-lang:commons-lang` dependency.
1. **branch-2.4**
```
$ mvn dependency:tree -Dincludes=commons-lang:commons-lang
[INFO] --- maven-dependency-plugin:3.0.2:tree (default-cli) spark-core_2.11 ---
[INFO] org.apache.spark:spark-core_2.11🫙2.4.5-SNAPSHOT
[INFO] \- org.spark-project.hive:hive-exec:jar:1.2.1.spark2:provided
[INFO] \- commons-lang:commons-lang:jar:2.6:compile
```
2. **v3.0.0-preview (-Phadoop-3.2)**
```
$ mvn dependency:tree -Dincludes=commons-lang:commons-lang -Phadoop-3.2
[INFO] --- maven-dependency-plugin:3.1.1:tree (default-cli) spark-core_2.12 ---
[INFO] org.apache.spark:spark-core_2.12🫙3.0.0-preview
[INFO] \- org.apache.hive:hive-storage-api:jar:2.6.0:compile
[INFO] \- commons-lang:commons-lang:jar:2.6:compile
```
3. **v3.0.0-preview(default)**
```
$ mvn dependency:tree -Dincludes=commons-lang:commons-lang
[INFO] --- maven-dependency-plugin:3.1.1:tree (default-cli) spark-core_2.12 ---
[INFO] org.apache.spark:spark-core_2.12🫙3.0.0-preview
[INFO] \- org.apache.hadoop:hadoop-client:jar:2.7.4:compile
[INFO] \- org.apache.hadoop:hadoop-common:jar:2.7.4:compile
[INFO] \- commons-lang:commons-lang:jar:2.6:compile
```
**AFTER (commons-lang:commons-lang)**
```
$ mvn dependency:tree -Dincludes=commons-lang:commons-lang
[INFO] --- maven-dependency-plugin:3.1.1:tree (default-cli) spark-core_2.12 ---
[INFO] org.apache.spark:spark-core_2.12🫙3.0.0-SNAPSHOT
[INFO] \- commons-lang:commons-lang:jar:2.6:test
```
Since we wanted to verify that this PR doesn't change `hive-1.2` profile, we merged
[SPARK-30005 Update `test-dependencies.sh` to check `hive-1.2/2.3` profile](a1706e2fa7) before this PR.
### Why are the changes needed?
- Apache Spark 2.4's `sql/core` is using `Apache ORC (nohive)` jars including shaded `hive-storage-api` to access ORC data sources.
- Apache Spark 3.0's `sql/core` is using `Apache Hive` jars directly. Previously, `-Phadoop-3.2` hid this `hive-storage-api` dependency. Now, we are using `-Phive-2.3` instead. As I mentioned [previously](https://github.com/apache/spark/pull/26619#issuecomment-556926064), this PR is required to compile `sql/core` module without `-Phive-2.3`.
- For `sql/hive` and `sql/hive-thriftserver`, it's natural that we need `-Phive-1.2` or `-Phive-2.3`.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
This will pass the Jenkins (with the dependency check and unit tests).
We need to check manually with `./build/mvn -DskipTests --pl sql/core --am compile`.
This closes#26657 .
Closes#26658 from dongjoon-hyun/SPARK-30015.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Add "comment" and "location" property key constants in TableCatalog and SupportNamespaces.
And update code of implementation classes to use these constants instead of hard code.
### Why are the changes needed?
Currently, some basic/reserved keys (eg. "location", "comment") of table and namespace properties are hard coded or defined in specific logical plan implementation class.
These keys can be centralized in TableCatalog and SupportsNamespaces interface and shared across different implementation classes.
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
Existing unit test
Closes#26617 from fuwhu/SPARK-29979.
Authored-by: fuwhu <bestwwg@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
`DataFrameNaFunctions.fill` doesn't handle duplicate columns even when column names are not specified.
```Scala
val left = Seq(("1", null), ("3", "4")).toDF("col1", "col2")
val right = Seq(("1", "2"), ("3", null)).toDF("col1", "col2")
val df = left.join(right, Seq("col1"))
df.printSchema
df.na.fill("hello").show
```
produces
```
root
|-- col1: string (nullable = true)
|-- col2: string (nullable = true)
|-- col2: string (nullable = true)
org.apache.spark.sql.AnalysisException: Reference 'col2' is ambiguous, could be: col2, col2.;
at org.apache.spark.sql.catalyst.expressions.package$AttributeSeq.resolve(package.scala:259)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveQuoted(LogicalPlan.scala:121)
at org.apache.spark.sql.Dataset.resolve(Dataset.scala:221)
at org.apache.spark.sql.Dataset.col(Dataset.scala:1268)
```
The reason for the above failure is that columns are looked up with `DataSet.col()` which tries to resolve a column by name and if there are multiple columns with the same name, it will fail due to ambiguity.
This PR updates `DataFrameNaFunctions.fill` such that if the columns to fill are not specified, it will resolve ambiguity gracefully by applying `fill` to all the eligible columns. (Note that if the user specifies the columns, it will still continue to fail due to ambiguity).
### Why are the changes needed?
If column names are not specified, `fill` should not fail due to ambiguity since it should still be able to apply `fill` to the eligible columns.
### Does this PR introduce any user-facing change?
Yes, now the above example displays the following:
```
+----+-----+-----+
|col1| col2| col2|
+----+-----+-----+
| 1|hello| 2|
| 3| 4|hello|
+----+-----+-----+
```
### How was this patch tested?
Added new unit tests.
Closes#26593 from imback82/na_fill.
Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
add document to address https://github.com/apache/spark/pull/26612#discussion_r349844327
### Why are the changes needed?
help people understand how to use --CONFIG_DIM
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
N/A
Closes#26661 from cloud-fan/test.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
A java like string trim method trims all whitespaces that less or equal than 0x20. currently, our UTF8String handle the space =0x20 ONLY. This is not suitable for many cases in Spark, like trim for interval strings, date, timestamps, PostgreSQL like cast string to boolean.
### Why are the changes needed?
improve the white spaces handling in UTF8String, also with some bugs fixed
### Does this PR introduce any user-facing change?
yes,
string with `control character` at either end can be convert to date/timestamp and interval now
### How was this patch tested?
add ut
Closes#26626 from yaooqinn/SPARK-29986.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
https://bugs.openjdk.java.net/browse/JDK-8170259https://bugs.openjdk.java.net/browse/JDK-8170563
When we cast string type to decimal type, we rely on java.math. BigDecimal. It can't accept leading and training spaces, as you can see in the above links. This behavior is not consistent with other numeric types now. we need to fix it and keep consistency.
### Why are the changes needed?
make string to numeric types be consistent
### Does this PR introduce any user-facing change?
yes, string removed trailing or leading white spaces will be able to convert to decimal if the trimmed is valid
### How was this patch tested?
1. modify ut
#### Benchmark
```scala
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.apache.spark.sql.execution.benchmark
import org.apache.spark.benchmark.Benchmark
/**
* Benchmark trim the string when casting string type to Boolean/Numeric types.
* To run this benchmark:
* {{{
* 1. without sbt:
* bin/spark-submit --class <this class> --jars <spark core test jar> <spark sql test jar>
* 2. build/sbt "sql/test:runMain <this class>"
* 3. generate result: SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain <this class>"
* Results will be written to "benchmarks/CastBenchmark-results.txt".
* }}}
*/
object CastBenchmark extends SqlBasedBenchmark {
override def runBenchmarkSuite(mainArgs: Array[String]): Unit = {
val title = "Cast String to Integral"
runBenchmark(title) {
withTempPath { dir =>
val N = 500L << 14
val df = spark.range(N)
val types = Seq("decimal")
(1 to 5).by(2).foreach { i =>
df.selectExpr(s"concat(id, '${" " * i}') as str")
.write.mode("overwrite").parquet(dir + i.toString)
}
val benchmark = new Benchmark(title, N, minNumIters = 5, output = output)
Seq(true, false).foreach { trim =>
types.foreach { t =>
val str = if (trim) "trim(str)" else "str"
val expr = s"cast($str as $t) as c_$t"
(1 to 5).by(2).foreach { i =>
benchmark.addCase(expr + s" - with $i spaces") { _ =>
spark.read.parquet(dir + i.toString).selectExpr(expr).collect()
}
}
}
}
benchmark.run()
}
}
}
}
```
#### string trim vs not trim
```java
[info] Java HotSpot(TM) 64-Bit Server VM 1.8.0_231-b11 on Mac OS X 10.15.1
[info] Intel(R) Core(TM) i9-9980HK CPU 2.40GHz
[info] Cast String to Integral: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] cast(trim(str) as decimal) as c_decimal - with 1 spaces 3362 5486 NaN 2.4 410.4 1.0X
[info] cast(trim(str) as decimal) as c_decimal - with 3 spaces 3251 5655 NaN 2.5 396.8 1.0X
[info] cast(trim(str) as decimal) as c_decimal - with 5 spaces 3208 5725 NaN 2.6 391.7 1.0X
[info] cast(str as decimal) as c_decimal - with 1 spaces 13962 16233 1354 0.6 1704.3 0.2X
[info] cast(str as decimal) as c_decimal - with 3 spaces 14273 14444 179 0.6 1742.4 0.2X
[info] cast(str as decimal) as c_decimal - with 5 spaces 14318 14535 125 0.6 1747.8 0.2X
```
#### string trim vs this fix
```java
[info] Java HotSpot(TM) 64-Bit Server VM 1.8.0_231-b11 on Mac OS X 10.15.1
[info] Intel(R) Core(TM) i9-9980HK CPU 2.40GHz
[info] Cast String to Integral: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
[info] ------------------------------------------------------------------------------------------------------------------------
[info] cast(trim(str) as decimal) as c_decimal - with 1 spaces 3265 6299 NaN 2.5 398.6 1.0X
[info] cast(trim(str) as decimal) as c_decimal - with 3 spaces 3183 6241 693 2.6 388.5 1.0X
[info] cast(trim(str) as decimal) as c_decimal - with 5 spaces 3167 5923 1151 2.6 386.7 1.0X
[info] cast(str as decimal) as c_decimal - with 1 spaces 3161 5838 1126 2.6 385.9 1.0X
[info] cast(str as decimal) as c_decimal - with 3 spaces 3046 3457 837 2.7 371.8 1.1X
[info] cast(str as decimal) as c_decimal - with 5 spaces 3053 4445 NaN 2.7 372.7 1.1X
[info]
```
Closes#26640 from yaooqinn/SPARK-30000.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Omit parens on calls like BigDecimal.longValue()
### Why are the changes needed?
For some reason, this won't compile in Scala 2.13. The calls are otherwise equivalent in 2.12.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Existing tests
Closes#26653 from srowen/SPARK-30013.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Since JIRA SPARK-28346,PR [25111](https://github.com/apache/spark/pull/25111), QueryExecution will copy all node stage-by-stage. This make all node instance twice almost. So we should make all class fields lazy to avoid create more unexpected object.
### Why are the changes needed?
Avoid create more unexpected object.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Exists UT.
Closes#26565 from ulysses-you/make-val-lazy.
Authored-by: ulysses <youxiduo@weidian.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This patch checks the existence of output file for each task while committing the task, so that it doesn't throw FileNotFoundException while creating SinkFileStatus. The check is newly required for DSv2 implementation of FileStreamSink, as it is changed to create the output file lazily (as an improvement).
JSON writer for example: 9ec2a4e58c/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonOutputWriter.scala (L49-L60)
### Why are the changes needed?
Without this patch, FileStreamSink throws FileNotFoundException when writing empty partition.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Added UT.
Closes#26639 from HeartSaVioR/SPARK-29999.
Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This pr proposes a new independent config so that `LogicalRelation` could use `rowCount` to compute data statistics in logical plans even if CBO disabled. In the master, we currently cannot enable `StarSchemaDetection.reorderStarJoins` because we need to turn off CBO to enable it but `StarSchemaDetection` internally references the `rowCount` that is used in LogicalRelation if CBO disabled.
### Why are the changes needed?
Plan stats are pretty useful other than CBO, e.g., star-schema detector and dynamic partition pruning.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Added tests in `DataFrameJoinSuite`.
Closes#21668 from maropu/PlanStatsConf.
Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Give `processingTimeSec` 0.001 when a micro-batch completed under 1ms.
### Why are the changes needed?
The `processingTimeSec` of batch may be less than 1 ms. As `processingTimeSec` is calculated in ms, so `processingTimeSec` equals 0L. If there is no data in this batch, the `processedRowsPerSecond` equals `0/0.0d`, i.e. `Double.NaN`. If there are some data in this batch, the `processedRowsPerSecond` equals `N/0.0d`, i.e. `Double.Infinity`.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Add new UT
Closes#26610 from uncleGen/SPARK-29973.
Authored-by: uncleGen <hustyugm@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request?
This is a follow-up of https://github.com/apache/spark/pull/26209 .
This renames class `Version` to class `SparkVersion`.
### Why are the changes needed?
According to the review comment, this uses more specific class name.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Pass the Jenkins with the existing tests.
Closes#26647 from dongjoon-hyun/SPARK-29554.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This is a follow-up according to liancheng 's advice.
- https://github.com/apache/spark/pull/26619#discussion_r349326090
### Why are the changes needed?
Previously, we chose the full version to be carefully. As of today, it seems that `Apache Hive 2.3` branch seems to become stable.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Pass the compile combination on GitHub Action.
1. hadoop-2.7/hive-1.2/JDK8
2. hadoop-2.7/hive-2.3/JDK8
3. hadoop-3.2/hive-2.3/JDK8
4. hadoop-3.2/hive-2.3/JDK11
Also, pass the Jenkins with `hadoop-2.7` and `hadoop-3.2` for (1) and (4).
(2) and (3) is not ready in Jenkins.
Closes#26645 from dongjoon-hyun/SPARK-RENAME-HIVE-DIRECTORY.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR aims the followings.
- Add two profiles, `hive-1.2` and `hive-2.3` (default)
- Validate if we keep the existing combination at least. (Hadoop-2.7 + Hive 1.2 / Hadoop-3.2 + Hive 2.3).
For now, we assumes that `hive-1.2` is explicitly used with `hadoop-2.7` and `hive-2.3` with `hadoop-3.2`. The followings are beyond the scope of this PR.
- SPARK-29988 Adjust Jenkins jobs for `hive-1.2/2.3` combination
- SPARK-29989 Update release-script for `hive-1.2/2.3` combination
- SPARK-29991 Support `hive-1.2/2.3` in PR Builder
### Why are the changes needed?
This will help to switch our dependencies to update the exposed dependencies.
### Does this PR introduce any user-facing change?
This is a dev-only change that the build profile combinations are changed.
- `-Phadoop-2.7` => `-Phadoop-2.7 -Phive-1.2`
- `-Phadoop-3.2` => `-Phadoop-3.2 -Phive-2.3`
### How was this patch tested?
Pass the Jenkins with the dependency check and tests to make it sure we don't change anything for now.
- [Jenkins (-Phadoop-2.7 -Phive-1.2)](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/114192/consoleFull)
- [Jenkins (-Phadoop-3.2 -Phive-2.3)](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/114192/consoleFull)
Also, from now, GitHub Action validates the following combinations.
![gha](https://user-images.githubusercontent.com/9700541/69355365-822d5e00-0c36-11ea-93f7-e00e5459e1d0.png)
Closes#26619 from dongjoon-hyun/SPARK-29981.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This is rather a followup of https://github.com/apache/spark/pull/25464 (see https://github.com/apache/spark/pull/25464/files#r349543286)
It will cause an infinite recursion via mapping children - we should return the hint rather than its parent plan in unknown hint resolution.
### Why are the changes needed?
Prevent Stack over flow during hint resolution.
### Does this PR introduce any user-facing change?
Yes, it avoids stack overflow exception It was caused by https://github.com/apache/spark/pull/25464 and this is only in the master.
No behaviour changes to end users as it happened only in the master.
### How was this patch tested?
Unittest was added.
Closes#26642 from HyukjinKwon/SPARK-30003.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR proposes to add `as` API to RelationalGroupedDataset. It creates KeyValueGroupedDataset instance using given grouping expressions, instead of a typed function in groupByKey API. Because it can leverage existing columns, it can use existing data partition, if any, when doing operations like cogroup.
### Why are the changes needed?
Currently if users want to do cogroup on DataFrames, there is no good way to do except for KeyValueGroupedDataset.
1. KeyValueGroupedDataset ignores existing data partition if any. That is a problem.
2. groupByKey calls typed function to create additional keys. You can not reuse existing columns, if you just need grouping by them.
```scala
// df1 and df2 are certainly partitioned and sorted.
val df1 = Seq((1, 2, 3), (2, 3, 4)).toDF("a", "b", "c")
.repartition($"a").sortWithinPartitions("a")
val df2 = Seq((1, 2, 4), (2, 3, 5)).toDF("a", "b", "c")
.repartition($"a").sortWithinPartitions("a")
```
```scala
// This groupBy.as.cogroup won't unnecessarily repartition the data
val df3 = df1.groupBy("a").as[Int]
.cogroup(df2.groupBy("a").as[Int]) { case (key, data1, data2) =>
data1.zip(data2).map { p =>
p._1.getInt(2) + p._2.getInt(2)
}
}
```
```
== Physical Plan ==
*(5) SerializeFromObject [input[0, int, false] AS value#11247]
+- CoGroup org.apache.spark.sql.DataFrameSuite$$Lambda$4922/12067092816eec1b6f, a#11209: int, createexternalrow(a#11209, b#11210, c#11211, StructField(a,IntegerType,false), StructField(b,IntegerType,false), StructField(c,IntegerType,false)), createexternalrow(a#11225, b#11226, c#11227, StructField(a,IntegerType,false), StructField(b,IntegerType,false), StructField(c,IntegerType,false)), [a#11209], [a#11225], [a#11209, b#11210, c#11211], [a#11225, b#11226, c#11227], obj#11246: int
:- *(2) Sort [a#11209 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(a#11209, 5), false, [id=#10218]
: +- *(1) Project [_1#11202 AS a#11209, _2#11203 AS b#11210, _3#11204 AS c#11211]
: +- *(1) LocalTableScan [_1#11202, _2#11203, _3#11204]
+- *(4) Sort [a#11225 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(a#11225, 5), false, [id=#10223]
+- *(3) Project [_1#11218 AS a#11225, _2#11219 AS b#11226, _3#11220 AS c#11227]
+- *(3) LocalTableScan [_1#11218, _2#11219, _3#11220]
```
```scala
// Current approach creates additional AppendColumns and repartition data again
val df4 = df1.groupByKey(r => r.getInt(0)).cogroup(df2.groupByKey(r => r.getInt(0))) {
case (key, data1, data2) =>
data1.zip(data2).map { p =>
p._1.getInt(2) + p._2.getInt(2)
}
}
```
```
== Physical Plan ==
*(7) SerializeFromObject [input[0, int, false] AS value#11257]
+- CoGroup org.apache.spark.sql.DataFrameSuite$$Lambda$4933/138102700737171997, value#11252: int, createexternalrow(a#11209, b#11210, c#11211, StructField(a,IntegerType,false), StructField(b,IntegerType,false), StructField(c,IntegerType,false)), createexternalrow(a#11225, b#11226, c#11227, StructField(a,IntegerType,false), StructField(b,IntegerType,false), StructField(c,IntegerType,false)), [value#11252], [value#11254], [a#11209, b#11210, c#11211], [a#11225, b#11226, c#11227], obj#11256: int
:- *(3) Sort [value#11252 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(value#11252, 5), true, [id=#10302]
: +- AppendColumns org.apache.spark.sql.DataFrameSuite$$Lambda$4930/19529195347ce07f47, createexternalrow(a#11209, b#11210, c#11211, StructField(a,IntegerType,false), StructField(b,IntegerType,false), StructField(c,IntegerType,false)), [input[0, int, false] AS value#11252]
: +- *(2) Sort [a#11209 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(a#11209, 5), false, [id=#10297]
: +- *(1) Project [_1#11202 AS a#11209, _2#11203 AS b#11210, _3#11204 AS c#11211]
: +- *(1) LocalTableScan [_1#11202, _2#11203, _3#11204]
+- *(6) Sort [value#11254 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(value#11254, 5), true, [id=#10312]
+- AppendColumns org.apache.spark.sql.DataFrameSuite$$Lambda$4932/15265288491f0e0c1f, createexternalrow(a#11225, b#11226, c#11227, StructField(a,IntegerType,false), StructField(b,IntegerType,false), StructField(c,IntegerType,false)), [input[0, int, false] AS value#11254]
+- *(5) Sort [a#11225 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(a#11225, 5), false, [id=#10307]
+- *(4) Project [_1#11218 AS a#11225, _2#11219 AS b#11226, _3#11220 AS c#11227]
+- *(4) LocalTableScan [_1#11218, _2#11219, _3#11220]
```
### Does this PR introduce any user-facing change?
Yes, this adds a new `as` API to RelationalGroupedDataset. Users can use it to create KeyValueGroupedDataset and do cogroup.
### How was this patch tested?
Unit tests.
Closes#26509 from viirya/SPARK-29427-2.
Lead-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Co-authored-by: Liang-Chi Hsieh <liangchi@uber.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
A few cleanups for https://github.com/apache/spark/pull/26516:
1. move the calculating of partition start indices from the RDD to the rule. We can reuse code from "shrink number of reducers" in the future if we split partitions by size.
2. only check extra shuffles when adding local readers to the probe side.
3. add comments.
4. simplify the config name: `optimizedLocalShuffleReader` -> `localShuffleReader`
### Why are the changes needed?
make code more maintainable.
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
existing tests
Closes#26625 from cloud-fan/aqe.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
### What changes were proposed in this pull request?
Modify `UTF8String.toInt/toLong` to support trim spaces for both sides before converting it to byte/short/int/long.
With this kind of "cheap" trim can help improve performance for casting string to integrals. The idea is from https://github.com/apache/spark/pull/24872#issuecomment-556917834
### Why are the changes needed?
make the behavior consistent.
### Does this PR introduce any user-facing change?
yes, cast string to an integral type, and binary comparison between string and integrals will trim spaces first. their behavior will be consistent with float and double.
### How was this patch tested?
1. add ut.
2. benchmark tests
the benchmark is modified based on https://github.com/apache/spark/pull/24872#issuecomment-503827016
```scala
/*
* Licensed to the Apache Software Foundation (ASF) under one or more
* contributor license agreements. See the NOTICE file distributed with
* this work for additional information regarding copyright ownership.
* The ASF licenses this file to You under the Apache License, Version 2.0
* (the "License"); you may not use this file except in compliance with
* the License. You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE-2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
package org.apache.spark.sql.execution.benchmark
import org.apache.spark.benchmark.Benchmark
/**
* Benchmark trim the string when casting string type to Boolean/Numeric types.
* To run this benchmark:
* {{{
* 1. without sbt:
* bin/spark-submit --class <this class> --jars <spark core test jar> <spark sql test jar>
* 2. build/sbt "sql/test:runMain <this class>"
* 3. generate result: SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain <this class>"
* Results will be written to "benchmarks/CastBenchmark-results.txt".
* }}}
*/
object CastBenchmark extends SqlBasedBenchmark {
This conversation was marked as resolved by yaooqinn
override def runBenchmarkSuite(mainArgs: Array[String]): Unit = {
val title = "Cast String to Integral"
runBenchmark(title) {
withTempPath { dir =>
val N = 500L << 14
val df = spark.range(N)
val types = Seq("int", "long")
(1 to 5).by(2).foreach { i =>
df.selectExpr(s"concat(id, '${" " * i}') as str")
.write.mode("overwrite").parquet(dir + i.toString)
}
val benchmark = new Benchmark(title, N, minNumIters = 5, output = output)
Seq(true, false).foreach { trim =>
types.foreach { t =>
val str = if (trim) "trim(str)" else "str"
val expr = s"cast($str as $t) as c_$t"
(1 to 5).by(2).foreach { i =>
benchmark.addCase(expr + s" - with $i spaces") { _ =>
spark.read.parquet(dir + i.toString).selectExpr(expr).collect()
}
}
}
}
benchmark.run()
}
}
}
}
```
#### benchmark result.
normal trim v.s. trim in toInt/toLong
```java
================================================================================================
Cast String to Integral
================================================================================================
Java HotSpot(TM) 64-Bit Server VM 1.8.0_231-b11 on Mac OS X 10.15.1
Intel(R) Core(TM) i5-5287U CPU 2.90GHz
Cast String to Integral: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
cast(trim(str) as int) as c_int - with 1 spaces 10220 12994 1337 0.8 1247.5 1.0X
cast(trim(str) as int) as c_int - with 3 spaces 4763 8356 357 1.7 581.4 2.1X
cast(trim(str) as int) as c_int - with 5 spaces 4791 8042 NaN 1.7 584.9 2.1X
cast(trim(str) as long) as c_long - with 1 spaces 4014 6755 NaN 2.0 490.0 2.5X
cast(trim(str) as long) as c_long - with 3 spaces 4737 6938 NaN 1.7 578.2 2.2X
cast(trim(str) as long) as c_long - with 5 spaces 4478 6919 1404 1.8 546.6 2.3X
cast(str as int) as c_int - with 1 spaces 4443 6222 NaN 1.8 542.3 2.3X
cast(str as int) as c_int - with 3 spaces 3659 3842 170 2.2 446.7 2.8X
cast(str as int) as c_int - with 5 spaces 4372 7996 NaN 1.9 533.7 2.3X
cast(str as long) as c_long - with 1 spaces 3866 5838 NaN 2.1 471.9 2.6X
cast(str as long) as c_long - with 3 spaces 3793 5449 NaN 2.2 463.0 2.7X
cast(str as long) as c_long - with 5 spaces 4947 5961 1198 1.7 603.9 2.1X
```
Closes#26622 from yaooqinn/cheapstringtrim.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This is follow up of #26543
See https://github.com/apache/spark/pull/26543#discussion_r348934276
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Exist UT.
Closes#26628 from LantaoJin/SPARK-29911_FOLLOWUP.
Authored-by: LantaoJin <jinlantao@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
allow the sql test files to specify different dimensions of config sets during testing. For example,
```
--CONFIG_DIM1 a=1
--CONFIG_DIM1 b=2,c=3
--CONFIG_DIM2 x=1
--CONFIG_DIM2 y=1,z=2
```
This example defines 2 config dimensions, and each dimension defines 2 config sets. We will run the queries 4 times:
1. a=1, x=1
2. a=1, y=1, z=2
3. b=2, c=3, x=1
4. b=2, c=3, y=1, z=2
### Why are the changes needed?
Currently `SQLQueryTestSuite` takes a long time. This is because we run each test at least 3 times, to check with different codegen modes. This is not necessary for most of the tests, e.g. DESC TABLE. We should only check these codegen modes for certain tests.
With the --CONFIG_DIM directive, we can do things like: test different join operator(broadcast or shuffle join) X different codegen modes.
After reducing testing time, we should be able to run thrifter server SQL tests with config settings.
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
test only
Closes#26612 from cloud-fan/test.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Make `ResolveRelations` call `ResolveTables` at the beginning, and make `ResolveTables` call `ResolveTempViews`(newly added) at the beginning, to ensure the relation resolution priority.
### Why are the changes needed?
To resolve an `UnresolvedRelation`, the general process is:
1. try to resolve to (global) temp view first. If it's not a temp view, move on
2. if the table name specifies a catalog, lookup the table from the specified catalog. Otherwise, lookup table from the current catalog.
3. when looking up table from session catalog, return a v1 relation if the table provider is v1.
Currently, this process is done by 2 rules: `ResolveTables` and `ResolveRelations`. To avoid rule conflicts, we add a lot of checks:
1. `ResolveTables` only resolves `UnresolvedRelation` if it's not a temp view and the resolved table is not v1.
2. `ResolveRelations` only resolves `UnresolvedRelation` if the table name has less than 2 parts.
This requires to run `ResolveTables` before `ResolveRelations`, otherwise we may resolve a v2 table to a v1 relation.
To clearly guarantee the resolution priority, and avoid massive changes, this PR proposes to call one rule in another rule to ensure the rule execution order. Now the process is simple:
1. first run `ResolveTempViews`, see if we can resolve relation to temp view
2. then run `ResolveTables`, see if we can resolve relation to v2 tables.
3. finally run `ResolveRelations`, see if we can resolve relation to v1 tables.
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
existing tests
Closes#26214 from cloud-fan/resolve.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Ryan Blue <blue@apache.org>
### What changes were proposed in this pull request?
When implementing a ScanBuilder, we require the implementor to provide the schema of the data and the number of partitions.
However, when someone is implementing WriteBuilder we only pass them the schema, but not the number of partitions. This is an asymetrical developer experience.
This PR adds a PhysicalWriteInfo interface that is passed to createBatchWriterFactory and createStreamingWriterFactory that adds the number of partitions of the data that is going to be written.
### Why are the changes needed?
Passing in the number of partitions on the WriteBuilder would enable data sources to provision their write targets before starting to write. For example:
it could be used to provision a Kafka topic with a specific number of partitions
it could be used to scale a microservice prior to sending the data to it
it could be used to create a DsV2 that sends the data to another spark cluster (currently not possible since the reader wouldn't be able to know the number of partitions)
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Tests passed
Closes#26591 from edrevo/temp.
Authored-by: Ximo Guanter <joaquin.guantergonzalbez@telefonica.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This is to refactor `SparkPlan` code; it mainly removed `newMutableProjection`/`newOrdering`/`newNaturalAscendingOrdering` from `SparkPlan`.
The other modifications are listed below;
- Move `BaseOrdering` from `o.a.s.sqlcatalyst.expressions.codegen.GenerateOrdering.scala` to `o.a.s.sqlcatalyst.expressions.ordering.scala`
- `RowOrdering` extends `CodeGeneratorWithInterpretedFallback ` for `BaseOrdering`
- Remove the unused variables (`subexpressionEliminationEnabled` and `codeGenFallBack`) from `SparkPlan`
### Why are the changes needed?
For better code/test coverage.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Existing.
Closes#26615 from maropu/RefactorOrdering.
Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
In origin way to judge if a DataSet is empty by
```
def isEmpty: Boolean = withAction("isEmpty", limit(1).groupBy().count().queryExecution) { plan =>
plan.executeCollect().head.getLong(0) == 0
}
```
will add two shuffles by `limit()`, `groupby() and count()`, then collect all data to driver.
In this way we can avoid `oom` when collect data to driver. But it will trigger all partitions calculated and add more shuffle process.
We change it to
```
def isEmpty: Boolean = withAction("isEmpty", select().queryExecution) { plan =>
plan.executeTake(1).isEmpty
}
```
After these pr, we will add a column pruning to origin LogicalPlan and use `executeTake()` API.
then we won't add more shuffle process and just compute only one partition's data in last stage.
In this way we can reduce cost when we call `DataSet.isEmpty()` and won't bring memory issue to driver side.
### Why are the changes needed?
Optimize Dataset.isEmpty()
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Origin UT
Closes#26500 from AngersZhuuuu/SPARK-29874.
Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Add typeof function for Spark to get the underlying type of value.
```sql
-- !query 0
select typeof(1)
-- !query 0 schema
struct<typeof(1):string>
-- !query 0 output
int
-- !query 1
select typeof(1.2)
-- !query 1 schema
struct<typeof(1.2):string>
-- !query 1 output
decimal(2,1)
-- !query 2
select typeof(array(1, 2))
-- !query 2 schema
struct<typeof(array(1, 2)):string>
-- !query 2 output
array<int>
-- !query 3
select typeof(a) from (values (1), (2), (3.1)) t(a)
-- !query 3 schema
struct<typeof(a):string>
-- !query 3 output
decimal(11,1)
decimal(11,1)
decimal(11,1)
```
##### presto
```sql
presto> select typeof(array[1]);
_col0
----------------
array(integer)
(1 row)
```
##### PostgreSQL
```sql
postgres=# select pg_typeof(a) from (values (1), (2), (3.0)) t(a);
pg_typeof
-----------
numeric
numeric
numeric
(3 rows)
```
##### impala
https://issues.apache.org/jira/browse/IMPALA-1597
### Why are the changes needed?
a function which is better we have to help us debug, test, develop ...
### Does this PR introduce any user-facing change?
add a new function
### How was this patch tested?
add ut and example
Closes#26599 from yaooqinn/SPARK-29961.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
I propose to add a test from the commit a936522113 for 2.4. I extended the test by a few more lengths of requested field to cover more code branches in Jackson Core. In particular, [the optimization](5eb8973f87/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/jsonExpressions.scala (L473-L476)) calls Jackson's method 42b8b56684/src/main/java/com/fasterxml/jackson/core/json/UTF8JsonGenerator.java (L742-L746) where the internal buffer size is **8000**. In this way:
- 2000 to check 2000+2000+2000 < 8000
- 2800 from the 2.4 commit. It covers the specific case: 42b8b56684/src/main/java/com/fasterxml/jackson/core/json/UTF8JsonGenerator.java (L746)
- 8000-1, 8000, 8000+1 are sizes around the size of the internal buffer
- 65535 to test an outstanding large field.
### Why are the changes needed?
To be sure that the current implementation and future versions of Spark don't have the bug fixed in 2.4.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
By running `JsonFunctionsSuite`.
Closes#26613 from MaxGekk/json_tuple-test.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
The local temporary view is session-scoped. Its lifetime is the lifetime of the session that created it. But now cache data is cross-session. Its lifetime is the lifetime of the Spark application. That's will cause the memory leak if cache a local temporary view in memory when the session closed.
In this PR, we uncache the cached data of local temporary view when session closed. This PR doesn't impact the cached data of global temp view and persisted view.
How to reproduce:
1. create a local temporary view v1
2. cache it in memory
3. close session without drop table v1.
The application will hold the memory forever. In a long running thrift server scenario. It's worse.
```shell
0: jdbc:hive2://localhost:10000> CACHE TABLE testCacheTable AS SELECT 1;
CACHE TABLE testCacheTable AS SELECT 1;
+---------+--+
| Result |
+---------+--+
+---------+--+
No rows selected (1.498 seconds)
0: jdbc:hive2://localhost:10000> !close
!close
Closing: 0: jdbc:hive2://localhost:10000
0: jdbc:hive2://localhost:10000 (closed)> !connect 'jdbc:hive2://localhost:10000'
!connect 'jdbc:hive2://localhost:10000'
Connecting to jdbc:hive2://localhost:10000
Enter username for jdbc:hive2://localhost:10000:
lajin
Enter password for jdbc:hive2://localhost:10000:
***
Connected to: Spark SQL (version 3.0.0-SNAPSHOT)
Driver: Hive JDBC (version 1.2.1.spark2)
Transaction isolation: TRANSACTION_REPEATABLE_READ
1: jdbc:hive2://localhost:10000> select * from testCacheTable;
select * from testCacheTable;
Error: Error running query: org.apache.spark.sql.AnalysisException: Table or view not found: testCacheTable; line 1 pos 14;
'Project [*]
+- 'UnresolvedRelation [testCacheTable] (state=,code=0)
```
<img width="1047" alt="Screen Shot 2019-11-15 at 2 03 49 PM" src="https://user-images.githubusercontent.com/1853780/68923527-7ca8c180-07b9-11ea-9cc7-74f276c46840.png">
### Why are the changes needed?
Resolve memory leak for thrift server
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Manual test in UI storage tab
And add an UT
Closes#26543 from LantaoJin/SPARK-29911.
Authored-by: LantaoJin <jinlantao@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request?
Use JUnit assertions in tests uniformly, not JVM assert() statements.
### Why are the changes needed?
assert() statements do not produce as useful errors when they fail, and, if they were somehow disabled, would fail to test anything.
### Does this PR introduce any user-facing change?
No. The assertion logic should be identical.
### How was this patch tested?
Existing tests.
Closes#26581 from srowen/assertToJUnit.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request?
Fix the inconsistent behavior of build-in function SQL LEFT/RIGHT.
### Why are the changes needed?
As the comment in https://github.com/apache/spark/pull/26497#discussion_r345708065, Postgre dialect should not be affected by the ANSI mode config.
During reran the existing tests, only the LEFT/RIGHT build-in SQL function broke the assumption. We fix this by following https://www.postgresql.org/docs/12/sql-keywords-appendix.html: `LEFT/RIGHT reserved (can be function or type)`
### Does this PR introduce any user-facing change?
Yes, the Postgre dialect will not be affected by the ANSI mode config.
### How was this patch tested?
Existing UT.
Closes#26584 from xuanyuanking/SPARK-29951.
Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
SPARK-28885(#26107) has supported the ANSI store assignment rules and stopped running some ported PgSQL regression tests that violate the rules. To re-activate these tests, this pr is to modify them for passing tests with the rules.
### Why are the changes needed?
To make the test coverage better.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes#26492 from maropu/SPARK-28885-FOLLOWUP.
Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
The Web UI SQL Tab provides information on the executed SQL using plan graphs and by reporting SQL execution plans. Both sources provide useful information. Physical execution plans report Codegen Stage Ids. This PR adds Codegen Stage Ids to the plan graphs.
### Why are the changes needed?
It is useful to have Codegen Stage Id information also reported in plan graphs, this allows to more easily match physical plans and graphs with metrics when troubleshooting SQL execution.
Example snippet to show the proposed change:
![](https://issues.apache.org/jira/secure/attachment/12985837/snippet__plan_graph_with_Codegen_Stage_Id_Annotated.png)
Example of the current state:
![](https://issues.apache.org/jira/secure/attachment/12985838/snippet_plan_graph_before_patch.png)
Physical plan:
![](https://issues.apache.org/jira/secure/attachment/12985932/Physical_plan_Annotated.png)
### Does this PR introduce any user-facing change?
This PR adds Codegen Stage Id information to SQL plan graphs in the Web UI/SQL Tab.
### How was this patch tested?
Added a test + manually tested
Closes#26519 from LucaCanali/addCodegenStageIdtoWEBUIGraphs.
Authored-by: Luca Canali <luca.canali@cern.ch>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This is to refactor Predicate code; it mainly removed `newPredicate` from `SparkPlan`.
Modifications are listed below;
- Move `Predicate` from `o.a.s.sqlcatalyst.expressions.codegen.GeneratePredicate.scala` to `o.a.s.sqlcatalyst.expressions.predicates.scala`
- To resolve the name conflict, rename `o.a.s.sqlcatalyst.expressions.codegen.Predicate` to `o.a.s.sqlcatalyst.expressions.BasePredicate`
- Extend `CodeGeneratorWithInterpretedFallback ` for `BasePredicate`
This comes from the cloud-fan suggestion: https://github.com/apache/spark/pull/26420#discussion_r348005497
### Why are the changes needed?
For better code/test coverage.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes#26604 from maropu/RefactorPredicate.
Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR fixes the issue of substituting aliases while collecting filters in `PhysicalOperation.collectProjectsAndFilters`. When the `AttributeReference` in alias map differs from the `AttributeReference` in filter condition only in qualifier, it does not substitute alias and throws exception saying `key videoid#47L not found` in the following scenario.
```
[1] Project [userid#0]
+- [2] Filter (isnotnull(videoid#47L) && NOT (videoid#47L = 30))
+- [3] Project [factorial(videoid#1) AS videoid#47L, userid#0]
+- [4] Filter (isnotnull(avebitrate#2) && (avebitrate#2 < 10))
+- [5] Relation[userid#0,videoid#1,avebitrate#2]
```
### Why are the changes needed?
We need to use `AttributeMap` where the key is `AttributeReference`'s `ExprId` instead of `Map[Attribute, Expression]` while collecting and substituting aliases in `PhysicalOperation.collectProjectsAndFilters`.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
New unit tests were added in `TestPhysicalOperation` which reproduces the bug
Closes#25761 from nikitagkonda/SPARK-29029-use-attributemap-for-aliasmap-in-physicaloperation.
Authored-by: Nikita Konda <nikita.konda@workday.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Remove the special handling of the negative sign in the parser (interval literal and type constructor)
### Why are the changes needed?
The negative sign is an operator (UnaryMinus). We don't need to handle it specially, which is kind of doing constant folding at parser side.
### Does this PR introduce any user-facing change?
The error message becomes a little different. Now it reports type mismatch for the `-` operator.
### How was this patch tested?
existing tests
Closes#26578 from cloud-fan/interval.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
In the PR, I propose to add tests from the commit 47cb1f359a for Spark 2.4 that check formatting of timestamp strings for various seconds fractions.
### Why are the changes needed?
To make sure that current behavior is the same as in Spark 2.4
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
By running `CSVSuite`, `JsonFunctionsSuite` and `TimestampFormatterSuite`.
Closes#26601 from MaxGekk/format-timestamp-micros-tests.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
`AdaptiveSparkPlanExec` should forward `executeCollect` and `executeTake` to the underlying physical plan.
### Why are the changes needed?
some physical plan has optimization in `executeCollect` and `executeTake`. For example, `CollectLimitExec` won't do shuffle for outermost limit.
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
a new test
This closes#26560Closes#26576 from cloud-fan/aqe.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
### What changes were proposed in this pull request?
Exception improvement.
### Why are the changes needed?
After selecting pgSQL dialect, queries which are failing because of wrong syntax will give long exception stack trace. For example,
`explain select cast ("abc" as boolean);`
Current output:
> ERROR SparkSQLDriver: Failed in [explain select cast ("abc" as boolean)]
> java.lang.IllegalArgumentException: invalid input syntax for type boolean: abc
> at org.apache.spark.sql.catalyst.expressions.postgreSQL.PostgreCastToBoolean.$anonfun$castToBoolean$2(PostgreCastToBoolean.scala:51)
> at org.apache.spark.sql.catalyst.expressions.CastBase.buildCast(Cast.scala:277)
> at org.apache.spark.sql.catalyst.expressions.postgreSQL.PostgreCastToBoolean.$anonfun$castToBoolean$1(PostgreCastToBoolean.scala:44)
> at org.apache.spark.sql.catalyst.expressions.CastBase.nullSafeEval(Cast.scala:773)
> at org.apache.spark.sql.catalyst.expressions.UnaryExpression.eval(Expression.scala:460)
> at org.apache.spark.sql.catalyst.optimizer.ConstantFolding$$anonfun$apply$1$$anonfun$applyOrElse$1.applyOrElse(expressions.scala:52)
> at org.apache.spark.sql.catalyst.optimizer.ConstantFolding$$anonfun$apply$1$$anonfun$applyOrElse$1.applyOrElse(expressions.scala:45)
> at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$1(TreeNode.scala:286)
> at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72)
> at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:286)
> at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$transformDown$3(TreeNode.scala:291)
> at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:376)
> at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:214)
> at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:374)
> at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:327)
> at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:291)
> at org.apache.spark.sql.catalyst.plans.QueryPlan.
> .
> .
> .
### Does this PR introduce any user-facing change?
Yes. After this PR, output for above query will be:
> == Physical Plan ==
> org.apache.spark.sql.AnalysisException: invalid input syntax for type boolean: abc;
>
> Time taken: 0.044 seconds, Fetched 1 row(s)
> 19/11/15 15:38:57 INFO SparkSQLCLIDriver: Time taken: 0.044 seconds, Fetched 1 row(s)
### How was this patch tested?
Updated existing test cases.
Closes#26546 from jobitmathew/pgsqlexception.
Authored-by: Jobit Mathew <jobit.mathew@huawei.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Currently, we support to parse '1. second' to 1s or even '. second' to 0s.
```sql
-- !query 118
select interval '1. seconds'
-- !query 118 schema
struct<1 seconds:interval>
-- !query 118 output
1 seconds
-- !query 119
select interval '. seconds'
-- !query 119 schema
struct<0 seconds:interval>
-- !query 119 output
0 seconds
```
```sql
postgres=# select interval '1. second';
ERROR: invalid input syntax for type interval: "1. second"
LINE 1: select interval '1. second';
postgres=# select interval '. second';
ERROR: invalid input syntax for type interval: ". second"
LINE 1: select interval '. second';
```
We fix this by fixing the new interval parser's VALUE_FRACTIONAL_PART state
With further digging, we found that 1. is valid in python, r, scala, and presto and so on... so this PR
ONLY forbid the invalid interval value in the form of '. seconds'.
### Why are the changes needed?
bug fix
### Does this PR introduce any user-facing change?
yes, now we treat '. second' .... as invalid intervals
### How was this patch tested?
add ut
Closes#26573 from yaooqinn/SPARK-29926.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR update the local reader task number from 1 to multi `partitionStartIndices.length`.
### Why are the changes needed?
Improve the performance of local shuffle reader.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Existing UTs
Closes#26516 from JkSelf/improveLocalShuffleReader.
Authored-by: jiake <ke.a.jia@intel.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR try to make sure the comparison results of `compared by 8 bytes at a time` and `compared by bytes wise` in RecordBinaryComparator is *consistent*, by reverse long bytes if it is little-endian and using Long.compareUnsigned.
### Why are the changes needed?
If the architecture supports unaligned or the offset is 8 bytes aligned, `RecordBinaryComparator` compare 8 bytes at a time by reading 8 bytes as a long. Related code is
```
if (Platform.unaligned() || (((leftOff + i) % 8 == 0) && ((rightOff + i) % 8 == 0))) {
while (i <= leftLen - 8) {
final long v1 = Platform.getLong(leftObj, leftOff + i);
final long v2 = Platform.getLong(rightObj, rightOff + i);
if (v1 != v2) {
return v1 > v2 ? 1 : -1;
}
i += 8;
}
}
```
Otherwise, it will compare bytes by bytes. Related code is
```
while (i < leftLen) {
final int v1 = Platform.getByte(leftObj, leftOff + i) & 0xff;
final int v2 = Platform.getByte(rightObj, rightOff + i) & 0xff;
if (v1 != v2) {
return v1 > v2 ? 1 : -1;
}
i += 1;
}
```
However, on little-endian machine, the result of *compared by a long value* and *compared bytes by bytes* maybe different.
For two same records, its offsets may vary in the first run and second run, which will lead to compare them using long comparison or byte-by-byte comparison, the result maybe different.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Add new test cases in RecordBinaryComparatorSuite
Closes#26548 from WangGuangxin/binary_comparator.
Authored-by: wangguangxin.cn <wangguangxin.cn@bytedance.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Update `Literal.sql` to make date, timestamp and interval consistent. They should all use the `TYPE 'value'` format.
### Why are the changes needed?
Make the default alias consistent. For example, without this patch we will see
```
scala> sql("select interval '1 day', date '2000-10-10'").show
+------+-----------------+
|1 days|DATE '2000-10-10'|
+------+-----------------+
|1 days| 2000-10-10|
+------+-----------------+
```
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
existing tests
Closes#26579 from cloud-fan/sql.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
In our production, HiveMetastoreCatalog#convertToLogicalRelation throws AssertError occasionally:
```sql
scala> spark.table("hive_table").show
java.lang.AssertionError: assertion failed
at scala.Predef$.assert(Predef.scala:208)
at org.apache.spark.sql.hive.HiveMetastoreCatalog.convertToLogicalRelation(HiveMetastoreCatalog.scala:261)
at org.apache.spark.sql.hive.HiveMetastoreCatalog.convert(HiveMetastoreCatalog.scala:137)
at org.apache.spark.sql.hive.RelationConversions$$anonfun$apply$4.applyOrElse(HiveStrategies.scala:220)
at org.apache.spark.sql.hive.RelationConversions$$anonfun$apply$4.applyOrElse(HiveStrategies.scala:207)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDown$2(AnalysisHelper.scala:108)
at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:72)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDown$1(AnalysisHelper.scala:108)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:194)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsDown(AnalysisHelper.scala:106)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsDown$(AnalysisHelper.scala:104)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsDown(LogicalPlan.scala:29)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDown$4(AnalysisHelper.scala:113)
at org.apache.spark.sql.catalyst.trees.TreeNode.$anonfun$mapChildren$1(TreeNode.scala:376)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapProductIterator(TreeNode.scala:214)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:374)
at org.apache.spark.sql.catalyst.trees.TreeNode.mapChildren(TreeNode.scala:327)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.$anonfun$resolveOperatorsDown$1(AnalysisHelper.scala:113)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.allowInvokingTransformsInAnalyzer(AnalysisHelper.scala:194)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsDown(AnalysisHelper.scala:106)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperatorsDown$(AnalysisHelper.scala:104)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperatorsDown(LogicalPlan.scala:29)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperators(AnalysisHelper.scala:73)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper.resolveOperators$(AnalysisHelper.scala:72)
at org.apache.spark.sql.catalyst.plans.logical.LogicalPlan.resolveOperators(LogicalPlan.scala:29)
at org.apache.spark.sql.hive.RelationConversions.apply(HiveStrategies.scala:207)
at org.apache.spark.sql.hive.RelationConversions.apply(HiveStrategies.scala:191)
at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$2(RuleExecutor.scala:130)
at scala.collection.IndexedSeqOptimized.foldLeft(IndexedSeqOptimized.scala:60)
at scala.collection.IndexedSeqOptimized.foldLeft$(IndexedSeqOptimized.scala:68)
at scala.collection.mutable.ArrayBuffer.foldLeft(ArrayBuffer.scala:49)
at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1(RuleExecutor.scala:127)
at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1$adapted(RuleExecutor.scala:119)
at scala.collection.immutable.List.foreach(List.scala:392)
at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:119)
at org.apache.spark.sql.catalyst.analysis.Analyzer.org$apache$spark$sql$catalyst$analysis$Analyzer$$executeSameContext(Analyzer.scala:168)
at org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:162)
at org.apache.spark.sql.catalyst.analysis.Analyzer.execute(Analyzer.scala:122)
at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$executeAndTrack$1(RuleExecutor.scala:98)
at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:88)
at org.apache.spark.sql.catalyst.rules.RuleExecutor.executeAndTrack(RuleExecutor.scala:98)
at org.apache.spark.sql.catalyst.analysis.Analyzer.$anonfun$executeAndCheck$1(Analyzer.scala:146)
at org.apache.spark.sql.catalyst.plans.logical.AnalysisHelper$.markInAnalyzer(AnalysisHelper.scala:201)
at org.apache.spark.sql.catalyst.analysis.Analyzer.executeAndCheck(Analyzer.scala:145)
at org.apache.spark.sql.execution.QueryExecution.$anonfun$analyzed$1(QueryExecution.scala:66)
at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111)
at org.apache.spark.sql.execution.QueryExecution.analyzed$lzycompute(QueryExecution.scala:63)
at org.apache.spark.sql.execution.QueryExecution.analyzed(QueryExecution.scala:63)
at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:55)
at org.apache.spark.sql.Dataset$.ofRows(Dataset.scala:86)
at org.apache.spark.sql.SparkSession.table(SparkSession.scala:585)
at org.apache.spark.sql.SparkSession.table(SparkSession.scala:581)
... 47 elided
````
Most of cases occurred in reading a table which created by an old Spark version.
After recreated the table, the issue will be gone.
After deep dive, the root cause is this external table is a non-partitioned table but the `LOCATION` set to a partitioned path {{/tablename/dt=yyyymmdd}}. The partitionSpec is inferred.
### Why are the changes needed?
Above error message is very confused. We need more details about assert failure information.
This issue caused by `PartitioningAwareFileIndex#inferPartitioning()`. For non-HiveMetastore Spark, it's useful. But for Hive table, it shouldn't infer partition if Hive tell us it's a non partitioned table. (new added)
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Add UT.
Closes#26499 from LantaoJin/SPARK-29869.
Authored-by: LantaoJin <jinlantao@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR adds `ALTER TABLE a.b.c RENAME TO x.y.x` support for V2 catalogs.
### Why are the changes needed?
The current implementation doesn't support this command V2 catalogs.
### Does this PR introduce any user-facing change?
Yes, now the renaming table works for v2 catalogs:
```
scala> spark.sql("SHOW TABLES IN testcat.ns1.ns2").show
+---------+---------+
|namespace|tableName|
+---------+---------+
| ns1.ns2| old|
+---------+---------+
scala> spark.sql("ALTER TABLE testcat.ns1.ns2.old RENAME TO testcat.ns1.ns2.new").show
scala> spark.sql("SHOW TABLES IN testcat.ns1.ns2").show
+---------+---------+
|namespace|tableName|
+---------+---------+
| ns1.ns2| new|
+---------+---------+
```
### How was this patch tested?
Added unit tests.
Closes#26539 from imback82/rename_table.
Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR add guides for `ThriftServerQueryTestSuite`.
### Why are the changes needed?
Add guides
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
N/A
Closes#26587 from wangyum/SPARK-28527-FOLLOW-UP.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This is a followup of https://github.com/apache/spark/pull/26418. This PR removed `CalendarInterval`'s `toString` with an unfinished changes.
### Why are the changes needed?
1. Ideally we should make each PR isolated and separate targeting one issue without touching unrelated codes.
2. There are some other places where the string formats were exposed to users. For example:
```scala
scala> sql("select interval 1 days as a").selectExpr("to_csv(struct(a))").show()
```
```
+--------------------------+
|to_csv(named_struct(a, a))|
+--------------------------+
| "CalendarInterval...|
+--------------------------+
```
3. Such fixes:
```diff
private def writeMapData(
map: MapData, mapType: MapType, fieldWriter: ValueWriter): Unit = {
val keyArray = map.keyArray()
+ val keyString = mapType.keyType match {
+ case CalendarIntervalType =>
+ (i: Int) => IntervalUtils.toMultiUnitsString(keyArray.getInterval(i))
+ case _ => (i: Int) => keyArray.get(i, mapType.keyType).toString
+ }
```
can cause performance regression due to type dispatch for each map.
### Does this PR introduce any user-facing change?
Yes, see 2. case above.
### How was this patch tested?
Manually tested.
Closes#26572 from HyukjinKwon/SPARK-29783.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR is a followup of https://github.com/apache/spark/pull/26530 and proposes to move the configuration `spark.sql.defaultUrlStreamHandlerFactory.enabled` to `StaticSQLConf.scala` for consistency.
### Why are the changes needed?
To put the similar configurations together and for readability.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Manually tested as described in https://github.com/apache/spark/pull/26530.
Closes#26570 from HyukjinKwon/SPARK-25694.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
When regenerating golden files, the set operations via `--SET` will not be done, but those with --import should be exceptions because we need the set command.
### Why are the changes needed?
fix test tool.
### Does this PR introduce any user-facing change?
### How was this patch tested?
add ut, but I'm not sure we need these tests for tests itself.
cc maropu cloud-fan
Closes#26557 from yaooqinn/SPARK-29873.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Checked with SQL Standard and PostgreSQL
> CHAR is equivalent to CHARACTER. DEC is equivalent to DECIMAL. INT is equivalent to INTEGER. VARCHAR is equivalent to CHARACTER VARYING. ...
```sql
postgres=# select dec '1.0';
numeric
---------
1.0
(1 row)
postgres=# select CHARACTER '. second';
bpchar
----------
. second
(1 row)
postgres=# select CHAR '. second';
bpchar
----------
. second
(1 row)
```
### Why are the changes needed?
For better ansi support
### Does this PR introduce any user-facing change?
yes, we add character as char and dec as decimal
### How was this patch tested?
add ut
Closes#26574 from yaooqinn/SPARK-29941.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Add AlterNamespaceSetLocationStatement, AlterNamespaceSetLocation, AlterNamespaceSetLocationExec to make ALTER DATABASE (SET LOCATION) look up catalog like v2 commands.
And also refine the code of AlterNamespaceSetProperties, AlterNamespaceSetPropertiesExec, DescribeNamespace, DescribeNamespaceExec to use SupportsNamespaces instead of CatalogPlugin for catalog parameter.
### Why are the changes needed?
It's important to make all the commands have the same catalog/namespace resolution behavior, to avoid confusing end-users.
### Does this PR introduce any user-facing change?
Yes, add "ALTER NAMESPACE ... SET LOCATION" whose function is same as "ALTER DATABASE ... SET LOCATION" and "ALTER SCHEMA ... SET LOCATION".
### How was this patch tested?
New unit tests
Closes#26562 from fuwhu/SPARK-29859.
Authored-by: fuwhu <bestwwg@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
We now have two different implementation for multi-units interval strings to CalendarInterval type values.
One is used to covert interval string literals to CalendarInterval. This approach will re-delegate the interval string to spark parser which handles the string as a `singleInterval` -> `multiUnitsInterval` -> eventually call `IntervalUtils.fromUnitStrings`
The other is used in `Cast`, which eventually calls `IntervalUtils.stringToInterval`. This approach is ~10 times faster than the other.
We should unify these two for better performance and simple logic. this pr uses the 2nd approach.
### Why are the changes needed?
We should unify these two for better performance and simple logic.
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
we shall not fail on existing uts
Closes#26491 from yaooqinn/SPARK-29870.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Add 3 interval output types which are named as `SQL_STANDARD`, `ISO_8601`, `MULTI_UNITS`. And we add a new conf `spark.sql.dialect.intervalOutputStyle` for this. The `MULTI_UNITS` style displays the interval values in the former behavior and it is the default. The newly added `SQL_STANDARD`, `ISO_8601` styles can be found in the following table.
Style | conf | Year-Month Interval | Day-Time Interval | Mixed Interval
-- | -- | -- | -- | --
Format With Time Unit Designators | MULTI_UNITS | 1 year 2 mons | 1 days 2 hours 3 minutes 4.123456 seconds | interval 1 days 2 hours 3 minutes 4.123456 seconds
SQL STANDARD | SQL_STANDARD | 1-2 | 3 4:05:06 | -1-2 3 -4:05:06
ISO8601 Basic Format| ISO_8601| P1Y2M| P3DT4H5M6S|P-1Y-2M3D-4H-5M-6S
### Why are the changes needed?
for ANSI SQL support
### Does this PR introduce any user-facing change?
yes,interval out now has 3 output styles
### How was this patch tested?
add new unit tests
cc cloud-fan maropu MaxGekk HyukjinKwon thanks.
Closes#26418 from yaooqinn/SPARK-29783.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
I've noticed that there are two functions to sort arrays sort_array and array_sort.
sort_array is from 1.5.0 and it has the possibility of ordering both ascending and descending
array_sort is from 2.4.0 and it only has the possibility of ordering in ascending.
Basically I just added the possibility of ordering either ascending or descending using array_sort.
I think it would be good to have unified behaviours and not having to user sort_array when you want to order in descending order.
Imagine that you are new to spark, I'd like to be able to sort array using the newest spark functions.
### Why are the changes needed?
Basically to be able to sort the array in descending order using *array_sort* instead of using *sort_array* from 1.5.0
### Does this PR introduce any user-facing change?
Yes, now you are able to sort the array in descending order. Note that it has the same behaviour with nulls than sort_array
### How was this patch tested?
Test's added
This is the link to the [jira](https://issues.apache.org/jira/browse/SPARK-29020)
Closes#25728 from Gschiavon/improving-array-sort.
Lead-authored-by: gschiavon <german.schiavon@lifullconnect.com>
Co-authored-by: Takuya UESHIN <ueshin@databricks.com>
Co-authored-by: gschiavon <Gschiavon@users.noreply.github.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Add a property `spark.fsUrlStreamHandlerFactory.enabled` to allow users turn off the default registration of `org.apache.hadoop.fs.FsUrlStreamHandlerFactory`
### Why are the changes needed?
This [SPARK-25694](https://issues.apache.org/jira/browse/SPARK-25694) is a long-standing issue. Originally, [[SPARK-12868][SQL] Allow adding jars from hdfs](https://github.com/apache/spark/pull/17342 ) added this for better Hive support. However, this have a side-effect when the users use Apache Spark without `-Phive`. This causes exceptions when the users tries to use another custom factories or 3rd party library (trying to set this). This configuration will unblock those non-hive users.
### Does this PR introduce any user-facing change?
Yes. This provides a new user-configurable property.
By default, the behavior is unchanged.
### How was this patch tested?
Manual testing.
**BEFORE**
```
$ build/sbt package
$ bin/spark-shell
scala> sql("show tables").show
+--------+---------+-----------+
|database|tableName|isTemporary|
+--------+---------+-----------+
+--------+---------+-----------+
scala> java.net.URL.setURLStreamHandlerFactory(new org.apache.hadoop.fs.FsUrlStreamHandlerFactory())
java.lang.Error: factory already defined
at java.net.URL.setURLStreamHandlerFactory(URL.java:1134)
... 47 elided
```
**AFTER**
```
$ build/sbt package
$ bin/spark-shell --conf spark.sql.defaultUrlStreamHandlerFactory.enabled=false
scala> sql("show tables").show
+--------+---------+-----------+
|database|tableName|isTemporary|
+--------+---------+-----------+
+--------+---------+-----------+
scala> java.net.URL.setURLStreamHandlerFactory(new org.apache.hadoop.fs.FsUrlStreamHandlerFactory())
```
Closes#26530 from jiangzho/master.
Lead-authored-by: Zhou Jiang <zhou_jiang@apple.com>
Co-authored-by: Dongjoon Hyun <dhyun@apple.com>
Co-authored-by: zhou-jiang <zhou_jiang@apple.com>
Signed-off-by: DB Tsai <d_tsai@apple.com>
### What changes were proposed in this pull request?
SPARK-27444 introduced `dmlStatementNoWith` so that any dml that needs cte support can leverage it. It be better if we move DELETE/UPDATE/MERGE rules to `dmlStatementNoWith`.
### Why are the changes needed?
Wit this change, we can support syntax like "With t AS (SELECT) DELETE FROM xxx", and so as UPDATE/MERGE.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
New cases added.
Closes#26536 from xianyinxin/SPARK-29907.
Authored-by: xy_xin <xianyin.xxy@alibaba-inc.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
In the PR, I propose to remove the following SQL configs:
1. `spark.sql.fromJsonForceNullableSchema`
2. `spark.sql.legacy.compareDateTimestampInTimestamp`
3. `spark.sql.legacy.allowCreatingManagedTableUsingNonemptyLocation`
that are declared to be removed in Spark 3.0
### Why are the changes needed?
To make code cleaner and improve maintainability.
### Does this PR introduce any user-facing change?
Yes
### How was this patch tested?
By `TypeCoercionSuite`, `JsonExpressionsSuite` and `DDLSuite`.
Closes#26559 from MaxGekk/remove-sql-configs.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
What changes were proposed in this pull request?
Some of the columns of JDBC/ODBC tab Session info in Web UI are hard to understand.
Add tool tip for Start time, finish time , Duration and Total Execution
![Screenshot from 2019-10-16 12-33-17](https://user-images.githubusercontent.com/51401130/66901981-76d68980-f01d-11e9-9686-e20346a38c25.png)
Why are the changes needed?
To improve the understanding of the WebUI
Does this PR introduce any user-facing change?
No
How was this patch tested?
manual test
Closes#26138 from PavithraRamachandran/JDBC_tooltip.
Authored-by: Pavithra Ramachandran <pavi.rams@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request?
Add AlterNamespaceSetPropertiesStatement, AlterNamespaceSetProperties and AlterNamespaceSetPropertiesExec to make ALTER DATABASE (SET DBPROPERTIES) command look up catalog like v2 commands.
### Why are the changes needed?
It's important to make all the commands have the same catalog/namespace resolution behavior, to avoid confusing end-users.
### Does this PR introduce any user-facing change?
Yes, add "ALTER NAMESPACE ... SET (DBPROPERTIES | PROPERTIES) ..." whose function is same as "ALTER DATABASE ... SET DBPROPERTIES ..." and "ALTER SCHEMA ... SET DBPROPERTIES ...".
### How was this patch tested?
New unit test
Closes#26551 from fuwhu/SPARK-29858.
Authored-by: fuwhu <bestwwg@163.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
In the PR, I propose to add tests from the commit 9c7e8be1dc for Spark 2.4 that check parsing of timestamp strings for various seconds fractions.
### Why are the changes needed?
To make sure that current behavior is the same as in Spark 2.4
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
By running `CSVSuite`, `JsonFunctionsSuite` and `TimestampFormatterSuite`.
Closes#26558 from MaxGekk/parse-timestamp-micros-tests.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Rename config "spark.sql.ansi.enabled" to "spark.sql.dialect.spark.ansi.enabled"
### Why are the changes needed?
The relation between "spark.sql.ansi.enabled" and "spark.sql.dialect" is confusing, since the "PostgreSQL" dialect should contain the features of "spark.sql.ansi.enabled".
To make things clearer, we can rename the "spark.sql.ansi.enabled" to "spark.sql.dialect.spark.ansi.enabled", thus the option "spark.sql.dialect.spark.ansi.enabled" is only for Spark dialect.
For the casting and arithmetic operations, runtime exceptions should be thrown if "spark.sql.dialect" is "spark" and "spark.sql.dialect.spark.ansi.enabled" is true or "spark.sql.dialect" is PostgresSQL.
### Does this PR introduce any user-facing change?
Yes, the config name changed.
### How was this patch tested?
Existing UT.
Closes#26444 from xuanyuanking/SPARK-29807.
Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR aims to add `io.netty.tryReflectionSetAccessible=true` to the testing configuration for JDK11 because this is an officially documented requirement of Apache Arrow.
Apache Arrow community documented this requirement at `0.15.0` ([ARROW-6206](https://github.com/apache/arrow/pull/5078)).
> #### For java 9 or later, should set "-Dio.netty.tryReflectionSetAccessible=true".
> This fixes `java.lang.UnsupportedOperationException: sun.misc.Unsafe or java.nio.DirectByteBuffer.(long, int) not available`. thrown by netty.
### Why are the changes needed?
After ARROW-3191, Arrow Java library requires the property `io.netty.tryReflectionSetAccessible` to be set to true for JDK >= 9. After https://github.com/apache/spark/pull/26133, JDK11 Jenkins job seem to fail.
- https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-3.2-jdk-11/676/
- https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-3.2-jdk-11/677/
- https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-3.2-jdk-11/678/
```scala
Previous exception in task:
sun.misc.Unsafe or java.nio.DirectByteBuffer.<init>(long, int) not available

io.netty.util.internal.PlatformDependent.directBuffer(PlatformDependent.java:473)

io.netty.buffer.NettyArrowBuf.getDirectBuffer(NettyArrowBuf.java:243)

io.netty.buffer.NettyArrowBuf.nioBuffer(NettyArrowBuf.java:233)

io.netty.buffer.ArrowBuf.nioBuffer(ArrowBuf.java:245)

org.apache.arrow.vector.ipc.message.ArrowRecordBatch.computeBodyLength(ArrowRecordBatch.java:222)

```
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Pass the Jenkins with JDK11.
Closes#26552 from dongjoon-hyun/SPARK-ARROW-JDK11.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This pr is to remove floating-point `Sum/Average/CentralMomentAgg` from order-insensitive aggregates in `EliminateSorts`.
This pr comes from the gatorsmile suggestion: https://github.com/apache/spark/pull/26011#discussion_r344583899
### Why are the changes needed?
Bug fix.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Added tests in `SubquerySuite`.
Closes#26534 from maropu/SPARK-29343-FOLLOWUP.
Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Add DescribeNamespaceStatement, DescribeNamespace and DescribeNamespaceExec
to make "DESC DATABASE" look up catalog like v2 commands.
### Why are the changes needed?
It's important to make all the commands have the same catalog/namespace resolution behavior, to avoid confusing end-users.
### Does this PR introduce any user-facing change?
Yes, add "DESC NAMESPACE" whose function is same as "DESC DATABASE" and "DESC SCHEMA".
### How was this patch tested?
New unit test
Closes#26513 from fuwhu/SPARK-29834.
Authored-by: fuwhu <bestwwg@163.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR proposes to show Python, pandas and PyArrow versions in integrated UDF tests as a clue so when the test cases fail, it show the related version information.
I think we don't really need this kind of version information in the test case name for now since I intend that integrated SQL test cases do not target to test different combinations of Python, Pandas and PyArrow.
### Why are the changes needed?
To make debug easier.
### Does this PR introduce any user-facing change?
It will change test name to include related Python, pandas and PyArrow versions.
### How was this patch tested?
Manually tested:
```
[info] - udf/postgreSQL/udf-case.sql - Scala UDF *** FAILED *** (8 seconds, 229 milliseconds)
[info] udf/postgreSQL/udf-case.sql - Scala UDF
...
[info] - udf/postgreSQL/udf-case.sql - Regular Python UDF *** FAILED *** (6 seconds, 298 milliseconds)
[info] udf/postgreSQL/udf-case.sql - Regular Python UDF
[info] Python: 3.7
...
[info] - udf/postgreSQL/udf-case.sql - Scalar Pandas UDF *** FAILED *** (6 seconds, 376 milliseconds)
[info] udf/postgreSQL/udf-case.sql - Scalar Pandas UDF
[info] Python: 3.7 Pandas: 0.25.3 PyArrow: 0.14.0
```
Closes#26538 from HyukjinKwon/investigate-flaky-test.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Add ShowTableStatement and make SHOW TABLE EXTENDED go through the same catalog/table resolution framework of v2 commands.
We don’t have this methods in the catalog to implement an V2 command
- catalog.getPartition
- catalog.getTempViewOrPermanentTableMetadata
### Why are the changes needed?
It's important to make all the commands have the same table resolution behavior, to avoid confusing
```sql
USE my_catalog
DESC t // success and describe the table t from my_catalog
SHOW TABLE EXTENDED FROM LIKE 't' // report table not found as there is no table t in the session catalog
```
### Does this PR introduce any user-facing change?
Yes. When running SHOW TABLE EXTENDED Spark fails the command if the current catalog is set to a v2 catalog, or the table name specified a v2 catalog.
### How was this patch tested?
Unit tests.
Closes#26540 from planga82/feature/SPARK-29481_ShowTableExtended.
Authored-by: Pablo Langa <soypab@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This is a follow-up pr to fix the code coming from #23400; it replaces `update` with `setByte` for ByteType in `JdbcUtils.makeGetter`.
### Why are the changes needed?
For better code.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes#26532 from maropu/SPARK-26499-FOLLOWUP.
Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request?
In order to avoid frequently changing the value of `spark.sql.adaptive.shuffle.maxNumPostShufflePartitions`, we usually set `spark.sql.adaptive.shuffle.maxNumPostShufflePartitions` much larger than `spark.sql.shuffle.partitions` after enabling adaptive execution, which causes some bucket map join lose efficacy and add more `ShuffleExchange`.
How to reproduce:
```scala
val bucketedTableName = "bucketed_table"
spark.range(10000).write.bucketBy(500, "id").sortBy("id").mode(org.apache.spark.sql.SaveMode.Overwrite).saveAsTable(bucketedTableName)
val bucketedTable = spark.table(bucketedTableName)
val df = spark.range(8)
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
// Spark 2.4. spark.sql.adaptive.enabled=false
// We set spark.sql.shuffle.partitions <= 500 every time based on our data in this case.
spark.conf.set("spark.sql.shuffle.partitions", 500)
bucketedTable.join(df, "id").explain()
// Since 3.0. We enabled adaptive execution and set spark.sql.adaptive.shuffle.maxNumPostShufflePartitions to a larger values to fit more cases.
spark.conf.set("spark.sql.adaptive.enabled", true)
spark.conf.set("spark.sql.adaptive.shuffle.maxNumPostShufflePartitions", 1000)
bucketedTable.join(df, "id").explain()
```
```
scala> bucketedTable.join(df, "id").explain()
== Physical Plan ==
*(4) Project [id#5L]
+- *(4) SortMergeJoin [id#5L], [id#7L], Inner
:- *(1) Sort [id#5L ASC NULLS FIRST], false, 0
: +- *(1) Project [id#5L]
: +- *(1) Filter isnotnull(id#5L)
: +- *(1) ColumnarToRow
: +- FileScan parquet default.bucketed_table[id#5L] Batched: true, DataFilters: [isnotnull(id#5L)], Format: Parquet, Location: InMemoryFileIndex[file:/root/opensource/apache-spark/spark-3.0.0-SNAPSHOT-bin-3.2.0/spark-warehou..., PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: struct<id:bigint>, SelectedBucketsCount: 500 out of 500
+- *(3) Sort [id#7L ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(id#7L, 500), true, [id=#49]
+- *(2) Range (0, 8, step=1, splits=16)
```
vs
```
scala> bucketedTable.join(df, "id").explain()
== Physical Plan ==
AdaptiveSparkPlan(isFinalPlan=false)
+- Project [id#5L]
+- SortMergeJoin [id#5L], [id#7L], Inner
:- Sort [id#5L ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(id#5L, 1000), true, [id=#93]
: +- Project [id#5L]
: +- Filter isnotnull(id#5L)
: +- FileScan parquet default.bucketed_table[id#5L] Batched: true, DataFilters: [isnotnull(id#5L)], Format: Parquet, Location: InMemoryFileIndex[file:/root/opensource/apache-spark/spark-3.0.0-SNAPSHOT-bin-3.2.0/spark-warehou..., PartitionFilters: [], PushedFilters: [IsNotNull(id)], ReadSchema: struct<id:bigint>, SelectedBucketsCount: 500 out of 500
+- Sort [id#7L ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(id#7L, 1000), true, [id=#92]
+- Range (0, 8, step=1, splits=16)
```
This PR makes read bucketed tables always obeys `spark.sql.shuffle.partitions` even enabling adaptive execution and set `spark.sql.adaptive.shuffle.maxNumPostShufflePartitions` to avoid add more `ShuffleExchange`.
### Why are the changes needed?
Do not degrade performance after enabling adaptive execution.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Unit test.
Closes#26409 from wangyum/SPARK-29655.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Current string to interval cast logic does not support i.e. cast('.111 second' as interval) which will fail in SIGN state and return null, actually, it is 00:00:00.111.
```scala
-- !query 63
select interval '.111 seconds'
-- !query 63 schema
struct<0.111 seconds:interval>
-- !query 63 output
0.111 seconds
-- !query 64
select cast('.111 seconds' as interval)
-- !query 64 schema
struct<CAST(.111 seconds AS INTERVAL):interval>
-- !query 64 output
NULL
````
### Why are the changes needed?
bug fix.
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
add ut
Closes#26514 from yaooqinn/SPARK-29888.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Upgrade Apache Arrow to version 0.15.1. This includes Java artifacts and increases the minimum required version of PyArrow also.
Version 0.12.0 to 0.15.1 includes the following selected fixes/improvements relevant to Spark users:
* ARROW-6898 - [Java] Fix potential memory leak in ArrowWriter and several test classes
* ARROW-6874 - [Python] Memory leak in Table.to_pandas() when conversion to object dtype
* ARROW-5579 - [Java] shade flatbuffer dependency
* ARROW-5843 - [Java] Improve the readability and performance of BitVectorHelper#getNullCount
* ARROW-5881 - [Java] Provide functionalities to efficiently determine if a validity buffer has completely 1 bits/0 bits
* ARROW-5893 - [C++] Remove arrow::Column class from C++ library
* ARROW-5970 - [Java] Provide pointer to Arrow buffer
* ARROW-6070 - [Java] Avoid creating new schema before IPC sending
* ARROW-6279 - [Python] Add Table.slice method or allow slices in \_\_getitem\_\_
* ARROW-6313 - [Format] Tracking for ensuring flatbuffer serialized values are aligned in stream/files.
* ARROW-6557 - [Python] Always return pandas.Series from Array/ChunkedArray.to_pandas, propagate field names to Series from RecordBatch, Table
* ARROW-2015 - [Java] Use Java Time and Date APIs instead of JodaTime
* ARROW-1261 - [Java] Add container type for Map logical type
* ARROW-1207 - [C++] Implement Map logical type
Changelog can be seen at https://arrow.apache.org/release/0.15.0.html
### Why are the changes needed?
Upgrade to get bug fixes, improvements, and maintain compatibility with future versions of PyArrow.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Existing tests, manually tested with Python 3.7, 3.8
Closes#26133 from BryanCutler/arrow-upgrade-015-SPARK-29376.
Authored-by: Bryan Cutler <cutlerb@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
move interval tests to `interval.sql`, and import it to `ansi/interval.sql`
### Why are the changes needed?
improve test coverage
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
N/A
Closes#26515 from cloud-fan/test.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This is a followup of https://github.com/apache/spark/pull/23977 I made a mistake related to this line: 3725b1324f (diff-71c2cad03f08cb5f6c70462aa4e28d3aL112)
Previously,
1. the reader iterator for R worker read some initial data eagerly during RDD materialization. So it read the data before actual execution. For some reasons, in this case, it showed standard error from R worker.
2. After that, when error happens during actual execution, stderr wasn't shown: 3725b1324f (diff-71c2cad03f08cb5f6c70462aa4e28d3aL260)
After my change 3725b1324f (diff-71c2cad03f08cb5f6c70462aa4e28d3aL112), it now ignores 1. case and only does 2. of previous code path, because 1. does not happen anymore as I avoided to such eager execution (which is consistent with PySpark code path).
This PR proposes to do only 1. before/after execution always because It is pretty much possible R worker was failed during actual execution and it's best to show the stderr from R worker whenever possible.
### Why are the changes needed?
It currently swallows standard error from R worker which makes debugging harder.
### Does this PR introduce any user-facing change?
Yes,
```R
df <- createDataFrame(list(list(n=1)))
collect(dapply(df, function(x) {
stop("asdkjasdjkbadskjbsdajbk")
x
}, structType("a double")))
```
**Before:**
```
Error in handleErrors(returnStatus, conn) :
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 13.0 failed 1 times, most recent failure: Lost task 0.0 in stage 13.0 (TID 13, 192.168.35.193, executor driver): org.apache.spark.SparkException: R worker exited unexpectedly (cranshed)
at org.apache.spark.api.r.RRunner$$anon$1.read(RRunner.scala:130)
at org.apache.spark.api.r.BaseRRunner$ReaderIterator.hasNext(BaseRRunner.scala:118)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:726)
at org.apache.spark.sql.execution.SparkPlan.$anonfun$getByteArrayRdd$1(SparkPlan.scala:337)
at org.apache.spark.
```
**After:**
```
Error in handleErrors(returnStatus, conn) :
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1.0 (TID 1, 192.168.35.193, executor driver): org.apache.spark.SparkException: R unexpectedly exited.
R worker produced errors: Error in computeFunc(inputData) : asdkjasdjkbadskjbsdajbk
at org.apache.spark.api.r.BaseRRunner$ReaderIterator$$anonfun$1.applyOrElse(BaseRRunner.scala:144)
at org.apache.spark.api.r.BaseRRunner$ReaderIterator$$anonfun$1.applyOrElse(BaseRRunner.scala:137)
at scala.runtime.AbstractPartialFunction.apply(AbstractPartialFunction.scala:38)
at org.apache.spark.api.r.RRunner$$anon$1.read(RRunner.scala:128)
at org.apache.spark.api.r.BaseRRunner$ReaderIterator.hasNext(BaseRRunner.scala:113)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegen
```
### How was this patch tested?
Manually tested and unittest was added.
Closes#26517 from HyukjinKwon/SPARK-26923-followup.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR addresses issues where conflicting attributes in `Expand` are not correctly handled.
### Why are the changes needed?
```Scala
val numsDF = Seq(1, 2, 3, 4, 5, 6).toDF("nums")
val cubeDF = numsDF.cube("nums").agg(max(lit(0)).as("agcol"))
cubeDF.join(cubeDF, "nums").show
```
fails with the following exception:
```
org.apache.spark.sql.AnalysisException:
Failure when resolving conflicting references in Join:
'Join Inner
:- Aggregate [nums#38, spark_grouping_id#36], [nums#38, max(0) AS agcol#35]
: +- Expand [List(nums#3, nums#37, 0), List(nums#3, null, 1)], [nums#3, nums#38, spark_grouping_id#36]
: +- Project [nums#3, nums#3 AS nums#37]
: +- Project [value#1 AS nums#3]
: +- LocalRelation [value#1]
+- Aggregate [nums#38, spark_grouping_id#36], [nums#38, max(0) AS agcol#58]
+- Expand [List(nums#3, nums#37, 0), List(nums#3, null, 1)], [nums#3, nums#38, spark_grouping_id#36]
^^^^^^^
+- Project [nums#3, nums#3 AS nums#37]
+- Project [value#1 AS nums#3]
+- LocalRelation [value#1]
Conflicting attributes: nums#38
```
As you can see from the above plan, `num#38`, the output of `Expand` on the right side of `Join`, should have been handled to produce new attribute. Since the conflict is not resolved in `Expand`, the failure is happening upstream at `Aggregate`. This PR addresses handling conflicting attributes in `Expand`.
### Does this PR introduce any user-facing change?
Yes, the previous example now shows the following output:
```
+----+-----+-----+
|nums|agcol|agcol|
+----+-----+-----+
| 1| 0| 0|
| 6| 0| 0|
| 4| 0| 0|
| 2| 0| 0|
| 5| 0| 0|
| 3| 0| 0|
+----+-----+-----+
```
### How was this patch tested?
Added new unit test.
Closes#26441 from imback82/spark-29682.
Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This pr is to support `--import` directive to load queries from another test case in SQLQueryTestSuite.
This fix comes from the cloud-fan suggestion in https://github.com/apache/spark/pull/26479#discussion_r345086978
### Why are the changes needed?
This functionality might reduce duplicate test code in `SQLQueryTestSuite`.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Run `SQLQueryTestSuite`.
Closes#26497 from maropu/ImportTests.
Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Make SparkSQL's `cast to boolean` behavior be consistent with PostgreSQL when
spark.sql.dialect is configured as PostgreSQL.
### Why are the changes needed?
SparkSQL and PostgreSQL have a lot different cast behavior between types by default. We should make SparkSQL's cast behavior be consistent with PostgreSQL when `spark.sql.dialect` is configured as PostgreSQL.
### Does this PR introduce any user-facing change?
Yes. If user switches to PostgreSQL dialect now, they will
* get an exception if they input a invalid string, e.g "erut", while they get `null` before;
* get an exception if they input `TimestampType`, `DateType`, `LongType`, `ShortType`, `ByteType`, `DecimalType`, `DoubleType`, `FloatType` values, while they get `true` or `false` result before.
And here're evidences for those unsupported types from PostgreSQL:
timestamp:
```
postgres=# select cast(cast('2019-11-11' as timestamp) as boolean);
ERROR: cannot cast type timestamp without time zone to boolean
```
date:
```
postgres=# select cast(cast('2019-11-11' as date) as boolean);
ERROR: cannot cast type date to boolean
```
bigint:
```
postgres=# select cast(cast('20191111' as bigint) as boolean);
ERROR: cannot cast type bigint to boolean
```
smallint:
```
postgres=# select cast(cast(2019 as smallint) as boolean);
ERROR: cannot cast type smallint to boolean
```
bytea:
```
postgres=# select cast(cast('2019' as bytea) as boolean);
ERROR: cannot cast type bytea to boolean
```
decimal:
```
postgres=# select cast(cast('2019' as decimal) as boolean);
ERROR: cannot cast type numeric to boolean
```
float:
```
postgres=# select cast(cast('2019' as float) as boolean);
ERROR: cannot cast type double precision to boolean
```
### How was this patch tested?
Added and tested manually.
Closes#26463 from Ngone51/dev-postgre-cast2bool.
Authored-by: wuyi <ngone_5451@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
We already know task attempts that do not clean up output files in staging directory can cause job failure (SPARK-27194). There was proposals trying to fix it by changing output filename, or deleting existing output files. These proposals are not reliable completely.
The difficulty is, as previous failed task attempt wrote the output file, at next task attempt the output file is still under same staging directory, even the output file name is different.
If the job will go to fail eventually, there is no point to re-run the task until max attempts are reached. For the jobs running a lot of time, re-running the task can waste a lot of time.
This patch proposes to let Spark detect such file already exist exception and stop the task set early.
### Why are the changes needed?
For now, if FileAlreadyExistsException is thrown during data writing job in SQL, the job will continue re-running task attempts until max failure number is reached. It is no point for re-running tasks as task attempts will also fail because they can not write to the existing file too. We should stop the task set early.
### Does this PR introduce any user-facing change?
Yes. If FileAlreadyExistsException is thrown during data writing job in SQL, no more task attempts are re-tried and the task set will be stoped early.
### How was this patch tested?
Unit test.
Closes#26312 from viirya/stop-taskset-if-outputfile-exists.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Corrected ShortType and ByteType mapping to SmallInt and TinyInt, corrected setter methods to set ShortType and ByteType as setShort() and setByte(). Changes in JDBCUtils.scala
Fixed Unit test cases to where applicable and added new E2E test cases in to test table read/write using ShortType and ByteType.
#### Problems
- In master in JDBCUtils.scala line number 547 and 551 have a problem where ShortType and ByteType are set as Integers rather than set as Short and Byte respectively.
```
case ShortType =>
(stmt: PreparedStatement, row: Row, pos: Int) =>
stmt.setInt(pos + 1, row.getShort(pos))
The issue was pointed out by maropu
case ByteType =>
(stmt: PreparedStatement, row: Row, pos: Int) =>
stmt.setInt(pos + 1, row.getByte(pos))
```
- Also at line JDBCUtils.scala 247 TinyInt is interpreted wrongly as IntergetType in getCatalystType()
``` case java.sql.Types.TINYINT => IntegerType ```
- At line 172 ShortType was wrongly interpreted as IntegerType
``` case ShortType => Option(JdbcType("INTEGER", java.sql.Types.SMALLINT)) ```
- All thru out tests, ShortType and ByteType were being interpreted as IntegerTypes.
### Why are the changes needed?
A given type should be set using the right type.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Corrected Unit test cases where applicable. Validated in CI/CD
Added a test case in MsSqlServerIntegrationSuite.scala, PostgresIntegrationSuite.scala , MySQLIntegrationSuite.scala to write/read tables from dataframe with cols as shorttype and bytetype. Validated by manual as follows.
```
./build/mvn install -DskipTests
./build/mvn test -Pdocker-integration-tests -pl :spark-docker-integration-tests_2.12
```
Closes#26301 from shivsood/shorttype_fix_maropu.
Authored-by: shivsood <shivsood@microsoft.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
`saveAsTable` had an oversight where write options were not considered in the append save mode.
### Why are the changes needed?
Address the bug so that write options can be considered during appends.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Unit test added that looks in the logic plan of `AppendData` for the existing write options.
Closes#26474 from SpaceRangerWes/master.
Authored-by: Wesley Hoffman <wesleyhoffman109@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR adds a SQL Conf: `spark.sql.streaming.stopActiveRunOnRestart`. When this conf is `true` (by default it is), an already running stream will be stopped, if a new copy gets launched on the same checkpoint location.
### Why are the changes needed?
In multi-tenant environments where you have multiple SparkSessions, you can accidentally start multiple copies of the same stream (i.e. streams using the same checkpoint location). This will cause all new instantiations of the new stream to fail. However, sometimes you may want to turn off the old stream, as the old stream may have turned into a zombie (you no longer have access to the query handle or SparkSession).
It would be nice to have a SQL flag that allows the stopping of the old stream for such zombie cases.
### Does this PR introduce any user-facing change?
Yes. Now by default, if you launch a new copy of an already running stream on a multi-tenant cluster, the existing stream will be stopped.
### How was this patch tested?
Unit tests in StreamingQueryManagerSuite
Closes#26225 from brkyvz/stopStream.
Lead-authored-by: Burak Yavuz <brkyvz@gmail.com>
Co-authored-by: Burak Yavuz <burak@databricks.com>
Signed-off-by: Burak Yavuz <brkyvz@gmail.com>
### What changes were proposed in this pull request?
rename EveryAgg/AnyAgg to BoolAnd/BoolOr
### Why are the changes needed?
Under ansi mode, `every`, `any` and `some` are reserved keywords and can't be used as function names. `EveryAgg`/`AnyAgg` has several aliases and I think it's better to not pick reserved keywords as the primary name.
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
existing tests
Closes#26486 from cloud-fan/naming.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
rename the config to address the comment: https://github.com/apache/spark/pull/24594#discussion_r285431212
improve the config description, provide a default value to simplify the code.
### Why are the changes needed?
make the config more understandable.
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
existing tests
Closes#26395 from cloud-fan/config.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
The current parse and analyze flow for DELETE is: 1, the SQL string will be firstly parsed to `DeleteFromStatement`; 2, the `DeleteFromStatement` be converted to `DeleteFromTable`. However, the SQL string can be parsed to `DeleteFromTable` directly, where a `DeleteFromStatement` seems to be redundant.
It is the same for UPDATE.
This pr removes the unnecessary `DeleteFromStatement` and `UpdateTableStatement`.
### Why are the changes needed?
This makes the codes for DELETE and UPDATE cleaner, and keep align with MERGE INTO.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Existed tests and new tests.
Closes#26464 from xianyinxin/SPARK-29835.
Authored-by: xy_xin <xianyin.xxy@alibaba-inc.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Currently, `SupportsNamespaces.dropNamespace` drops a namespace only if it is empty. Thus, to implement a cascading drop, one needs to iterate all objects (tables, view, etc.) within the namespace (including its sub-namespaces recursively) and drop them one by one. This can have a negative impact on the performance when there are large number of objects.
Instead, this PR proposes to change the default behavior of dropping a namespace to cascading such that implementing cascading/non-cascading drop is simpler without performance penalties.
### Why are the changes needed?
The new behavior makes implementing cascading/non-cascading drop simple without performance penalties.
### Does this PR introduce any user-facing change?
Yes. The default behavior of `SupportsNamespaces.dropNamespace` is now cascading.
### How was this patch tested?
Added new unit tests.
Closes#26476 from imback82/drop_ns_cascade.
Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Add 3 interval functions justify_days, justify_hours, justif_interval to support justify interval values
### Why are the changes needed?
For feature parity with postgres
add three interval functions to justify interval values.
justify_days(interval) | interval | Adjust interval so 30-day time periods are represented as months | justify_days(interval '35 days') | 1 mon 5 days
-- | -- | -- | -- | --
justify_hours(interval) | interval | Adjust interval so 24-hour time periods are represented as days | justify_hours(interval '27 hours') | 1 day 03:00:00
justify_interval(interval) | interval | Adjust interval using justify_days and justify_hours, with additional sign adjustments | justify_interval(interval '1 mon -1 hour') | 29 days 23:00:00
### Does this PR introduce any user-facing change?
yes. new interval functions are added
### How was this patch tested?
add ut
Closes#26465 from yaooqinn/SPARK-29390.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
Follow comment of https://github.com/apache/spark/pull/25854#discussion_r342383272
### Why are the changes needed?
NO
### Does this PR introduce any user-facing change?
NO
### How was this patch tested?
ADD TEST CASE
Closes#26406 from AngersZhuuuu/SPARK-29145-FOLLOWUP.
Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
All tooltips message will display in centre.
### Why are the changes needed?
Some time tooltips will hide the data of column and tooltips display position will be inconsistent in UI.
### Does this PR introduce any user-facing change?
yes.
![Screenshot 2019-10-26 at 3 08 51 AM](https://user-images.githubusercontent.com/8948111/67606124-04dd0d80-f79e-11e9-865a-b7e9bffc9890.png)
### How was this patch tested?
Manual test.
Closes#26263 from 07ARB/SPARK-29570.
Lead-authored-by: Ankitraj <8948111+07ARB@users.noreply.github.com>
Co-authored-by: 07ARB <ankitrajboudh@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request?
When creating v2 expressions, we have public java APIs, as well as interval scala APIs. All of these APIs take a string column name and parse it to `NamedReference`.
This is convenient for end-users, but not for interval development. For example, the query plan already contains the parsed partition/bucket column names, and it's tricky if we need to quote the names before creating v2 expressions.
This PR proposes to change the interval scala APIs to take `NamedReference` directly, with a new method to create `NamedReference` with the exact name parts. The public java APIs are not changed.
### Why are the changes needed?
fix a bug, and make it easier to create v2 expressions correctly in the future.
### Does this PR introduce any user-facing change?
yes, now v2 CREATE TABLE works as expected.
### How was this patch tested?
a new test
Closes#26425 from cloud-fan/extract.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Ryan Blue <blue@apache.org>
### What changes were proposed in this pull request?
When whole stage codegen `HashAggregateExec`, create the hash map when we begin to process inputs.
### Why are the changes needed?
Sort-merge join completes directly if the left side table is empty. If there is an aggregate in the right side, the aggregate will not be triggered at all, but its hash map is created during codegen and can't be released.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
a new test
Closes#26471 from cloud-fan/memory.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
```sql
-- !query 83
select -integer '7'
-- !query 83 schema
struct<7:int>
-- !query 83 output
7
-- !query 86
select -date '1999-01-01'
-- !query 86 schema
struct<DATE '1999-01-01':date>
-- !query 86 output
1999-01-01
-- !query 87
select -timestamp '1999-01-01'
-- !query 87 schema
struct<TIMESTAMP('1999-01-01 00:00:00'):timestamp>
-- !query 87 output
1999-01-01 00:00:00
```
the integer should be -7 and the date and timestamp results are confusing which should throw exceptions
### Why are the changes needed?
bug fix
### Does this PR introduce any user-facing change?
NO
### How was this patch tested?
ADD UTs
Closes#26479 from yaooqinn/SPARK-29855.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Add ShowTablePropertiesStatement and make SHOW TBLPROPERTIES go through the same catalog/table resolution framework of v2 commands.
### Why are the changes needed?
It's important to make all the commands have the same table resolution behavior, to avoid confusing end-users. e.g.
USE my_catalog
DESC t // success and describe the table t from my_catalog
SHOW TBLPROPERTIES t // report table not found as there is no table t in the session catalog
### Does this PR introduce any user-facing change?
yes. When running SHOW TBLPROPERTIES Spark fails the command if the current catalog is set to a v2 catalog, or the table name specified a v2 catalog.
### How was this patch tested?
Unit tests.
Closes#26176 from planga82/feature/SPARK-29519_SHOW_TBLPROPERTIES_datasourceV2.
Authored-by: Pablo Langa <soypab@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This patch fixes the edge case of streaming left/right outer join described below:
Suppose query is provided as
`select * from A join B on A.id = B.id AND (A.ts <= B.ts AND B.ts <= A.ts + interval 5 seconds)`
and there're two rows for L1 (from A) and R1 (from B) which ensures L1.id = R1.id and L1.ts = R1.ts.
(we can simply imagine it from self-join)
Then Spark processes L1 and R1 as below:
- row L1 and row R1 are joined at batch 1
- row R1 is evicted at batch 2 due to join and watermark condition, whereas row L1 is not evicted
- row L1 is evicted at batch 3 due to join and watermark condition
When determining outer rows to match with null, Spark applies some assumption commented in codebase, as below:
```
Checking whether the current row matches a key in the right side state, and that key
has any value which satisfies the filter function when joined. If it doesn't,
we know we can join with null, since there was never (including this batch) a match
within the watermark period. If it does, there must have been a match at some point, so
we know we can't join with null.
```
But as explained the edge-case earlier, the assumption is not correct. As we don't have any good assumption to optimize which doesn't have edge-case, we have to track whether such row is matched with others before, and match with null row only when the row is not matched.
To track the matching of row, the patch adds a new state to streaming join state manager, and mark whether the row is matched to others or not. We leverage the information when dealing with eviction of rows which would be candidates to match with null rows.
This approach introduces new state format which is not compatible with old state format - queries with old state format will be still running but they will still have the issue and be required to discard checkpoint and rerun to take this patch in effect.
### Why are the changes needed?
This patch fixes a correctness issue.
### Does this PR introduce any user-facing change?
No for compatibility viewpoint, but we'll encourage end users to discard the old checkpoint and rerun the query if they run stream-stream outer join query with old checkpoint, which might be "yes" for the question.
### How was this patch tested?
Added UT which fails on current Spark and passes with this patch. Also passed existing streaming join UTs.
Closes#26108 from HeartSaVioR/SPARK-26154-shorten-alternative.
Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
This unblocks the event handling thread, which should help avoid dropped
events when large queries are running.
Existing unit tests should already cover this code.
Closes#26405 from vanzin/SPARK-29766.
Authored-by: Marcelo Vanzin <vanzin@cloudera.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Enable nested schema pruning and nested pruning on expressions by default. We have been using those features in production in Apple for couple months with great success. For some jobs, we reduce the data reading by more than 8x and 21x faster in wall clock time.
### Why are the changes needed?
Better performance.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes#26443 from dbtsai/enableNestedSchemaPrunning.
Authored-by: DB Tsai <d_tsai@apple.com>
Signed-off-by: DB Tsai <d_tsai@apple.com>
### What changes were proposed in this pull request?
For better test coverage, this pr is to add join-related configs in `inner-join.sql` and `postgreSQL/join.sql`. These join related configs were just copied from ones in the other join-related tests in `SQLQueryTestSuite` (e.g., https://github.com/apache/spark/blob/master/sql/core/src/test/resources/sql-tests/inputs/natural-join.sql#L2-L4).
### Why are the changes needed?
Better test coverage.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes#26459 from maropu/AddJoinConds.
Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
With the latest string to literal optimization https://github.com/apache/spark/pull/26256, some interval strings can not be cast when there are some spaces between signs and unit values. After state `PARSE_SIGN`, it directly goes to `PARSE_UNIT_VALUE` when takes a space character as the end. So when there are some white spaces come before the real unit value, it fails to parse, we should add a new state like `TRIM_VALUE` to trim all these spaces.
How to re-produce, which aim the revisions since https://github.com/apache/spark/pull/26256 is merged
```sql
select cast(v as interval) from values ('+ 1 second') t(v);
select cast(v as interval) from values ('- 1 second') t(v);
```
### Why are the changes needed?
bug fix
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
1. ut
2. new benchmark test
Closes#26449 from yaooqinn/SPARK-29605.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Hive support STORED AS new file format syntax:
```sql
CREATE TABLE tbl(a int) STORED AS TEXTFILE;
CREATE TABLE tbl2 LIKE tbl STORED AS PARQUET;
```
We add a similar syntax for Spark. Here we separate to two features:
1. specify a different table provider in CREATE TABLE LIKE
2. Hive compatibility
In this PR, we address the first one:
- [ ] Using `USING provider` to specify a different table provider in CREATE TABLE LIKE.
- [ ] Using `STORED AS file_format` in CREATE TABLE LIKE to address Hive compatibility.
### Why are the changes needed?
Use CREATE TABLE tb1 LIKE tb2 command to create an empty table tb1 based on the definition of table tb2. The most user case is to create tb1 with the same schema of tb2. But an inconvenient case here is this command also copies the FileFormat from tb2, it cannot change the input/output format and serde. Add the ability of changing file format is useful for some scenarios like upgrading a table from a low performance file format to a high performance one (parquet, orc).
### Does this PR introduce any user-facing change?
Add a new syntax based on current CTL:
```sql
CREATE TABLE tbl2 LIKE tbl [USING parquet];
```
### How was this patch tested?
Modify some exist UTs.
Closes#26097 from LantaoJin/SPARK-29421.
Authored-by: lajin <lajin@ebay.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
In the PR, I propose new expression `MakeInterval` and register it as the function `make_interval`. The function accepts the following parameters:
- `years` - the number of years in the interval, positive or negative. The parameter is multiplied by 12, and added to interval's `months`.
- `months` - the number of months in the interval, positive or negative.
- `weeks` - the number of months in the interval, positive or negative. The parameter is multiplied by 7, and added to interval's `days`.
- `hours`, `mins` - the number of hours and minutes. The parameters can be negative or positive. They are converted to microseconds and added to interval's `microseconds`.
- `seconds` - the number of seconds with the fractional part in microseconds precision. It is converted to microseconds, and added to total interval's `microseconds` as `hours` and `minutes`.
For example:
```sql
spark-sql> select make_interval(2019, 11, 1, 1, 12, 30, 01.001001);
2019 years 11 months 8 days 12 hours 30 minutes 1.001001 seconds
```
### Why are the changes needed?
- To improve user experience with Spark SQL, and allow users making `INTERVAL` columns from other columns containing `years`, `months` ... `seconds`. Currently, users can make an `INTERVAL` column from other columns only by constructing a `STRING` column and cast it to `INTERVAL`. Have a look at the `IntervalBenchmark` as an example.
- To maintain feature parity with PostgreSQL which provides such function:
```sql
# SELECT make_interval(2019, 11);
make_interval
--------------------
2019 years 11 mons
```
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
- By new tests for the `MakeInterval` expression to `IntervalExpressionsSuite`
- By tests in `interval.sql`
Closes#26446 from MaxGekk/make_interval.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Provide Ellipses in Statement column , just like description in Jobs page .
### Why are the changes needed?
When a query is executed the whole query statement is displayed no matter how big it is. When bigger queries are executed, it covers a large portion of the page display, when we have multiple queries it is difficult to scroll down to view all.
### Does this PR introduce any user-facing change?
No
Before:
![Screenshot from 2019-11-01 23-15-23](https://user-images.githubusercontent.com/51401130/68064468-ebaa0300-fd41-11e9-8787-c5144c1468d4.png)
After:
![Screenshot from 2019-11-02 07-07-21](https://user-images.githubusercontent.com/51401130/68064471-f19fe400-fd41-11e9-85c6-65f0faa64cc3.png)
### How was this patch tested?
Manual
Closes#26364 from PavithraRamachandran/ellipse_JDBC.
Authored-by: Pavithra Ramachandran <pavi.rams@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request?
- `SqlBase.g4` is modified to support a negative sign `-` in the interval type constructor from a string and in interval literals
- Negate interval in `AstBuilder` if a sign presents.
- Interval related SQL statements are moved from `inputs/datetime.sql` to new file `inputs/interval.sql`
For example:
```sql
spark-sql> select -interval '-1 month 1 day -1 second';
1 months -1 days 1 seconds
spark-sql> select -interval -1 month 1 day -1 second;
1 months -1 days 1 seconds
```
### Why are the changes needed?
For feature parity with PostgreSQL which supports that:
```sql
# select -interval '-1 month 1 day -1 second';
?column?
-------------------------
1 mon -1 days +00:00:01
(1 row)
```
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
- Added tests to `ExpressionParserSuite`
- by `interval.sql`
Closes#26438 from MaxGekk/negative-interval.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
In the PR, I propose an enumeration for interval units with the value `YEAR`, `MONTH`, `WEEK`, `DAY`, `HOUR`, `MINUTE`, `SECOND`, `MILLISECOND`, `MICROSECOND` and `NANOSECOND`.
### Why are the changes needed?
- This should prevent typos in interval unit names
- Stronger type checking of unit parameters.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
By existing test suites `ExpressionParserSuite` and `IntervalUtilsSuite`
Closes#26455 from MaxGekk/interval-unit-enum.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
Add AlterViewAsStatement and make ALTER VIEW ... QUERY go through the same catalog/table resolution framework of v2 commands.
It's important to make all the commands have the same table resolution behavior, to avoid confusing end-users. e.g.
```
USE my_catalog
DESC v // success and describe the view v from my_catalog
ALTER VIEW v SELECT 1 // report view not found as there is no view v in the session catalog
```
Yes. When running ALTER VIEW ... QUERY, Spark fails the command if the current catalog is set to a v2 catalog, or the view name specified a v2 catalog.
unit tests
Closes#26453 from huaxingao/spark-29730.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Remove the duplicate code
See the build failure: https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Compile/job/spark-master-compile-maven-hadoop-3.2/986/
### Why are the changes needed?
Fix the compilation
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
The existing tests
Closes#26445 from gatorsmile/hotfixPraser.
Authored-by: Xiao Li <gatorsmile@gmail.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
### What changes were proposed in this pull request?
This PR supports MERGE INTO in the parser and add the corresponding logical plan. The SQL syntax likes,
```
MERGE INTO [ds_catalog.][multi_part_namespaces.]target_table [AS target_alias]
USING [ds_catalog.][multi_part_namespaces.]source_table | subquery [AS source_alias]
ON <merge_condition>
[ WHEN MATCHED [ AND <condition> ] THEN <matched_action> ]
[ WHEN MATCHED [ AND <condition> ] THEN <matched_action> ]
[ WHEN NOT MATCHED [ AND <condition> ] THEN <not_matched_action> ]
```
where
```
<matched_action> =
DELETE |
UPDATE SET * |
UPDATE SET column1 = value1 [, column2 = value2 ...]
<not_matched_action> =
INSERT * |
INSERT (column1 [, column2 ...]) VALUES (value1 [, value2 ...])
```
### Why are the changes needed?
This is a start work for introduce `MERGE INTO` support for the builtin datasource, and the design work for the `MERGE INTO` support in DSV2.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
New test cases.
Closes#26167 from xianyinxin/SPARK-28893.
Authored-by: xy_xin <xianyin.xxy@alibaba-inc.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Replace qualifiedName with multipartIdentifier in parser rules of DDL commands.
### Why are the changes needed?
There are identifiers in some DDL rules we use `qualifiedName`. We should use `multipartIdentifier` because it can capture wrong identifiers such as `test-table`, `test-col`.
### Does this PR introduce any user-facing change?
Yes. Wrong identifiers such as test-table, will be captured now after this change.
### How was this patch tested?
Unit tests.
Closes#26419 from viirya/SPARK-29680-followup2.
Lead-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Co-authored-by: Liang-Chi Hsieh <liangchi@uber.com>
Signed-off-by: Liang-Chi Hsieh <liangchi@uber.com>
### What changes were proposed in this pull request?
interval type support >, >=, <, <=, =, <=>, order by, min,max..
### Why are the changes needed?
Part of SPARK-27764 Feature Parity between PostgreSQL and Spark
### Does this PR introduce any user-facing change?
yes, we now support compare intervals
### How was this patch tested?
add ut
Closes#26337 from yaooqinn/SPARK-29679.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
avg aggregate support interval type values
### Why are the changes needed?
Part of SPARK-27764 Feature Parity between PostgreSQL and Spark
### Does this PR introduce any user-facing change?
yes, we can do avg on intervals
### How was this patch tested?
add ut
Closes#26347 from yaooqinn/SPARK-29688.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Spark org.apache.spark.sql.functions do not have `if` function so conditions are expressed using `when-otherwise` function. However `If` (which is available in SQL) has more efficient code gen. This pr rewrites `when-otherwise` conditions to `If` if it is possible (`when-otherwise` with single branch)
### Why are the changes needed?
It is an optimization enhancement. Here is a simple performance comparison (tested in local mode (with 4 cores)):
```
val df = spark.range(10000000000L).withColumn("x", rand)
val resultA = df.withColumn("r", when($"x" < 0.5, lit(1)).otherwise(lit(0))).agg(sum($"r"))
val resultB = df.withColumn("r", expr("if(x < 0.5, 1, 0)")).agg(sum($"r"))
resultA.collect() // takes 56s to finish
resultB.collect() // takes 30s to finish
```
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
New test is added.
Closes#26294 from davidvrba/spark-28477_rewriteCaseWhenToIf.
Authored-by: davidvrba <vrba.dave@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
V2 catalog support namespace, we should add `withNamespace` like `withDatabase`.
### Why are the changes needed?
Make test easy.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Add UT.
Closes#26411 from ulysses-you/Add-test-with-namespace.
Authored-by: ulysses <youxiduo@weidian.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Move method add/subtract/negate from CalendarInterval to IntervalUtils
### Why are the changes needed?
https://github.com/apache/spark/pull/26410#discussion_r343125468 suggested here
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
add uts and move some
Closes#26423 from yaooqinn/SPARK-29787.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This aims to exclude the `preview` release to recover `HiveExternalCatalogVersionsSuite`. Currently, new preview release breaks `branch-2.4` PRBuilder since yesterday. New release (especially `preview`) should not affect `branch-2.4`.
- https://github.com/apache/spark/pull/26417 (Failed 4 times)
### Why are the changes needed?
**BEFORE**
```scala
scala> scala.io.Source.fromURL("https://dist.apache.org/repos/dist/release/spark/").mkString.split("\n").filter(_.contains("""<li><a href="spark-""")).map("""<a href="spark-(\d.\d.\d)/">""".r.findFirstMatchIn(_).get.group(1))
java.util.NoSuchElementException: None.get
```
**AFTER**
```scala
scala> scala.io.Source.fromURL("https://dist.apache.org/repos/dist/release/spark/").mkString.split("\n").filter(_.contains("""<li><a href="spark-""")).filterNot(_.contains("preview")).map("""<a href="spark-(\d.\d.\d)/">""".r.findFirstMatchIn(_).get.group(1))
res5: Array[String] = Array(2.3.4, 2.4.4)
```
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
This should pass the PRBuilder.
Closes#26428 from dongjoon-hyun/SPARK-HiveExternalCatalogVersionsSuite.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
```java
public static final int YEARS_PER_DECADE = 10;
public static final int YEARS_PER_CENTURY = 100;
public static final int YEARS_PER_MILLENNIUM = 1000;
public static final byte MONTHS_PER_QUARTER = 3;
public static final int MONTHS_PER_YEAR = 12;
public static final byte DAYS_PER_WEEK = 7;
public static final long DAYS_PER_MONTH = 30L;
public static final long HOURS_PER_DAY = 24L;
public static final long MINUTES_PER_HOUR = 60L;
public static final long SECONDS_PER_MINUTE = 60L;
public static final long SECONDS_PER_HOUR = MINUTES_PER_HOUR * SECONDS_PER_MINUTE;
public static final long SECONDS_PER_DAY = HOURS_PER_DAY * SECONDS_PER_HOUR;
public static final long MILLIS_PER_SECOND = 1000L;
public static final long MILLIS_PER_MINUTE = SECONDS_PER_MINUTE * MILLIS_PER_SECOND;
public static final long MILLIS_PER_HOUR = MINUTES_PER_HOUR * MILLIS_PER_MINUTE;
public static final long MILLIS_PER_DAY = HOURS_PER_DAY * MILLIS_PER_HOUR;
public static final long MICROS_PER_MILLIS = 1000L;
public static final long MICROS_PER_SECOND = MILLIS_PER_SECOND * MICROS_PER_MILLIS;
public static final long MICROS_PER_MINUTE = SECONDS_PER_MINUTE * MICROS_PER_SECOND;
public static final long MICROS_PER_HOUR = MINUTES_PER_HOUR * MICROS_PER_MINUTE;
public static final long MICROS_PER_DAY = HOURS_PER_DAY * MICROS_PER_HOUR;
public static final long MICROS_PER_MONTH = DAYS_PER_MONTH * MICROS_PER_DAY;
/* 365.25 days per year assumes leap year every four years */
public static final long MICROS_PER_YEAR = (36525L * MICROS_PER_DAY) / 100;
public static final long NANOS_PER_MICROS = 1000L;
public static final long NANOS_PER_MILLIS = MICROS_PER_MILLIS * NANOS_PER_MICROS;
public static final long NANOS_PER_SECOND = MILLIS_PER_SECOND * NANOS_PER_MILLIS;
```
The above parameters are defined in IntervalUtils, DateTimeUtils, and CalendarInterval, some of them are redundant, some of them are cross-referenced.
### Why are the changes needed?
To simplify code, enhance consistency and reduce risks
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
modified uts
Closes#26399 from yaooqinn/SPARK-29757.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
remove the leading "interval" in `CalendarInterval.toString`.
### Why are the changes needed?
Although it's allowed to have "interval" prefix when casting string to int, it's not recommended.
This is also consistent with pgsql:
```
cloud0fan=# select interval '1' day;
interval
----------
1 day
(1 row)
```
### Does this PR introduce any user-facing change?
yes, when display a dataframe with interval type column, the result is different.
### How was this patch tested?
updated tests.
Closes#26401 from cloud-fan/interval.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
In the PR, I propose new function `stringToInterval()` in `IntervalUtils` for converting `UTF8String` to `CalendarInterval`. The function is used in casting a `STRING` column to an `INTERVAL` column.
### Why are the changes needed?
The proposed implementation is ~10 times faster. For example, parsing 9 interval units on JDK 8:
Before:
```
9 units w/ interval 14004 14125 116 0.1 14003.6 0.0X
9 units w/o interval 13785 14056 290 0.1 13784.9 0.0X
```
After:
```
9 units w/ interval 1343 1344 1 0.7 1343.0 0.3X
9 units w/o interval 1345 1349 8 0.7 1344.6 0.3X
```
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
- By new tests for `stringToInterval` in `IntervalUtilsSuite`
- By existing tests
Closes#26256 from MaxGekk/string-to-interval.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Handle the inconsistence dividing zeros between literals and columns.
fix the null issue too.
### Why are the changes needed?
BUG FIX
### 1 Handle the inconsistence dividing zeros between literals and columns
```sql
-- !query 24
select
k,
v,
cast(k as interval) / v,
cast(k as interval) * v
from VALUES
('1 seconds', 1),
('2 seconds', 0),
('3 seconds', null),
(null, null),
(null, 0) t(k, v)
-- !query 24 schema
struct<k:string,v:int,divide_interval(CAST(k AS INTERVAL), CAST(v AS DOUBLE)):interval,multiply_interval(CAST(k AS INTERVAL), CAST(v AS DOUBLE)):interval>
-- !query 24 output
1 seconds 1 interval 1 seconds interval 1 seconds
2 seconds 0 interval 0 microseconds interval 0 microseconds
3 seconds NULL NULL NULL
NULL 0 NULL NULL
NULL NULL NULL NULL
```
```sql
-- !query 21
select interval '1 year 2 month' / 0
-- !query 21 schema
struct<divide_interval(interval 1 years 2 months, CAST(0 AS DOUBLE)):interval>
-- !query 21 output
NULL
```
in the first case, interval ’2 seconds ‘ / 0, it produces `interval 0 microseconds `
in the second case, it is `null`
### 2 null literal issues
```sql
-- !query 20
select interval '1 year 2 month' / null
-- !query 20 schema
struct<>
-- !query 20 output
org.apache.spark.sql.AnalysisException
cannot resolve '(interval 1 years 2 months / NULL)' due to data type mismatch: differing types in '(interval 1 years 2 months / NULL)' (interval and null).; line 1 pos 7
-- !query 22
select interval '4 months 2 weeks 6 days' * null
-- !query 22 schema
struct<>
-- !query 22 output
org.apache.spark.sql.AnalysisException
cannot resolve '(interval 4 months 20 days * NULL)' due to data type mismatch: differing types in '(interval 4 months 20 days * NULL)' (interval and null).; line 1 pos 7
-- !query 23
select null * interval '4 months 2 weeks 6 days'
-- !query 23 schema
struct<>
-- !query 23 output
org.apache.spark.sql.AnalysisException
cannot resolve '(NULL * interval 4 months 20 days)' due to data type mismatch: differing types in '(NULL * interval 4 months 20 days)' (null and interval).; line 1 pos 7
```
dividing or multiplying null literals, error occurs; where in column is fine as the first case
### Does this PR introduce any user-facing change?
NO, maybe yes, but it is just a follow-up
### How was this patch tested?
add uts
cc cloud-fan MaxGekk maropu
Closes#26410 from yaooqinn/SPARK-29387.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Update `LocalShuffleReaderExec.outputPartitioning` to use attributes from `ReusedQueryStage`.
This also removes the override `doCanonicalize` in local/coalesced shuffle reader, as these 2 operators change the output partitioning. It's not safe to strip them in the canonicalized query plan.
### Why are the changes needed?
We will have an invalid output partitioning if we don fix it.
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
existing tests
Closes#26400 from cloud-fan/aqe.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
### What changes were proposed in this pull request?
This patch fixes the bug that `ContinuousMemoryStream[String]` throws error regarding ClassCastException - cast String to UTFString. This is because ContinuousMemoryStream and ContinuousRecordEndpoint uses origin input as it is for underlying data structure of Row, and encoding is missing here.
To force encoding, this patch changes the element type of underlying array to UnsafeRow instead of Any for ContinuousRecordEndpoint - ContinuousMemoryStream and TextSocketContinuousStream are modified to reflect the change.
### Why are the changes needed?
Above section describes the bug.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Add new UT to check for availability on couple of types.
Closes#26300 from HeartSaVioR/SPARK-29642.
Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
### What changes were proposed in this pull request?
instead of checking the exact number of local shuffle readers, we should check whether the number of shuffles is equal to the number of local readers.
### Why are the changes needed?
AQE is known to have randomness. We may pick different build side for broadcast join depending on which query stage finishes first. The decision to build side may add/remove shuffles downstream, so it's flaky to check the exact number of local shuffle readers.
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
test only PR.
Closes#26394 from cloud-fan/test.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
### What changes were proposed in this pull request?
Added UT for the classes `ThriftServerPage.scala` and `ThriftServerSessionPage.scala`
### Why are the changes needed?
Currently, there are no UTs for testing Thriftserver UI page
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
UT
Closes#26403 from shahidki31/ut.
Authored-by: shahid <shahidki31@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
During creation of array, if CreateArray does not gets any children to set data type for array, it will create an array of null type .
### Why are the changes needed?
When empty array is created, it should be declared as array<null>.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Tested manually
Closes#26324 from amanomer/29462.
Authored-by: Aman Omer <amanomer1996@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This patch removes v1 ALTER TABLE CHANGE COLUMN syntax.
### Why are the changes needed?
Since in v2 we have ALTER TABLE CHANGE COLUMN and ALTER TABLE RENAME COLUMN, this old syntax is not necessary now and can be confusing.
The v2 ALTER TABLE CHANGE COLUMN should fallback to v1 AlterTableChangeColumnCommand (#26354).
### Does this PR introduce any user-facing change?
Yes, the old v1 ALTER TABLE CHANGE COLUMN syntax is removed.
### How was this patch tested?
Unit tests.
Closes#26338 from viirya/SPARK-29680.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Added new expressions `MultiplyInterval` and `DivideInterval` to multiply/divide an interval by a numeric. Updated `TypeCoercion.DateTimeOperations` to turn the `Multiply`/`Divide` expressions of `CalendarIntervalType` and `NumericType` to `MultiplyInterval`/`DivideInterval`.
To support new operations, added new methods `multiply()` and `divide()` to `CalendarInterval`.
### Why are the changes needed?
- To maintain feature parity with PostgreSQL which supports multiplication and division of intervals by doubles:
```sql
# select interval '1 hour' / double precision '1.5';
?column?
----------
00:40:00
```
- To conform the SQL standard which defines those operations: `numeric * interval`, `interval * numeric` and `interval / numeric`. See [4.5.3 Operations involving datetimes and intervals](http://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt).
- Improve Spark SQL UX and allow users to adjust interval columns. For example:
```sql
spark-sql> select (timestamp'now' - timestamp'yesterday') * 1.3;
interval 2 days 10 hours 39 minutes 38 seconds 568 milliseconds 900 microseconds
```
### Does this PR introduce any user-facing change?
Yes, previously the following query fails with the error:
```sql
spark-sql> select interval 1 hour 30 minutes * 1.5;
Error in query: cannot resolve '(interval 1 hours 30 minutes * 1.5BD)' due to data type mismatch: differing types in '(interval 1 hours 30 minutes * 1.5BD)' (interval and decimal(2,1)).; line 1 pos 7;
```
After:
```sql
spark-sql> select interval 1 hour 30 minutes * 1.5;
interval 2 hours 15 minutes
```
### How was this patch tested?
- Added tests for the `multiply()` and `divide()` methods to `CalendarIntervalSuite.java`
- New test suite `IntervalExpressionsSuite`
- by tests for `Multiply` -> `MultiplyInterval` and `Divide` -> `DivideInterval` in `TypeCoercionSuite`
- updated `datetime.sql`
Closes#26132 from MaxGekk/interval-mul-div.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Add AlterTableSerDePropertiesStatement and make ALTER TABLE ... SET SERDE/SERDEPROPERTIES go through the same catalog/table resolution framework of v2 commands.
### Why are the changes needed?
It's important to make all the commands have the same table resolution behavior, to avoid confusing end-users. e.g.
```
USE my_catalog
DESC t // success and describe the table t from my_catalog
ALTER TABLE t SET SERDE 'org.apache.class' // report table not found as there is no table t in the session catalog
```
### Does this PR introduce any user-facing change?
Yes. When running ALTER TABLE ... SET SERDE/SERDEPROPERTIES, Spark fails the command if the current catalog is set to a v2 catalog, or the table name specified a v2 catalog.
### How was this patch tested?
Unit tests.
Closes#26374 from huaxingao/spark_29695.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Disallow creating a permanent view that references a temporary view in **expressions**.
### Why are the changes needed?
Creating a permanent view that references a temporary view is currently disallowed. For example,
```SQL
# The following throws org.apache.spark.sql.AnalysisException
# Not allowed to create a permanent view `per_view` by referencing a temporary view `tmp`;
CREATE VIEW per_view AS SELECT t1.a, t2.b FROM base_table t1, (SELECT * FROM tmp) t2"
```
However, the following is allowed.
```SQL
CREATE VIEW per_view AS SELECT * FROM base_table WHERE EXISTS (SELECT * FROM tmp);
```
This PR fixes the bug where temporary views used inside expressions are not checked.
### Does this PR introduce any user-facing change?
Yes. Now the following SQL query throws an exception as expected:
```SQL
# The following throws org.apache.spark.sql.AnalysisException
# Not allowed to create a permanent view `per_view` by referencing a temporary view `tmp`;
CREATE VIEW per_view AS SELECT * FROM base_table WHERE EXISTS (SELECT * FROM tmp);
```
### How was this patch tested?
Added new unit tests.
Closes#26361 from imback82/spark-29630.
Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR introduces a new SQL command: `SHOW CURRENT NAMESPACE`.
### Why are the changes needed?
Datasource V2 supports multiple catalogs/namespaces and having `SHOW CURRENT NAMESPACE` to retrieve the current catalog/namespace info would be useful.
### Does this PR introduce any user-facing change?
Yes, the user can perform the following:
```
scala> spark.sql("SHOW CURRENT NAMESPACE").show
+-------------+---------+
| catalog|namespace|
+-------------+---------+
|spark_catalog| default|
+-------------+---------+
scala> spark.sql("USE testcat.ns1.ns2").show
scala> spark.sql("SHOW CURRENT NAMESPACE").show
+-------+---------+
|catalog|namespace|
+-------+---------+
|testcat| ns1.ns2|
+-------+---------+
```
### How was this patch tested?
Added unit tests.
Closes#26379 from imback82/show_current_catalog.
Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
## What changes were proposed in this pull request?
This patch adds the option to clean up files which are completed in previous batch.
`cleanSource` -> "archive" / "delete" / "off"
The default value is "off", which Spark will do nothing.
If "delete" is specified, Spark will simply delete input files. If "archive" is specified, Spark will require additional config `sourceArchiveDir` which will be used to move input files to there. When archiving (via move) the path of input files are retained to the archived paths as sub-path.
Note that it is only applied to "micro-batch", since for batch all input files must be kept to get same result across multiple query executions.
## How was this patch tested?
Added UT. Manual test against local disk as well as HDFS.
Closes#22952 from HeartSaVioR/SPARK-20568.
Lead-authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Co-authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com>
Co-authored-by: Jungtaek Lim <kabhwan@gmail.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
## What changes were proposed in this pull request?
This pr proposes to be case insensitive when matching dialects via jdbc url prefix.
When I use jdbc url such as: ```jdbc: MySQL://localhost/db``` to query data through sparksql, the result is wrong, but MySQL supports such url writing.
because sparksql matches MySQLDialect by prefix ```jdbc:mysql```, so ```jdbc: MySQL``` is not matched with the correct dialect. Therefore, it should be case insensitive when identifying the corresponding dialect through jdbc url
https://issues.apache.org/jira/browse/SPARK-28552
## How was this patch tested?
UT.
Closes#25287 from teeyog/sql_dialect.
Lead-authored-by: yong.tian1 <yong.tian1@dmall.com>
Co-authored-by: Xingbo Jiang <xingbo.jiang@databricks.com>
Co-authored-by: Chris Martin <chris@cmartinit.co.uk>
Co-authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Co-authored-by: Dongjoon Hyun <dhyun@apple.com>
Co-authored-by: Kent Yao <yaooqinn@hotmail.com>
Co-authored-by: teeyog <teeyog@gmail.com>
Co-authored-by: Maxim Gekk <max.gekk@gmail.com>
Co-authored-by: Ryan Blue <blue@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
`SampleExec` has a bug that it sets `needCopyResult` to false as long as the `withReplacement` parameter is false. This causes problems if its child needs to copy the result, e.g. a join.
### Why are the changes needed?
to fix a correctness issue
### Does this PR introduce any user-facing change?
Yes, the result will be corrected.
### How was this patch tested?
a new test
Closes#26387 from cloud-fan/sample-bug.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Current checkstyle checking folder can't cover all folder.
Since for support multi version hive, we have some divided hive folder.
We should check it too.
### Why are the changes needed?
Fix build bug
### Does this PR introduce any user-facing change?
NO
### How was this patch tested?
NO
Closes#26385 from AngersZhuuuu/SPARK-29742.
Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
sum support interval values
### Why are the changes needed?
Part of SPARK-27764 Feature Parity between PostgreSQL and Spark
### Does this PR introduce any user-facing change?
yes, sum can evaluate intervals
### How was this patch tested?
add ut
Closes#26325 from yaooqinn/SPARK-29663.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Add AlterTableAddPartitionStatement and make ALTER TABLE ... ADD PARTITION go through the same catalog/table resolution framework of v2 commands.
### Why are the changes needed?
It's important to make all the commands have the same table resolution behavior, to avoid confusing end-users. e.g.
```
USE my_catalog
DESC t // success and describe the table t from my_catalog
ALTER TABLE t ADD PARTITION (id=1) // report table not found as there is no table t in the session catalog
```
### Does this PR introduce any user-facing change?
Yes. When running ALTER TABLE ... ADD PARTITION, Spark fails the command if the current catalog is set to a v2 catalog, or the table name specified a v2 catalog.
### How was this patch tested?
Unit tests
Closes#26369 from imback82/spark-29678.
Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Currently, JDBC/ODBC tab in the WEBUI doesn't support hiding table. Other tabs in the web ui like, Jobs, stages, SQL etc supports hiding table (refer https://github.com/apache/spark/pull/22592).
In this PR, added the support for hide table in the jdbc/odbc tab also.
### Why are the changes needed?
Spark ui about the contents of the form need to have hidden and show features, when the table records very much. Because sometimes you do not care about the record of the table, you just want to see the contents of the next table, but you have to scroll the scroll bar for a long time to see the contents of the next table.
### Does this PR introduce any user-facing change?
No, except support of hide table
### How was this patch tested?
Manually tested
![Screenshot 2019-11-01 at 12 10 05 PM](https://user-images.githubusercontent.com/23054875/68007364-61aa5d80-fca1-11e9-841e-c5a7382871fa.png)
![Screenshot 2019-11-01 at 12 10 43 PM](https://user-images.githubusercontent.com/23054875/68007355-5a834f80-fca1-11e9-844a-f4ba1a333db7.png)
Closes#26353 from shahidki31/hideTable.
Authored-by: shahid <shahidki31@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request?
- Retry the tests for special date-time values on failure. The tests can potentially fail when reference values were taken before midnight and test code resolves special values after midnight. The retry can guarantees that the tests run during the same day.
- Simplify getting of the current timestamp via `Instant.now()`. This should avoid any issues of converting current local datetime to an instance. For example, the same local time can be mapped to 2 instants when clocks are turned backward 1 hour on daylight saving date.
- Extract common code to SQLHelper
- Set the tested zoneId to the session time zone in `DateTimeUtilsSuite`.
### Why are the changes needed?
To make the tests more stable.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
By existing test suites `Date`/`TimestampFormatterSuite` and `DateTimeUtilsSuite`.
Closes#26380 from MaxGekk/retry-on-fail.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
If the resolved table is v1 table, AlterTableAlterColumnStatement fallbacks to v1 AlterTableChangeColumnCommand.
### Why are the changes needed?
To make the catalog/table lookup logic consistent.
### Does this PR introduce any user-facing change?
Yes, a ALTER TABLE ALTER COLUMN command previously fails on v1 tables. After this, it falls back to v1 AlterTableChangeColumnCommand.
### How was this patch tested?
Unit test.
Closes#26354 from viirya/SPARK-29353.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
In the PR, I propose to changed `CalendarInterval.toString`:
- to skip the `week` unit
- to convert `milliseconds` and `microseconds` as the fractional part of the `seconds` unit.
### Why are the changes needed?
To improve readability.
### Does this PR introduce any user-facing change?
Yes
### How was this patch tested?
- By `CalendarIntervalSuite` and `IntervalUtilsSuite`
- `literals.sql`, `datetime.sql` and `interval.sql`
Closes#26367 from MaxGekk/interval-to-string-format.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This is somewhat a complement of https://github.com/apache/spark/pull/21853.
The `Sort` without `Limit` operator in `Join` subquery is useless, it's the same case in `GroupBy` when the aggregation function is order irrelevant, such as `count`, `sum`.
This PR try to remove this kind of `Sort` operator in `SQL Optimizer`.
### Why are the changes needed?
For example, `select count(1) from (select a from test1 order by a)` is equal to `select count(1) from (select a from test1)`.
'select * from (select a from test1 order by a) t1 join (select b from test2) t2 on t1.a = t2.b' is equal to `select * from (select a from test1) t1 join (select b from test2) t2 on t1.a = t2.b`.
Remove useless `Sort` operator can improve performance.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Adding new UT `RemoveSortInSubquerySuite.scala`
Closes#26011 from WangGuangxin/remove_sorts.
Authored-by: wangguangxin.cn <wangguangxin.cn@bytedance.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Support non-reversed keywords to be used in high order functions.
### Why are the changes needed?
the keywords are non-reversed.
### Does this PR introduce any user-facing change?
yes, all non-reversed keywords can be used in high order function correctly
### How was this patch tested?
add uts
Closes#26366 from yaooqinn/SPARK-29722.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
The `assertEquals` method of JUnit Assert requires the first parameter to be the expected value. In this PR, I propose to change the order of parameters when the expected value is passed as the second parameter.
### Why are the changes needed?
Wrong order of assert parameters confuses when the assert fails and the parameters have special string representation. For example:
```java
assertEquals(input1.add(input2), new CalendarInterval(5, 5, 367200000000L));
```
```
java.lang.AssertionError:
Expected :interval 5 months 5 days 101 hours
Actual :interval 5 months 5 days 102 hours
```
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
By existing tests.
Closes#26377 from MaxGekk/fix-order-in-assert-equals.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
re-arrange the parser rules to make it clear that multiple unit TO unit statement like `SELECT INTERVAL '1-1' YEAR TO MONTH '2-2' YEAR TO MONTH` is not allowed.
### Why are the changes needed?
This is definitely an accident that we support such a weird syntax in the past. It's not supported by any other DBs and I can't think of any use case of it. Also no test covers this syntax in the current codebase.
### Does this PR introduce any user-facing change?
Yes, and a migration guide item is added.
### How was this patch tested?
new tests.
Closes#26285 from cloud-fan/syntax.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
###What changes were proposed in this pull request?
Add AlterTableDropPartitionStatement and make ALTER TABLE/VIEW ... DROP PARTITION go through the same catalog/table resolution framework of v2 commands.
### Why are the changes needed?
It's important to make all the commands have the same table resolution behavior, to avoid confusing end-users. e.g.
```
USE my_catalog
DESC t // success and describe the table t from my_catalog
ALTER TABLE t DROP PARTITION (id=1) // report table not found as there is no table t in the session catalog
```
### Does this PR introduce any user-facing change?
Yes. When running ALTER TABLE/VIEW ... DROP PARTITION, Spark fails the command if the current catalog is set to a v2 catalog, or the table name specified a v2 catalog.
### How was this patch tested?
Unit tests.
Closes#26303 from huaxingao/spark-29643.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Current CalendarInterval has 2 fields: months and microseconds. This PR try to change it
to 3 fields: months, days and microseconds. This is because one logical day interval may
have different number of microseconds (daylight saving).
### Why are the changes needed?
One logical day interval may have different number of microseconds (daylight saving).
For example, in PST timezone, there will be 25 hours from 2019-11-2 12:00:00 to
2019-11-3 12:00:00
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
unit test and new added test cases
Closes#26134 from LinhongLiu/calendarinterval.
Authored-by: Liu,Linhong <liulinhong@baidu.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Add AlterTableRenamePartitionStatement and make ALTER TABLE ... RENAME TO PARTITION go through the same catalog/table resolution framework of v2 commands.
### Why are the changes needed?
It's important to make all the commands have the same table resolution behavior, to avoid confusing end-users. e.g.
```
USE my_catalog
DESC t // success and describe the table t from my_catalog
ALTER TABLE t PARTITION (id=1) RENAME TO PARTITION (id=2) // report table not found as there is no table t in the session catalog
```
### Does this PR introduce any user-facing change?
Yes. When running ALTER TABLE ... RENAME TO PARTITION, Spark fails the command if the current catalog is set to a v2 catalog, or the table name specified a v2 catalog.
### How was this patch tested?
Unit tests.
Closes#26350 from huaxingao/spark_29676.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Liang-Chi Hsieh <liangchi@uber.com>
### What changes were proposed in this pull request?
Fix JDBC metrics counter data type. Related pull request [26109](https://github.com/apache/spark/pull/26109).
### Why are the changes needed?
Avoid overflow.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Exists UT.
Closes#26346 from ulysses-you/SPARK-29687.
Authored-by: ulysses <youxiduo@weidian.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
Now we use JDBC api and set an Illegal isolationLevel option, spark will throw a `scala.MatchError`, it's not friendly to user. So we should add an IllegalArgumentException.
### Why are the changes needed?
Make exception friendly to user.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Add UT.
Closes#26334 from ulysses-you/SPARK-29675.
Authored-by: ulysses <youxiduo@weidian.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
Bring back https://github.com/apache/spark/pull/25955
### What changes were proposed in this pull request?
This adds a new rule, `V2ScanRelationPushDown`, to push filters and projections in to a new `DataSourceV2ScanRelation` in the optimizer. That scan is then used when converting to a physical scan node. The new relation correctly reports stats based on the scan.
To run scan pushdown before rules where stats are used, this adds a new optimizer override, `earlyScanPushDownRules` and a batch for early pushdown in the optimizer, before cost-based join reordering. The other early pushdown rule, `PruneFileSourcePartitions`, is moved into the early pushdown rule set.
This also moves pushdown helper methods from `DataSourceV2Strategy` into a util class.
### Why are the changes needed?
This is needed for DSv2 sources to supply stats for cost-based rules in the optimizer.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
This updates the implementation of stats from `DataSourceV2Relation` so tests will fail if stats are accessed before early pushdown for v2 relations.
Closes#26341 from cloud-fan/back.
Lead-authored-by: Wenchen Fan <wenchen@databricks.com>
Co-authored-by: Ryan Blue <blue@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
[PR#25295](https://github.com/apache/spark/pull/25295) already implement the rule of converting the shuffle reader to local reader for the `BroadcastHashJoin` in probe side. This PR support converting the shuffle reader to local reader in build side.
### Why are the changes needed?
Improve performance
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
existing unit tests
Closes#26289 from JkSelf/supportTwoSideLocalReader.
Authored-by: jiake <ke.a.jia@intel.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This is code cleanup PR for https://github.com/apache/spark/pull/25600, aiming to remove an unnecessary condition and to correct a code comment.
### Why are the changes needed?
For code cleanup only.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Passed existing tests.
Closes#26328 from maryannxue/dpp-followup.
Authored-by: maryannxue <maryannxue@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
- Added `getDuration()` to calculate interval duration in specified time units assuming provided days per months
- Added `isNegative()` which return `true` is the interval duration is less than 0
- Fix checking negative intervals by using `isNegative()` in structured streaming classes
- Fix checking of `year-months` intervals
### Why are the changes needed?
This fixes incorrect checking of negative intervals. An interval is negative when its duration is negative but not if interval's months **or** microseconds is negative. Also this fixes checking of `year-month` interval support because the `month` field could be negative.
### Does this PR introduce any user-facing change?
Should not
### How was this patch tested?
- Added tests for the `getDuration()` and `isNegative()` methods to `IntervalUtilsSuite`
- By existing SS tests
Closes#26177 from MaxGekk/interval-is-positive.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Update `AlterTableSetLocationStatement` to store `partitionSpec` and make `ALTER TABLE a.b.c PARTITION(...) SET LOCATION 'loc'` fail if `partitionSpec` is set with unsupported message.
### Why are the changes needed?
It's important to make all the commands have the same table resolution behavior, to avoid confusing end-users. e.g.
```
USE my_catalog
DESC t // success and describe the table t from my_catalog
ALTER TABLE t PARTITION(...) SET LOCATION 'loc' // report set location with partition spec is not supported.
```
### Does this PR introduce any user-facing change?
yes. When running ALTER TABLE (set partition location), Spark fails the command if the current catalog is set to a v2 catalog, or the table name specified a v2 catalog.
### How was this patch tested?
New unit tests
Closes#26304 from imback82/alter_table_partition_loc.
Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Add ShowColumnsStatement and make SHOW COLUMNS go through the same catalog/table resolution framework of v2 commands.
### Why are the changes needed?
It's important to make all the commands have the same table resolution behavior, to avoid confusing end-users. e.g.
USE my_catalog
DESC t // success and describe the table t from my_catalog
SHOW COLUMNS FROM t // report table not found as there is no table t in the session catalog
### Does this PR introduce any user-facing change?
yes. When running SHOW COLUMNS Spark fails the command if the current catalog is set to a v2 catalog, or the table name specified a v2 catalog.
### How was this patch tested?
Unit tests.
Closes#26182 from planga82/feature/SPARK-29523_SHOW_COLUMNS_datasourceV2.
Authored-by: Unknown <soypab@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
In the PR, I propose to extract parsing of the seconds interval units to the private method `parseNanos` in `IntervalUtils` and modify the code to correctly parse the fractional part of the seconds unit of intervals in the cases:
- When the fractional part has less than 9 digits
- The seconds unit is negative
### Why are the changes needed?
The changes are needed to fix the issues:
```sql
spark-sql> select interval '10.123456 seconds';
interval 10 seconds 123 microseconds
```
The correct result must be `interval 10 seconds 123 milliseconds 456 microseconds`
```sql
spark-sql> select interval '-10.123456789 seconds';
interval -9 seconds -876 milliseconds -544 microseconds
```
but the whole interval should be negated, and the result must be `interval -10 seconds -123 milliseconds -456 microseconds`, taking into account the truncation to microseconds.
### Does this PR introduce any user-facing change?
Yes. After changes:
```sql
spark-sql> select interval '10.123456 seconds';
interval 10 seconds 123 milliseconds 456 microseconds
spark-sql> select interval '-10.123456789 seconds';
interval -10 seconds -123 milliseconds -456 microseconds
```
### How was this patch tested?
By existing and new tests in `ExpressionParserSuite`.
Closes#26313 from MaxGekk/fix-interval-nanos-parsing.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This adds a new rule, `V2ScanRelationPushDown`, to push filters and projections in to a new `DataSourceV2ScanRelation` in the optimizer. That scan is then used when converting to a physical scan node. The new relation correctly reports stats based on the scan.
To run scan pushdown before rules where stats are used, this adds a new optimizer override, `earlyScanPushDownRules` and a batch for early pushdown in the optimizer, before cost-based join reordering. The other early pushdown rule, `PruneFileSourcePartitions`, is moved into the early pushdown rule set.
This also moves pushdown helper methods from `DataSourceV2Strategy` into a util class.
### Why are the changes needed?
This is needed for DSv2 sources to supply stats for cost-based rules in the optimizer.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
This updates the implementation of stats from `DataSourceV2Relation` so tests will fail if stats are accessed before early pushdown for v2 relations.
Closes#25955 from rdblue/move-v2-pushdown.
Authored-by: Ryan Blue <blue@apache.org>
Signed-off-by: Ryan Blue <blue@apache.org>
### What changes were proposed in this pull request?
To push the built jars to maven release repository, we need to remove the 'SNAPSHOT' tag from the version name.
Made the following changes in this PR:
* Update all the `3.0.0-SNAPSHOT` version name to `3.0.0-preview`
* Update the sparkR version number check logic to allow jvm version like `3.0.0-preview`
**Please note those changes were generated by the release script in the past, but this time since we manually add tags on master branch, we need to manually apply those changes too.**
We shall revert the changes after 3.0.0-preview release passed.
### Why are the changes needed?
To make the maven release repository to accept the built jars.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
N/A
### What changes were proposed in this pull request?
MICROS_PER_MONTH = DAYS_PER_MONTH * MICROS_PER_DAY
### Why are the changes needed?
fix bug
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
add ut
Closes#26321 from yaooqinn/SPARK-29653.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This patch fixes the issue that external listeners are not initialized properly when `spark.sql.hive.metastore.jars` is set to either "maven" or custom list of jar.
("builtin" is not a case here - all jars in Spark classloader are also available in separate classloader)
The culprit is lazy initialization (lazy val or passing builder function) & thread context classloader. HiveClient leverages IsolatedClientLoader to properly load Hive and relevant libraries without issue - to not mess up with Spark classpath it uses separate classloader with leveraging thread context classloader.
But there's a messed-up case - SessionState is being initialized while HiveClient changed the thread context classloader from Spark classloader to Hive isolated one, and streaming query listeners are loaded from changed classloader while initializing SessionState.
This patch forces initializing SessionState in SparkSQLEnv to avoid such case.
### Why are the changes needed?
ClassNotFoundException could occur in spark-sql with specific configuration, as explained above.
### Does this PR introduce any user-facing change?
No, as I don't think end users assume the classloader of external listeners is only containing jars for Hive client.
### How was this patch tested?
New UT added which fails on master branch and passes with the patch.
The error message with master branch when running UT:
```
java.lang.IllegalArgumentException: Error while instantiating 'org.apache.spark.sql.hive.HiveSessionStateBuilder':;
org.apache.spark.sql.AnalysisException: java.lang.IllegalArgumentException: Error while instantiating 'org.apache.spark.sql.hive.HiveSessionStateBuilder':;
at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:109)
at org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:221)
at org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:147)
at org.apache.spark.sql.internal.SharedState.externalCatalog(SharedState.scala:137)
at org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:59)
at org.apache.spark.sql.hive.thriftserver.SparkSQLEnvSuite.$anonfun$new$2(SparkSQLEnvSuite.scala:44)
at org.apache.spark.sql.hive.thriftserver.SparkSQLEnvSuite.withSystemProperties(SparkSQLEnvSuite.scala:61)
at org.apache.spark.sql.hive.thriftserver.SparkSQLEnvSuite.$anonfun$new$1(SparkSQLEnvSuite.scala:43)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
at org.scalatest.Transformer.apply(Transformer.scala:22)
at org.scalatest.Transformer.apply(Transformer.scala:20)
at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:186)
at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:149)
at org.scalatest.FunSuiteLike.invokeWithFixture$1(FunSuiteLike.scala:184)
at org.scalatest.FunSuiteLike.$anonfun$runTest$1(FunSuiteLike.scala:196)
at org.scalatest.SuperEngine.runTestImpl(Engine.scala:286)
at org.scalatest.FunSuiteLike.runTest(FunSuiteLike.scala:196)
at org.scalatest.FunSuiteLike.runTest$(FunSuiteLike.scala:178)
at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:56)
at org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:221)
at org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:214)
at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:56)
at org.scalatest.FunSuiteLike.$anonfun$runTests$1(FunSuiteLike.scala:229)
at org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:393)
at scala.collection.immutable.List.foreach(List.scala:392)
at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:381)
at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:376)
at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:458)
at org.scalatest.FunSuiteLike.runTests(FunSuiteLike.scala:229)
at org.scalatest.FunSuiteLike.runTests$(FunSuiteLike.scala:228)
at org.scalatest.FunSuite.runTests(FunSuite.scala:1560)
at org.scalatest.Suite.run(Suite.scala:1124)
at org.scalatest.Suite.run$(Suite.scala:1106)
at org.scalatest.FunSuite.org$scalatest$FunSuiteLike$$super$run(FunSuite.scala:1560)
at org.scalatest.FunSuiteLike.$anonfun$run$1(FunSuiteLike.scala:233)
at org.scalatest.SuperEngine.runImpl(Engine.scala:518)
at org.scalatest.FunSuiteLike.run(FunSuiteLike.scala:233)
at org.scalatest.FunSuiteLike.run$(FunSuiteLike.scala:232)
at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:56)
at org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:213)
at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210)
at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208)
at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:56)
at org.scalatest.tools.SuiteRunner.run(SuiteRunner.scala:45)
at org.scalatest.tools.Runner$.$anonfun$doRunRunRunDaDoRunRun$13(Runner.scala:1349)
at org.scalatest.tools.Runner$.$anonfun$doRunRunRunDaDoRunRun$13$adapted(Runner.scala:1343)
at scala.collection.immutable.List.foreach(List.scala:392)
at org.scalatest.tools.Runner$.doRunRunRunDaDoRunRun(Runner.scala:1343)
at org.scalatest.tools.Runner$.$anonfun$runOptionallyWithPassFailReporter$24(Runner.scala:1033)
at org.scalatest.tools.Runner$.$anonfun$runOptionallyWithPassFailReporter$24$adapted(Runner.scala:1011)
at org.scalatest.tools.Runner$.withClassLoaderAndDispatchReporter(Runner.scala:1509)
at org.scalatest.tools.Runner$.runOptionallyWithPassFailReporter(Runner.scala:1011)
at org.scalatest.tools.Runner$.run(Runner.scala:850)
at org.scalatest.tools.Runner.run(Runner.scala)
at org.jetbrains.plugins.scala.testingSupport.scalaTest.ScalaTestRunner.runScalaTest2(ScalaTestRunner.java:133)
at org.jetbrains.plugins.scala.testingSupport.scalaTest.ScalaTestRunner.main(ScalaTestRunner.java:27)
Caused by: java.lang.IllegalArgumentException: Error while instantiating 'org.apache.spark.sql.hive.HiveSessionStateBuilder':
at org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$instantiateSessionState(SparkSession.scala:1054)
at org.apache.spark.sql.SparkSession.$anonfun$sessionState$2(SparkSession.scala:156)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.SparkSession.sessionState$lzycompute(SparkSession.scala:154)
at org.apache.spark.sql.SparkSession.sessionState(SparkSession.scala:151)
at org.apache.spark.sql.SparkSession.$anonfun$new$3(SparkSession.scala:105)
at scala.Option.map(Option.scala:230)
at org.apache.spark.sql.SparkSession.$anonfun$new$1(SparkSession.scala:105)
at org.apache.spark.sql.internal.SQLConf$.get(SQLConf.scala:164)
at org.apache.spark.sql.hive.client.HiveClientImpl.newState(HiveClientImpl.scala:183)
at org.apache.spark.sql.hive.client.HiveClientImpl.<init>(HiveClientImpl.scala:127)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:300)
at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:421)
at org.apache.spark.sql.hive.HiveUtils$.newClientForMetadata(HiveUtils.scala:314)
at org.apache.spark.sql.hive.HiveExternalCatalog.client$lzycompute(HiveExternalCatalog.scala:68)
at org.apache.spark.sql.hive.HiveExternalCatalog.client(HiveExternalCatalog.scala:67)
at org.apache.spark.sql.hive.HiveExternalCatalog.$anonfun$databaseExists$1(HiveExternalCatalog.scala:221)
at scala.runtime.java8.JFunction0$mcZ$sp.apply(JFunction0$mcZ$sp.java:23)
at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:99)
... 58 more
Caused by: java.lang.ClassNotFoundException: test.custom.listener.DummyQueryExecutionListener
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.spark.util.Utils$.classForName(Utils.scala:206)
at org.apache.spark.util.Utils$.$anonfun$loadExtensions$1(Utils.scala:2746)
at scala.collection.TraversableLike.$anonfun$flatMap$1(TraversableLike.scala:245)
at scala.collection.mutable.ResizableArray.foreach(ResizableArray.scala:62)
at scala.collection.mutable.ResizableArray.foreach$(ResizableArray.scala:55)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:49)
at scala.collection.TraversableLike.flatMap(TraversableLike.scala:245)
at scala.collection.TraversableLike.flatMap$(TraversableLike.scala:242)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:108)
at org.apache.spark.util.Utils$.loadExtensions(Utils.scala:2744)
at org.apache.spark.sql.util.ExecutionListenerManager.$anonfun$new$1(QueryExecutionListener.scala:83)
at org.apache.spark.sql.util.ExecutionListenerManager.$anonfun$new$1$adapted(QueryExecutionListener.scala:82)
at scala.Option.foreach(Option.scala:407)
at org.apache.spark.sql.util.ExecutionListenerManager.<init>(QueryExecutionListener.scala:82)
at org.apache.spark.sql.internal.BaseSessionStateBuilder.$anonfun$listenerManager$2(BaseSessionStateBuilder.scala:293)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.internal.BaseSessionStateBuilder.listenerManager(BaseSessionStateBuilder.scala:293)
at org.apache.spark.sql.internal.BaseSessionStateBuilder.build(BaseSessionStateBuilder.scala:320)
at org.apache.spark.sql.SparkSession$.org$apache$spark$sql$SparkSession$$instantiateSessionState(SparkSession.scala:1051)
... 80 more
```
Closes#26258 from HeartSaVioR/SPARK-29604.
Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
```
postgres=# select date '2001-09-28' + integer '7';
?column?
------------
2001-10-05
(1 row)postgres=# select integer '7';
int4
------
7
(1 row)
```
Add support for typed integer literal expression from postgreSQL.
### Why are the changes needed?
SPARK-27764 Feature Parity between PostgreSQL and Spark
### Does this PR introduce any user-facing change?
support typed integer lit in SQL
### How was this patch tested?
add uts
Closes#26291 from yaooqinn/SPARK-29629.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
Now, `RepartitionByExpression` is allowed at Dataset method `Dataset.repartition()`. But in spark sql, we do not have an equivalent functionality.
In hive, we can use `distribute by`, so it's worth to add a hint to support such function.
Similar jira [SPARK-24940](https://issues.apache.org/jira/browse/SPARK-24940)
## Why are the changes needed?
Make repartition hints consistent with repartition api .
## Does this PR introduce any user-facing change?
This pr intends to support quries below;
```
// SQL cases
- sql("SELECT /*+ REPARTITION(c) */ * FROM t")
- sql("SELECT /*+ REPARTITION(1, c) */ * FROM t")
- sql("SELECT /*+ REPARTITION_BY_RANGE(c) */ * FROM t")
- sql("SELECT /*+ REPARTITION_BY_RANGE(1, c) */ * FROM t")
```
## How was this patch tested?
UT
Closes#25464 from ulysses-you/SPARK-28746.
Lead-authored-by: ulysses <youxiduo@weidian.com>
Co-authored-by: ulysses <646303253@qq.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
In the PR, I propose to move all static methods from the `CalendarInterval` class to the `IntervalUtils` object. All those methods are rewritten from Java to Scala.
### Why are the changes needed?
- For consistency with other helper methods. Such methods were placed to the helper object `IntervalUtils`, see https://github.com/apache/spark/pull/26190
- Taking into account that `CalendarInterval` will be fully exposed to users in the future (see https://github.com/apache/spark/pull/25022), it would be nice to clean it up by moving service methods to an internal object.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
- By moved tests from `CalendarIntervalSuite` to `IntervalUtilsSuite`
- By existing test suites
Closes#26261 from MaxGekk/refactoring-calendar-interval.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Add AlterTableRecoverPartitionsStatement and make ALTER TABLE ... RECOVER PARTITIONS go through the same catalog/table resolution framework of v2 commands.
### Why are the changes needed?
It's important to make all the commands have the same table resolution behavior, to avoid confusing end-users. e.g.
```
USE my_catalog
DESC t // success and describe the table t from my_catalog
ALTER TABLE t RECOVER PARTITIONS // report table not found as there is no table t in the session catalog
```
### Does this PR introduce any user-facing change?
Yes. When running ALTER TABLE ... RECOVER PARTITIONS Spark fails the command if the current catalog is set to a v2 catalog, or the table name specified a v2 catalog.
### How was this patch tested?
Unit tests.
Closes#26269 from huaxingao/spark-29612.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
To push the built jars to maven release repository, we need to remove the 'SNAPSHOT' tag from the version name.
Made the following changes in this PR:
* Update all the `3.0.0-SNAPSHOT` version name to `3.0.0-preview`
* Update the PySpark version from `3.0.0.dev0` to `3.0.0`
**Please note those changes were generated by the release script in the past, but this time since we manually add tags on master branch, we need to manually apply those changes too.**
We shall revert the changes after 3.0.0-preview release passed.
### Why are the changes needed?
To make the maven release repository to accept the built jars.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
N/A
Closes#26243 from jiangxb1987/3.0.0-preview-prepare.
Lead-authored-by: Xingbo Jiang <xingbo.jiang@databricks.com>
Co-authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Xingbo Jiang <xingbo.jiang@databricks.com>
### What changes were proposed in this pull request?
There're some issues observed in `HiveUserDefinedTypeSuite."Support UDT in Hive UDF"`:
1) Neither function (TestUDF) nor test take "nullable" point column into account.
2) ExamplePointUDT. sqlType is ArrayType which doesn't provide information how many elements are expected. RandomDataGenerator may provide less elements than needed.
This patch fixes `HiveUserDefinedTypeSuite."Support UDT in Hive UDF"` to change the type of "point" column to be non-nullable, as well as not use RandomDataGenerator to create row for UDT backed by ArrayType.
### Why are the changes needed?
CI builds are failing in high occurrences.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Manually tested by running tests locally multiple times.
Closes#26287 from HeartSaVioR/SPARK-28158-FOLLOWUP.
Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR adds `DROP NAMESPACE` support for V2 catalogs.
### Why are the changes needed?
Currently, you cannot drop namespaces for v2 catalogs.
### Does this PR introduce any user-facing change?
The user can now perform the following:
```SQL
CREATE NAMESPACE mycatalog.ns
DROP NAMESPACE mycatalog.ns
SHOW NAMESPACES IN mycatalog # Will show no namespaces
```
to drop a namespace `ns` inside `mycatalog` V2 catalog.
### How was this patch tested?
Added unit tests.
Closes#26262 from imback82/drop_namespace.
Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Add LoadDataStatement and make LOAD DATA INTO TABLE go through the same catalog/table resolution framework of v2 commands.
### Why are the changes needed?
It's important to make all the commands have the same table resolution behavior, to avoid confusing end-users. e.g.
```
USE my_catalog
DESC t // success and describe the table t from my_catalog
LOAD DATA INPATH 'filepath' INTO TABLE t // report table not found as there is no table t in the session catalog
```
### Does this PR introduce any user-facing change?
yes. When running LOAD DATA INTO TABLE, Spark fails the command if the current catalog is set to a v2 catalog, or the table name specified a v2 catalog.
### How was this patch tested?
Unit tests.
Closes#26178 from viirya/SPARK-29521.
Lead-authored-by: Liang-Chi Hsieh <liangchi@uber.com>
Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
In this PR, extend the support of pagination to session table in `JDBC/PDBC` .
### Why are the changes needed?
Some times we may connect a lot client and there a many session info shown in session tab.
make it can be paged for better view.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Manuel verify.
After pr:
<img width="1440" alt="Screen Shot 2019-10-25 at 4 19 27 PM" src="https://user-images.githubusercontent.com/46485123/67555133-50ae9900-f743-11e9-8724-9624a691f232.png">
<img width="1434" alt="Screen Shot 2019-10-25 at 4 19 38 PM" src="https://user-images.githubusercontent.com/46485123/67555165-5906d400-f743-11e9-819e-73f86a333dd3.png">
Closes#26253 from AngersZhuuuu/SPARK-29599.
Lead-authored-by: angerszhu <angers.zhu@gmail.com>
Co-authored-by: AngersZhuuuu <angers.zhu@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
After this PR, we can create and register Hive UDFs to accept UDT type, like `VectorUDT` and `MatrixUDT`. These UDTs are widely used in Spark machine learning.
## How was this patch tested?
add new ut
Closes#24961 from uncleGen/SPARK-28158.
Authored-by: uncleGen <hustyugm@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
### Why are the changes needed?
When make the `LocalShuffleReaderExec` to leaf node, there exists a potential issue: the leaf node will hide the running query stage and make the unfinished query stage as finished query stage when creating its parent query stage.
This PR make the leaf node to unary node.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Existing tests
Closes#26250 from JkSelf/updateLeafNodeofLocalReaderToUnaryExecNode.
Authored-by: jiake <ke.a.jia@intel.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This pr is to fix wrong code to check parameter lengths of split methods in `subexpressionEliminationForWholeStageCodegen`.
### Why are the changes needed?
Bug fix.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes#26267 from maropu/SPARK-29008-FOLLOWUP.
Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
The `DateTimeUtilsSuite` and `TimestampFormatterSuite` assume constant time difference between `timestamp'yesterday'`, `timestamp'today'` and `timestamp'tomorrow'` which is wrong on daylight switching day - day length can be 23 or 25 hours. In the PR, I propose to use Java 8 time API to calculate instances of `yesterday` and `tomorrow` timestamps.
### Why are the changes needed?
The changes fix test failures and make the tests tolerant to daylight time switching.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
By existing test suites `DateTimeUtilsSuite` and `TimestampFormatterSuite`.
Closes#26273 from MaxGekk/midnight-tolerant.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
In the PR https://github.com/apache/spark/pull/26215, we supported pagination for sqlstats table in JDBC/ODBC server page. In this PR, we are extending the support of pagination to sqlstats session table by making use of existing pagination classes in https://github.com/apache/spark/pull/26215.
### Why are the changes needed?
Support pagination for sqlsessionstats table in JDBC/ODBC server page in the WEBUI. It will easier for user to analyse the table and it may fix the potential issues like oom while loading the page, that may occur similar to the SQL page (refer #22645)
### Does this PR introduce any user-facing change?
There will be no change in the sqlsessionstats table in JDBC/ODBC server page execpt pagination support.
### How was this patch tested?
Manually verified.
Before:
![Screenshot 2019-10-24 at 11 32 27 PM](https://user-images.githubusercontent.com/23054875/67512507-96715000-f6b6-11e9-9f1f-ab1877eb24e6.png)
After:
![Screenshot 2019-10-24 at 10 58 53 PM](https://user-images.githubusercontent.com/23054875/67512314-295dba80-f6b6-11e9-9e3e-dd50c6e62fe9.png)
Closes#26246 from shahidki31/SPARK_29589.
Authored-by: shahid <shahidki31@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request?
Reset the `WritableColumnVector` when getting "next" ColumnarBatch in `RowToColumnarExec`
### Why are the changes needed?
When converting `Iterator[InternalRow]` to `Iterator[ColumnarBatch]`, the vectors used to create a new `ColumnarBatch` should be reset in the iterator's "next()" method.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
N/A
Closes#26137 from rongma1997/reset-WritableColumnVector.
Authored-by: rongma1997 <rong.ma@intel.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
```
hive> select version();
OK
3.1.1 rf4e0529634b6231a0072295da48af466cf2f10b7
Time taken: 2.113 seconds, Fetched: 1 row(s)
```
### Why are the changes needed?
From hive behavior and I guess it is useful for debugging and developing etc.
### Does this PR introduce any user-facing change?
add a misc func
### How was this patch tested?
add ut
Closes#26209 from yaooqinn/SPARK-29554.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This is a follow-up of https://github.com/apache/spark/pull/24557 to fix `since` version.
### Why are the changes needed?
This is found during 3.0.0-preview preparation.
The version will be exposed to our SQL document like the following. We had better fix this.
- https://spark.apache.org/docs/latest/api/sql/#array_min
### Does this PR introduce any user-facing change?
Yes. It's exposed at `DESC FUNCTION EXTENDED` SQL command and SQL doc, but this is new at 3.0.0.
### How was this patch tested?
Manual.
```
spark-sql> DESC FUNCTION EXTENDED min_by;
Function: min_by
Class: org.apache.spark.sql.catalyst.expressions.aggregate.MinBy
Usage: min_by(x, y) - Returns the value of `x` associated with the minimum value of `y`.
Extended Usage:
Examples:
> SELECT min_by(x, y) FROM VALUES (('a', 10)), (('b', 50)), (('c', 20)) AS tab(x, y);
a
Since: 3.0.0
```
Closes#26264 from dongjoon-hyun/SPARK-27653.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Add ShowCreateTableStatement and make SHOW CREATE TABLE go through the same catalog/table resolution framework of v2 commands.
### Why are the changes needed?
It's important to make all the commands have the same table resolution behavior, to avoid confusing end-users. e.g.
```
USE my_catalog
DESC t // success and describe the table t from my_catalog
SHOW CREATE TABLE t // report table not found as there is no table t in the session catalog
```
### Does this PR introduce any user-facing change?
yes. When running SHOW CREATE TABLE, Spark fails the command if the current catalog is set to a v2 catalog, or the table name specified a v2 catalog.
### How was this patch tested?
Unit tests.
Closes#26184 from viirya/SPARK-29527.
Lead-authored-by: Liang-Chi Hsieh <liangchi@uber.com>
Co-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This patch adds the functionality to measure records being written for JDBC writer. In reality, the value is meant to be a number of records being updated from queries, as per JDBC spec it will return updated count.
### Why are the changes needed?
Output metrics for JDBC writer are missing now. The value of "bytesWritten" is also missing, but we can't measure it from JDBC API.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Unit test added.
Closes#26109 from HeartSaVioR/SPARK-29461.
Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
First, a bit of background on the code being changed. The current code tracks
metric updates for each task, recording which metrics the task is monitoring
and the last update value.
Once a SQL execution finishes, then the metrics for all the stages are
aggregated, by building a list with all (metric ID, value) pairs collected
for all tasks in the stages related to the execution, then grouping by metric
ID, and then calculating the values shown in the UI.
That is full of inefficiencies:
- in normal operation, all tasks will be tracking and updating the same
metrics. So recording the metric IDs per task is wasteful.
- tracking by task means we might be double-counting values if you have
speculative tasks (as a comment in the code mentions).
- creating a list of (metric ID, value) is extremely inefficient, because now
you have a huge map in memory storing boxed versions of the metric IDs and
values.
- same thing for the aggregation part, where now a Seq is built with the values
for each metric ID.
The end result is that for large queries, this code can become both really
slow, thus affecting the processing of events, and memory hungry.
The updated code changes the approach to the following:
- stages track metrics by their ID; this means the stage tracking code
naturally groups values, making aggregation later simpler.
- each metric ID being tracked uses a long array matching the number of
partitions of the stage; this means that it's cheap to update the value of
the metric once a task ends.
- when aggregating, custom code just concatenates the arrays corresponding to
the matching metric IDs; this is cheaper than the previous, boxing-heavy
approach.
The end result is that the listener uses about half as much memory as before
for tracking metrics, since it doesn't need to track metric IDs per task.
I captured heap dumps with the old and the new code during metric aggregation
in the listener, for an execution with 3 stages, 100k tasks per stage, 50
metrics updated per task. The dumps contained just reachable memory - so data
kept by the listener plus the variables in the aggregateMetrics() method.
With the old code, the thread doing aggregation references >1G of memory - and
that does not include temporary data created by the "groupBy" transformation
(for which the intermediate state is not referenced in the aggregation method).
The same thread with the new code references ~250M of memory. The old code uses
about ~250M to track all the metric values for that execution, while the new
code uses about ~130M. (Note the per-thread numbers include the amount used to
track the metrics - so, e.g., in the old case, aggregation was referencing
about ~750M of temporary data.)
I'm also including a small benchmark (based on the Benchmark class) so that we
can measure how much changes to this code affect performance. The benchmark
contains some extra code to measure things the normal Benchmark class does not,
given that the code under test does not really map that well to the
expectations of that class.
Running with the old code (I removed results that don't make much
sense for this benchmark):
```
[info] Java HotSpot(TM) 64-Bit Server VM 1.8.0_181-b13 on Linux 4.15.0-66-generic
[info] Intel(R) Core(TM) i7-6820HQ CPU 2.70GHz
[info] metrics aggregation (50 metrics, 100k tasks per stage): Best Time(ms) Avg Time(ms)
[info] --------------------------------------------------------------------------------------
[info] 1 stage(s) 2113 2118
[info] 2 stage(s) 4172 4392
[info] 3 stage(s) 7755 8460
[info]
[info] Stage Count Stage Proc. Time Aggreg. Time
[info] 1 614 1187
[info] 2 620 2480
[info] 3 718 5069
```
With the new code:
```
[info] Java HotSpot(TM) 64-Bit Server VM 1.8.0_181-b13 on Linux 4.15.0-66-generic
[info] Intel(R) Core(TM) i7-6820HQ CPU 2.70GHz
[info] metrics aggregation (50 metrics, 100k tasks per stage): Best Time(ms) Avg Time(ms)
[info] --------------------------------------------------------------------------------------
[info] 1 stage(s) 727 886
[info] 2 stage(s) 1722 1983
[info] 3 stage(s) 2752 3013
[info]
[info] Stage Count Stage Proc. Time Aggreg. Time
[info] 1 408 177
[info] 2 389 423
[info] 3 372 660
```
So the new code is faster than the old when processing task events, and about
an order of maginute faster when aggregating metrics.
Note this still leaves room for improvement; for example, using the above
measurements, 600ms is still a huge amount of time to spend in an event
handler. But I'll leave further enhancements for a separate change.
Tested with benchmarking code + existing unit tests.
Closes#26218 from vanzin/SPARK-29562.
Authored-by: Marcelo Vanzin <vanzin@cloudera.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Track timing info for each rule in optimization phase using `QueryPlanningTracker` in Structured Streaming
### Why are the changes needed?
In Structured Streaming we only track rule info in analysis phase, not in optimization phase.
### Does this PR introduce any user-facing change?
No
Closes#25914 from wenxuanguan/spark-29227.
Authored-by: wenxuanguan <choose_home@126.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Add UncacheTableStatement and make UNCACHE TABLE go through the same catalog/table resolution framework of v2 commands.
### Why are the changes needed?
It's important to make all the commands have the same table resolution behavior, to avoid confusing end-users. e.g.
```
USE my_catalog
DESC t // success and describe the table t from my_catalog
UNCACHE TABLE t // report table not found as there is no table t in the session catalog
```
### Does this PR introduce any user-facing change?
yes. When running UNCACHE TABLE, Spark fails the command if the current catalog is set to a v2 catalog, or the table name specified a v2 catalog.
### How was this patch tested?
New unit tests
Closes#26237 from imback82/uncache_table.
Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Remove the requirement of fetch_size>=0 from JDBCOptions to allow negative fetch size.
### Why are the changes needed?
Namely, to allow data fetch in stream manner (row-by-row fetch) against MySQL database.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Unit test (JDBCSuite)
This closes#26230 .
Closes#26244 from fuwhu/SPARK-21287-FIX.
Authored-by: fuwhu <bestwwg@163.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
# What changes were proposed in this pull request?
Add description for ignoreNullFields, which is commited in #26098 , in DataFrameWriter and readwriter.py.
Enable user to use ignoreNullFields in pyspark.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
run unit tests
Closes#26227 from stczwd/json-generator-doc.
Authored-by: stczwd <qcsd2011@163.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Only use antlr4 to parse the interval string, and remove the duplicated parsing logic from `CalendarInterval`.
### Why are the changes needed?
Simplify the code and fix inconsistent behaviors.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Pass the Jenkins with the updated test cases.
Closes#26190 from cloud-fan/parser.
Lead-authored-by: Wenchen Fan <wenchen@databricks.com>
Co-authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Recent timezone definition changes in very new JDK 8 (and beyond) releases cause test failures. The below was observed on JDK 1.8.0_232. As before, the easy fix is to allow for these inconsequential variations in test results due to differing definition of timezones.
### Why are the changes needed?
Keeps test passing on the latest JDK releases.
### Does this PR introduce any user-facing change?
None
### How was this patch tested?
Existing tests
Closes#26236 from srowen/SPARK-29578.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request?
Supports pagination for SQL Statisitcs table in the JDBC/ODBC tab using existing Spark pagination framework.
### Why are the changes needed?
It will easier for user to analyse the table and it may fix the potential issues like oom while loading the page, that may occur similar to the SQL page (refer https://github.com/apache/spark/pull/22645)
### Does this PR introduce any user-facing change?
There will be no change in the `SQLStatistics` table in JDBC/ODBC server page execpt pagination support.
### How was this patch tested?
Manually verified.
Before PR:
![Screenshot 2019-10-22 at 11 37 29 PM](https://user-images.githubusercontent.com/23054875/67316080-73636680-f525-11e9-91bc-ff7e06e3736d.png)
After PR:
![Screenshot 2019-10-22 at 10 33 00 PM](https://user-images.githubusercontent.com/23054875/67316092-778f8400-f525-11e9-93f8-1e2815abd66f.png)
Closes#26215 from shahidki31/jdbcPagination.
Authored-by: shahid <shahidki31@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request?
Support SparkSQL use iN/EXISTS with subquery in JOIN condition.
### Why are the changes needed?
Support SQL use iN/EXISTS with subquery in JOIN condition.
### Does this PR introduce any user-facing change?
This PR is for enable user use subquery in `JOIN`'s ON condition. such as we have create three table
```
CREATE TABLE A(id String);
CREATE TABLE B(id String);
CREATE TABLE C(id String);
```
we can do query like :
```
SELECT A.id from A JOIN B ON A.id = B.id and A.id IN (select C.id from C)
```
### How was this patch tested?
ADDED UT
Closes#25854 from AngersZhuuuu/SPARK-29145.
Lead-authored-by: angerszhu <angers.zhu@gmail.com>
Co-authored-by: AngersZhuuuu <angers.zhu@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
Reimplement the iterator in UnsafeExternalRowSorter in database style. This can be done by reusing the `RowIterator` in our code base.
### Why are the changes needed?
During the job in #26164, after involving a var `isReleased` in `hasNext`, there's possible that `isReleased` is false when calling `hasNext`, but it becomes true before calling `next`. A safer way is using database-style iterator: `advanceNext` and `getRow`.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Existing UT.
Closes#26229 from xuanyuanking/SPARK-21492-follow-up.
Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Add CacheTableStatement and make CACHE TABLE go through the same catalog/table resolution framework of v2 commands.
### Why are the changes needed?
It's important to make all the commands have the same table resolution behavior, to avoid confusing end-users. e.g.
```
USE my_catalog
DESC t // success and describe the table t from my_catalog
CACHE TABLE t // report table not found as there is no table t in the session catalog
```
### Does this PR introduce any user-facing change?
yes. When running CACHE TABLE, Spark fails the command if the current catalog is set to a v2 catalog, or the table name specified a v2 catalog.
### How was this patch tested?
Unit tests.
Closes#26179 from viirya/SPARK-29522.
Lead-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Co-authored-by: Liang-Chi Hsieh <liangchi@uber.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This is a follow-up of #24052 to correct assert condition.
### Why are the changes needed?
To test IllegalArgumentException condition..
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Manual Test (during fixing of SPARK-29453 find this issue)
Closes#26234 from 07ARB/SPARK-29571.
Authored-by: 07ARB <ankitrajboudh@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This is a follow-up of https://github.com/apache/spark/pull/26189 to regenerate the result on EC2.
### Why are the changes needed?
This will be used for the other PR reviews.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
N/A.
Closes#26233 from dongjoon-hyun/SPARK-29533.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: DB Tsai <d_tsai@apple.com>
### What changes were proposed in this pull request?
`OptimizeLocalShuffleReader` rule is very conservative and gives up optimization as long as there are extra shuffles introduced. It's very likely that most of the added local shuffle readers are fine and only one introduces extra shuffle.
However, it's very hard to make `OptimizeLocalShuffleReader` optimal, a simple workaround is to run this rule again right before executing a query stage.
### Why are the changes needed?
Optimize more shuffle reader to local shuffle reader.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
existing ut
Closes#26207 from JkSelf/resolve-multi-joins-issue.
Authored-by: jiake <ke.a.jia@intel.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Add RefreshTableStatement and make REFRESH TABLE go through the same catalog/table resolution framework of v2 commands.
### Why are the changes needed?
It's important to make all the commands have the same table resolution behavior, to avoid confusing end-users. e.g.
```
USE my_catalog
DESC t // success and describe the table t from my_catalog
REFRESH TABLE t // report table not found as there is no table t in the session catalog
```
### Does this PR introduce any user-facing change?
yes. When running REFRESH TABLE, Spark fails the command if the current catalog is set to a v2 catalog, or the table name specified a v2 catalog.
### How was this patch tested?
New unit tests
Closes#26183 from imback82/refresh_table.
Lead-authored-by: Terry Kim <yuminkim@gmail.com>
Co-authored-by: Terry Kim <terryk@terrys-mbp-2.lan>
Signed-off-by: Liang-Chi Hsieh <liangchi@uber.com>
### What changes were proposed in this pull request?
As described in [SPARK-29542](https://issues.apache.org/jira/browse/SPARK-29542) , the descriptions of `spark.sql.files.*` are confused.
In this PR, I make their descriptions be clearly.
### Why are the changes needed?
It makes the descriptions of `spark.sql.files.*` be clearly.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Existing UT.
Closes#26200 from turboFei/SPARK-29542-partition-maxSize.
Authored-by: turbofei <fwang12@ebay.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This moves the tracking of active queries from a per SparkSession state, to the shared SparkSession for better safety in isolated Spark Session environments.
### Why are the changes needed?
We have checks to prevent the restarting of the same stream on the same spark session, but we can actually make that better in multi-tenant environments by actually putting that state in the SharedState instead of SessionState. This would allow a more comprehensive check for multi-tenant clusters.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Added tests to StreamingQueryManagerSuite
Closes#26018 from brkyvz/sharedStreamingQueryManager.
Lead-authored-by: Burak Yavuz <burak@databricks.com>
Co-authored-by: Burak Yavuz <brkyvz@gmail.com>
Signed-off-by: Burak Yavuz <brkyvz@gmail.com>
### What changes were proposed in this pull request?
This PR adds `CREATE NAMESPACE` support for V2 catalogs.
### Why are the changes needed?
Currently, you cannot explicitly create namespaces for v2 catalogs.
### Does this PR introduce any user-facing change?
The user can now perform the following:
```SQL
CREATE NAMESPACE mycatalog.ns
```
to create a namespace `ns` inside `mycatalog` V2 catalog.
### How was this patch tested?
Added unit tests.
Closes#26166 from imback82/create_namespace.
Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Add ShowPartitionsStatement and make SHOW PARTITIONS go through the same catalog/table resolution framework of v2 commands.
### Why are the changes needed?
It's important to make all the commands have the same table resolution behavior, to avoid confusing end-users.
### Does this PR introduce any user-facing change?
Yes. When running SHOW PARTITIONS, Spark fails the command if the current catalog is set to a v2 catalog, or the table name specified a v2 catalog.
### How was this patch tested?
Unit tests.
Closes#26198 from huaxingao/spark-29539.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Liang-Chi Hsieh <liangchi@uber.com>
### What changes were proposed in this pull request?
Add TruncateTableStatement and make TRUNCATE TABLE go through the same catalog/table resolution framework of v2 commands.
### Why are the changes needed?
It's important to make all the commands have the same table resolution behavior, to avoid confusing end-users. e.g.
```
USE my_catalog
DESC t // success and describe the table t from my_catalog
TRUNCATE TABLE t // report table not found as there is no table t in the session catalog
```
### Does this PR introduce any user-facing change?
yes. When running TRUNCATE TABLE, Spark fails the command if the current catalog is set to a v2 catalog, or the table name specified a v2 catalog.
### How was this patch tested?
Unit tests.
Closes#26174 from viirya/SPARK-29517.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
We shall have a new mechanism that the downstream operators may notify its parents that they may release the output data stream. In this PR, we implement the mechanism as below:
- Add function named `cleanupResources` in SparkPlan, which default call children's `cleanupResources` function, the operator which need a resource cleanup should rewrite this with the self cleanup and also call `super.cleanupResources`, like SortExec in this PR.
- Add logic support on the trigger side, in this PR is SortMergeJoinExec, which make sure and call the `cleanupResources` to do the cleanup job for all its upstream(children) operator.
### Why are the changes needed?
Bugfix for SortMergeJoin memory leak, and implement a general framework for SparkPlan resource cleanup.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
UT: Add new test suite JoinWithResourceCleanSuite to check both standard and code generation scenario.
Integrate Test: Test with driver/executor default memory set 1g, local mode 10 thread. The below test(thanks taosaildrone for providing this test [here](https://github.com/apache/spark/pull/23762#issuecomment-463303175)) will pass with this PR.
```
from pyspark.sql.functions import rand, col
spark.conf.set("spark.sql.join.preferSortMergeJoin", "true")
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)
# spark.conf.set("spark.sql.sortMergeJoinExec.eagerCleanupResources", "true")
r1 = spark.range(1, 1001).select(col("id").alias("timestamp1"))
r1 = r1.withColumn('value', rand())
r2 = spark.range(1000, 1001).select(col("id").alias("timestamp2"))
r2 = r2.withColumn('value2', rand())
joined = r1.join(r2, r1.timestamp1 == r2.timestamp2, "inner")
joined = joined.coalesce(1)
joined.explain()
joined.show()
```
Closes#26164 from xuanyuanking/SPARK-21492.
Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR test `ThriftServerQueryTestSuite` in an asynchronous way.
### Why are the changes needed?
The default value of `spark.sql.hive.thriftServer.async` is `true`.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
```
build/sbt "hive-thriftserver/test-only *.ThriftServerQueryTestSuite" -Phive-thriftserver
build/mvn -Dtest=none -DwildcardSuites=org.apache.spark.sql.hive.thriftserver.ThriftServerQueryTestSuite test -Phive-thriftserver
```
Closes#26172 from wangyum/SPARK-29516.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Yuming Wang <wgyumg@gmail.com>
### What changes were proposed in this pull request?
This PR remove unnecessary orc version and hive version in doc.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
N/A.
Closes#26146 from denglingang/SPARK-24576.
Lead-authored-by: denglingang <chitin1027@gmail.com>
Co-authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
As I have comment in [SPARK-29516](https://github.com/apache/spark/pull/26172#issuecomment-544364977)
SparkSession.sql() method parse process not under current sparksession's conf, so some configuration about parser is not valid in multi-thread situation.
In this pr, we add a SQLConf parameter to AbstractSqlParser and initial it with SessionState's conf.
Then for each SparkSession's parser process. It will use's it's own SessionState's SQLConf and to be thread safe
### Why are the changes needed?
Fix bug
### Does this PR introduce any user-facing change?
NO
### How was this patch tested?
NO
Closes#26187 from AngersZhuuuu/SPARK-29530.
Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Only invoke `checkAndGlobPathIfNecessary()` when we have to use `InMemoryFileIndex`.
### Why are the changes needed?
Avoid unnecessary function invocation.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Pass Jenkins.
Closes#26196 from Ngone51/dev-avoid-unnecessary-invocation-on-globpath.
Authored-by: wuyi <ngone_5451@163.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request?
I extended `ExtractBenchmark` to support the `INTERVAL` type of the `source` parameter of the `date_part` function.
### Why are the changes needed?
- To detect performance issues while changing implementation of the `date_part` function in the future.
- To find out current performance bottlenecks in `date_part` for the `INTERVAL` type
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
By running the benchmark and print out produced values per each `field` value.
Closes#26175 from MaxGekk/extract-interval-benchmark.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Added new benchmark `IntervalBenchmark` to measure performance of interval related functions. In the PR, I added benchmarks for casting strings to interval. In particular, interval strings with `interval` prefix and without it because there is special code for this da576a737c/common/unsafe/src/main/java/org/apache/spark/unsafe/types/CalendarInterval.java (L100-L103) . And also I added benchmarks for different number of units in interval strings, for example 1 unit is `interval 10 years`, 2 units w/o interval is `10 years 5 months`, and etc.
### Why are the changes needed?
- To find out current performance issues in casting to intervals
- The benchmark can be used while refactoring/re-implementing `CalendarInterval.fromString()` or `CalendarInterval.fromCaseInsensitiveString()`.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
By running the benchmark via the command:
```shell
SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.execution.benchmark.IntervalBenchmark"
```
Closes#26189 from MaxGekk/interval-from-string-benchmark.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This pr refine the code in ThriftServerQueryTestSuite.blackList to reuse the black list of SQLQueryTestSuite instead of duplicating all test cases from SQLQueryTestSuite.blackList.
### Why are the changes needed?
To reduce code duplication.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
N/A
Closes#26188 from fuwhu/SPARK-TBD.
Authored-by: fuwhu <bestwwg@163.com>
Signed-off-by: Yuming Wang <wgyumg@gmail.com>
### What changes were proposed in this pull request?
`CatalogTable` to `HiveTable` will change the table's ownership. How to reproduce:
```scala
import org.apache.spark.sql.catalyst.TableIdentifier
import org.apache.spark.sql.catalyst.catalog.{CatalogStorageFormat, CatalogTable, CatalogTableType}
import org.apache.spark.sql.types.{LongType, StructType}
val identifier = TableIdentifier("spark_29498", None)
val owner = "SPARK-29498"
val newTable = CatalogTable(
identifier,
tableType = CatalogTableType.EXTERNAL,
storage = CatalogStorageFormat(
locationUri = None,
inputFormat = None,
outputFormat = None,
serde = None,
compressed = false,
properties = Map.empty),
owner = owner,
schema = new StructType().add("i", LongType, false),
provider = Some("hive"))
spark.sessionState.catalog.createTable(newTable, false)
// The owner is not SPARK-29498
println(spark.sessionState.catalog.getTableMetadata(identifier).owner)
```
This PR makes it set the `HiveTable`'s owner to `CatalogTable`'s owner if it's owner is not empty when converting `CatalogTable` to `HiveTable`.
### Why are the changes needed?
We should not change the ownership of the table when converting `CatalogTable` to `HiveTable`.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
unit test
Closes#26160 from wangyum/SPARK-29498.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
```
bit_and(expression) -- The bitwise AND of all non-null input values, or null if none
bit_or(expression) -- The bitwise OR of all non-null input values, or null if none
```
More details:
https://www.postgresql.org/docs/9.3/functions-aggregate.html
### Why are the changes needed?
Postgres, Mysql and many other popular db support them.
### Does this PR introduce any user-facing change?
add two bit agg
### How was this patch tested?
add ut
Closes#26155 from yaooqinn/SPARK-27879.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR fix Fix the associated location already exists in `SQLQueryTestSuite`:
```
build/sbt "~sql/test-only *SQLQueryTestSuite -- -z postgreSQL/join.sql"
...
[info] - postgreSQL/join.sql *** FAILED *** (35 seconds, 420 milliseconds)
[info] postgreSQL/join.sql
[info] Expected "[]", but got "[org.apache.spark.sql.AnalysisException
[info] Can not create the managed table('`default`.`tt3`'). The associated location('file:/root/spark/sql/core/spark-warehouse/org.apache.spark.sql.SQLQueryTestSuite/tt3') already exists.;]" Result did not match for query #108
```
### Why are the changes needed?
Fix bug.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
N/A
Closes#26181 from wangyum/TestError.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Add RepairTableStatement and make REPAIR TABLE go through the same catalog/table resolution framework of v2 commands.
### Why are the changes needed?
It's important to make all the commands have the same table resolution behavior, to avoid confusing end-users. e.g.
```
USE my_catalog
DESC t // success and describe the table t from my_catalog
MSCK REPAIR TABLE t // report table not found as there is no table t in the session catalog
```
### Does this PR introduce any user-facing change?
yes. When running MSCK REPAIR TABLE, Spark fails the command if the current catalog is set to a v2 catalog, or the table name specified a v2 catalog.
### How was this patch tested?
New unit tests
Closes#26168 from imback82/repair_table.
Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Liang-Chi Hsieh <liangchi@uber.com>
### What changes were proposed in this pull request?
This is a followup of https://github.com/apache/spark/pull/25241 .
The typed interval expression should fail for invalid format.
### Why are the changes needed?
Te be consistent with the typed timestamp/date expression
### Does this PR introduce any user-facing change?
Yes. But this feature is not released yet.
### How was this patch tested?
updated test
Closes#26151 from cloud-fan/bug.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Yuming Wang <wgyumg@gmail.com>
### What changes were proposed in this pull request?
* Adding an additional check in `stringToTimestamp` to handle cases where the input has trailing ':'
* Added a test to make sure this works.
### Why are the changes needed?
In a couple of scenarios while converting from String to Timestamp `DateTimeUtils.stringToTimestamp` throws an array out of bounds exception if there is trailing ':'. The behavior of this method requires it to return `None` in case the format of the string is incorrect.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Added a test in the `DateTimeTestUtils` suite to test if my fix works.
Closes#26143 from rahulsmahadev/SPARK-29494.
Lead-authored-by: Rahul Mahadev <rahul.mahadev@databricks.com>
Co-authored-by: Rahul Shivu Mahadev <51690557+rahulsmahadev@users.noreply.github.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request?
Current Spark SQL `SHOW FUNCTIONS` don't show `!=`, `<>`, `between`, `case`
But these expressions is truly functions. We should show it in SQL `SHOW FUNCTIONS`
### Why are the changes needed?
SHOW FUNCTIONS show '!=', '<>' , 'between', 'case'
### Does this PR introduce any user-facing change?
SHOW FUNCTIONS show '!=', '<>' , 'between', 'case'
### How was this patch tested?
UT
Closes#26053 from AngersZhuuuu/SPARK-29379.
Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
The `date_part()` function can accept the `source` parameter of the `INTERVAL` type (`CalendarIntervalType`). The following values of the `field` parameter are supported:
- `"MILLENNIUM"` (`"MILLENNIA"`, `"MIL"`, `"MILS"`) - number of millenniums in the given interval. It is `YEAR / 1000`.
- `"CENTURY"` (`"CENTURIES"`, `"C"`, `"CENT"`) - number of centuries in the interval calculated as `YEAR / 100`.
- `"DECADE"` (`"DECADES"`, `"DEC"`, `"DECS"`) - decades in the `YEAR` part of the interval calculated as `YEAR / 10`.
- `"YEAR"` (`"Y"`, `"YEARS"`, `"YR"`, `"YRS"`) - years in a values of `CalendarIntervalType`. It is `MONTHS / 12`.
- `"QUARTER"` (`"QTR"`) - a quarter of year calculated as `MONTHS / 3 + 1`
- `"MONTH"` (`"MON"`, `"MONS"`, `"MONTHS"`) - the months part of the interval calculated as `CalendarInterval.months % 12`
- `"DAY"` (`"D"`, `"DAYS"`) - total number of days in `CalendarInterval.microseconds`
- `"HOUR"` (`"H"`, `"HOURS"`, `"HR"`, `"HRS"`) - the hour part of the interval.
- `"MINUTE"` (`"M"`, `"MIN"`, `"MINS"`, `"MINUTES"`) - the minute part of the interval.
- `"SECOND"` (`"S"`, `"SEC"`, `"SECONDS"`, `"SECS"`) - the seconds part with fractional microsecond part.
- `"MILLISECONDS"` (`"MSEC"`, `"MSECS"`, `"MILLISECON"`, `"MSECONDS"`, `"MS"`) - the millisecond part of the interval with fractional microsecond part.
- `"MICROSECONDS"` (`"USEC"`, `"USECS"`, `"USECONDS"`, `"MICROSECON"`, `"US"`) - the total number of microseconds in the `second`, `millisecond` and `microsecond` parts of the given interval.
- `"EPOCH"` - the total number of seconds in the interval including the fractional part with microsecond precision. Here we assume 365.25 days per year (leap year every four years).
For example:
```sql
> SELECT date_part('days', interval 1 year 10 months 5 days);
5
> SELECT date_part('seconds', interval 30 seconds 1 milliseconds 1 microseconds);
30.001001
```
### Why are the changes needed?
To maintain feature parity with PostgreSQL (https://www.postgresql.org/docs/11/functions-datetime.html#FUNCTIONS-DATETIME-EXTRACT)
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
- Added new test suite `IntervalExpressionsSuite`
- Add new test cases to `date_part.sql`
Closes#25981 from MaxGekk/extract-from-intervals.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
A followup of [#25295](https://github.com/apache/spark/pull/25295).
1) change the logWarning to logDebug in `OptimizeLocalShuffleReader`.
2) update the test to check whether query stage reuse can work well with local shuffle reader.
### Why are the changes needed?
make code robust
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
existing tests
Closes#26157 from JkSelf/followup-25295.
Authored-by: jiake <ke.a.jia@intel.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
The handling of the catalog across plans should be as follows ([SPARK-29014](https://issues.apache.org/jira/browse/SPARK-29014)):
* The *current* catalog should be used when no catalog is specified
* The default catalog is the catalog *current* is initialized to
* If the *default* catalog is not set, then *current* catalog is the built-in Spark session catalog.
This PR addresses the issue where *current* catalog usage is not followed as describe above.
### Why are the changes needed?
It is a bug as described in the previous section.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Unit tests added.
Closes#26120 from imback82/cleanup_catalog.
Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Add `AnalyzeTableStatement` and `AnalyzeColumnStatement`, and make ANALYZE TABLE go through the same catalog/table resolution framework of v2 commands.
### Why are the changes needed?
It's important to make all the commands have the same table resolution behavior, to avoid confusing end-users. e.g.
```
USE my_catalog
DESC t // success and describe the table t from my_catalog
ANALYZE TABLE t // report table not found as there is no table t in the session catalog
```
### Does this PR introduce any user-facing change?
yes. When running ANALYZE TABLE, Spark fails the command if the current catalog is set to a v2 catalog, or the table name specified a v2 catalog.
### How was this patch tested?
new tests
Closes#26129 from cloud-fan/analyze-table.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
### What changes were proposed in this pull request?
This patch proposes to delete old Hive external partition directory even the partition does not exist in Hive, when insert overwrite Hive external table partition.
### Why are the changes needed?
When insert overwrite to a Hive external table partition, if the partition does not exist, Hive will not check if the external partition directory exists or not before copying files. So if users drop the partition, and then do insert overwrite to the same partition, the partition will have both old and new data.
For example:
```scala
withSQLConf(HiveUtils.CONVERT_METASTORE_PARQUET.key -> "false") {
// test is an external Hive table.
sql("INSERT OVERWRITE TABLE test PARTITION(name='n1') SELECT 1")
sql("ALTER TABLE test DROP PARTITION(name='n1')")
sql("INSERT OVERWRITE TABLE test PARTITION(name='n1') SELECT 2")
sql("SELECT id FROM test WHERE name = 'n1' ORDER BY id") // Got both 1 and 2.
}
```
### Does this PR introduce any user-facing change?
Yes. This fix a correctness issue when users drop partition on a Hive external table partition and then insert overwrite it.
### How was this patch tested?
Added test.
Closes#25979 from viirya/SPARK-29295.
Lead-authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Co-authored-by: Liang-Chi Hsieh <liangchi@uber.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
In this change, we give preference to the original table's owner if it is not empty.
### Why are the changes needed?
When executing 'insert into/overwrite ...' DML, or 'alter table set tblproperties ...' DDL, spark would change the ownership of the table the one who runs the spark application.
### Does this PR introduce any user-facing change?
NO
### How was this patch tested?
Compare with the behavior of Apache Hive
Closes#26068 from yaooqinn/SPARK-29405.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### Why are the changes needed?
As mentioned in jira, sometimes we need to be able to support the retention of null columns when writing JSON.
For example, sparkmagic(used widely in jupyter with livy) will generate sql query results based on DataSet.toJSON and parse JSON to pandas DataFrame to display. If there is a null column, it is easy to have some column missing or even the query result is empty. The loss of the null column in the first row, may cause parsing exceptions or loss of entire column data.
### Does this PR introduce any user-facing change?
Example in spark-shell.
scala> spark.sql("select null as a, 1 as b").toJSON.collect.foreach(println)
{"b":1}
scala> spark.sql("set spark.sql.jsonGenerator.struct.ignore.null=false")
res2: org.apache.spark.sql.DataFrame = [key: string, value: string]
scala> spark.sql("select null as a, 1 as b").toJSON.collect.foreach(println)
{"a":null,"b":1}
### How was this patch tested?
Add new test to JacksonGeneratorSuite
Closes#26098 from stczwd/json.
Lead-authored-by: stczwd <qcsd2011@163.com>
Co-authored-by: Jackey Lee <qcsd2011@163.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
1. Regularize all the shuffle configurations related to adaptive execution.
2. Add default value for `BlockStoreShuffleReader.shouldBatchFetch`.
### Why are the changes needed?
It's a follow-up PR for #26040.
Regularize the existing `spark.sql.adaptive.shuffle` namespace in SQLConf.
### Does this PR introduce any user-facing change?
Rename one released user config `spark.sql.adaptive.minNumPostShufflePartitions` to `spark.sql.adaptive.shuffle.minNumPostShufflePartitions`, other changed configs is not released yet.
### How was this patch tested?
Existing UT.
Closes#26147 from xuanyuanking/SPARK-9853.
Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR proposes a few typos:
1. Sparks => Spark's
2. parallize => parallelize
3. doesnt => doesn't
Closes#26140 from plusplusjiajia/fix-typos.
Authored-by: Jiajia Li <jiajia.li@intel.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request?
BIT_COUNT(N) - Returns the number of bits that are set in the argument N as an unsigned 64-bit integer, or NULL if the argument is NULL
### Why are the changes needed?
Supported by MySQL,Microsoft SQL Server ,etc.
### Does this PR introduce any user-facing change?
add a built-in function
### How was this patch tested?
add uts
Closes#26139 from yaooqinn/SPARK-29491.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
This PR takes over #19788. After we split the shuffle fetch protocol from `OpenBlock` in #24565, this optimization can be extended in the new shuffle protocol. Credit to yucai, closes#19788.
### What changes were proposed in this pull request?
This PR adds the support for continuous shuffle block fetching in batch:
- Shuffle client changes:
- Add new feature tag `spark.shuffle.fetchContinuousBlocksInBatch`, implement the decision logic in `BlockStoreShuffleReader`.
- Merge the continuous shuffle block ids in batch if needed in ShuffleBlockFetcherIterator.
- Shuffle server changes:
- Add support in `ExternalBlockHandler` for the external shuffle service side.
- Make `ShuffleBlockResolver.getBlockData` accept getting block data by range.
- Protocol changes:
- Add new block id type `ShuffleBlockBatchId` represent continuous shuffle block ids.
- Extend `FetchShuffleBlocks` and `OneForOneBlockFetcher`.
- After the new shuffle fetch protocol completed in #24565, the backward compatibility for external shuffle service can be controlled by `spark.shuffle.useOldFetchProtocol`.
### Why are the changes needed?
In adaptive execution, one reducer may fetch multiple continuous shuffle blocks from one map output file. However, as the original approach, each reducer needs to fetch those 10 reducer blocks one by one. This way needs many IO and impacts performance. This PR is to support fetching those continuous shuffle blocks in one IO (batch way). See below example:
The shuffle block is stored like below:
![image](https://user-images.githubusercontent.com/2989575/51654634-c37fbd80-1fd3-11e9-935e-5652863676c3.png)
The ShuffleId format is s"shuffle_$shuffleId_$mapId_$reduceId", referring to BlockId.scala.
In adaptive execution, one reducer may want to read output for reducer 5 to 14, whose block Ids are from shuffle_0_x_5 to shuffle_0_x_14.
Before this PR, Spark needs 10 disk IOs + 10 network IOs for each output file.
After this PR, Spark only needs 1 disk IO and 1 network IO. This way can reduce IO dramatically.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Add new UT.
Integrate test with setting `spark.sql.adaptive.enabled=true`.
Closes#26040 from xuanyuanking/SPARK-9853.
Lead-authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Co-authored-by: yucai <yyu1@ebay.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
When adaptive execution is enabled, the Spark users who connected from JDBC always get adaptive execution error whatever the under root cause is. It's very confused. We have to check the driver log to find out why.
```shell
0: jdbc:hive2://localhost:10000> SELECT * FROM testData join testData2 ON key = v;
SELECT * FROM testData join testData2 ON key = v;
Error: Error running query: org.apache.spark.SparkException: Adaptive execution failed due to stage materialization failures. (state=,code=0)
0: jdbc:hive2://localhost:10000>
```
For example, a job queried from JDBC failed due to HDFS missing block. User still get the error message `Adaptive execution failed due to stage materialization failures`.
The easiest way to reproduce is changing the code of `AdaptiveSparkPlanExec`, to let it throws out an exception when it faces `StageSuccess`.
```scala
case class AdaptiveSparkPlanExec(
events.drainTo(rem)
(Seq(nextMsg) ++ rem.asScala).foreach {
case StageSuccess(stage, res) =>
// stage.resultOption = Some(res)
val ex = new SparkException("Wrapper Exception",
new IllegalArgumentException("Root cause is IllegalArgumentException for Test"))
errors.append(
new SparkException(s"Failed to materialize query stage: ${stage.treeString}", ex))
case StageFailure(stage, ex) =>
errors.append(
new SparkException(s"Failed to materialize query stage: ${stage.treeString}", ex))
```
### Why are the changes needed?
To make the error message more user-friend and more useful for query from JDBC.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Manually test query:
```shell
0: jdbc:hive2://localhost:10000> CREATE TEMPORARY VIEW testData (key, value) AS SELECT explode(array(1, 2, 3, 4)), cast(substring(rand(), 3, 4) as string);
CREATE TEMPORARY VIEW testData (key, value) AS SELECT explode(array(1, 2, 3, 4)), cast(substring(rand(), 3, 4) as string);
+---------+--+
| Result |
+---------+--+
+---------+--+
No rows selected (0.225 seconds)
0: jdbc:hive2://localhost:10000> CREATE TEMPORARY VIEW testData2 (k, v) AS SELECT explode(array(1, 1, 2, 2)), cast(substring(rand(), 3, 4) as int);
CREATE TEMPORARY VIEW testData2 (k, v) AS SELECT explode(array(1, 1, 2, 2)), cast(substring(rand(), 3, 4) as int);
+---------+--+
| Result |
+---------+--+
+---------+--+
No rows selected (0.043 seconds)
```
Before:
```shell
0: jdbc:hive2://localhost:10000> SELECT * FROM testData join testData2 ON key = v;
SELECT * FROM testData join testData2 ON key = v;
Error: Error running query: org.apache.spark.SparkException: Adaptive execution failed due to stage materialization failures. (state=,code=0)
0: jdbc:hive2://localhost:10000>
```
After:
```shell
0: jdbc:hive2://localhost:10000> SELECT * FROM testData join testData2 ON key = v;
SELECT * FROM testData join testData2 ON key = v;
Error: Error running query: java.lang.IllegalArgumentException: Root cause is IllegalArgumentException for Test (state=,code=0)
0: jdbc:hive2://localhost:10000>
```
Closes#25960 from LantaoJin/SPARK-29283.
Authored-by: lajin <lajin@ebay.com>
Signed-off-by: Yuming Wang <wgyumg@gmail.com>
### What changes were proposed in this pull request?
bool_or(x) <=> any/some(x) <=> max(x)
bool_and(x) <=> every(x) <=> min(x)
Args:
x: boolean
### Why are the changes needed?
PostgreSQL, Presto and Vertica, etc also support this feature:
### Does this PR introduce any user-facing change?
add new functions support
### How was this patch tested?
add ut
Closes#26126 from yaooqinn/SPARK-27880.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Proposed new expression `SubtractDates` which is used in `date1` - `date2`. It has the `INTERVAL` type, and returns the interval from `date1` (inclusive) and `date2` (exclusive). For example:
```sql
> select date'tomorrow' - date'yesterday';
interval 2 days
```
Closes#26034
### Why are the changes needed?
- To conform the SQL standard which states the result type of `date operand 1` - `date operand 2` must be the interval type. See [4.5.3 Operations involving datetimes and intervals](http://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt).
- Improve Spark SQL UX and allow mixing date and timestamp in subtractions. For example: `select timestamp'now' + (date'2019-10-01' - date'2019-09-15')`
### Does this PR introduce any user-facing change?
Before the query below returns number of days:
```sql
spark-sql> select date'2019-10-05' - date'2018-09-01';
399
```
After it returns an interval:
```sql
spark-sql> select date'2019-10-05' - date'2018-09-01';
interval 1 years 1 months 4 days
```
### How was this patch tested?
- by new tests in `DateExpressionsSuite` and `TypeCoercionSuite`.
- by existing tests in `date.sql`
Closes#26112 from MaxGekk/date-subtract.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Yuming Wang <wgyumg@gmail.com>
### What changes were proposed in this pull request?
Change Literal.sql to output CAST('fpValue' AS FLOAT) instead of CAST(fpValue AS FLOAT) as the SQL for a floating point literal.
### Why are the changes needed?
The old version doesn't work for very small floating point numbers; the value will fail to parse if it doesn't fit in a DECIMAL(38).
This doesn't apply to doubles because they have special literal syntax.
### Does this PR introduce any user-facing change?
Not really.
### How was this patch tested?
New unit tests.
Closes#26114 from jose-torres/fpliteral.
Authored-by: Jose Torres <joseph.torres@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Support FETCH_PRIOR fetching in Thriftserver, and report correct fetch start offset it TFetchResultsResp.results.startRowOffset
The semantics of FETCH_PRIOR are as follow: Assuming the previous fetch returned a block of rows from offsets [10, 20)
* calling FETCH_PRIOR(maxRows=5) will scroll back and return rows [5, 10)
* calling FETCH_PRIOR(maxRows=10) again, will scroll back, but can't go earlier than 0. It will nevertheless return 10 rows, returning rows [0, 10) (overlapping with the previous fetch)
* calling FETCH_PRIOR(maxRows=4) again will again return rows starting from offset 0 - [0, 4)
* calling FETCH_NEXT(maxRows=6) after that will move the cursor forward and return rows [4, 10)
##### Client/server backwards/forwards compatibility:
Old driver with new server:
* Drivers that don't support FETCH_PRIOR will not attempt to use it
* Field TFetchResultsResp.results.startRowOffset was not set, old drivers don't depend on it.
New driver with old server
* Using an older thriftserver with FETCH_PRIOR will make the thriftserver return unsupported operation error. The driver can then recognize that it's an old server.
* Older thriftserver will return TFetchResultsResp.results.startRowOffset=0. If the client driver receives 0, it can know that it can not rely on it as correct offset. If the client driver intentionally wants to fetch from 0, it can use FETCH_FIRST.
### Why are the changes needed?
It's intended to be used to recover after connection errors. If a client lost connection during fetching (e.g. of rows [10, 20)), and wants to reconnect and continue, it could not know whether the request got lost before reaching the server, or on the response back. When it issued another FETCH_NEXT(10) request after reconnecting, because TFetchResultsResp.results.startRowOffset was not set, it could not know if the server will return rows [10,20) (because the previous request didn't reach it) or rows [20, 30) (because it returned data from the previous request but the connection got broken on the way back). Now, with TFetchResultsResp.results.startRowOffset the client can know after reconnecting which rows it is getting, and use FETCH_PRIOR to scroll back if a fetch block was lost in transmission.
Driver should always use FETCH_PRIOR after a broken connection.
* If the Thriftserver returns unsuported operation error, the driver knows that it's an old server that doesn't support it. The driver then must error the query, as it will also not support returning the correct startRowOffset, so the driver cannot reliably guarantee if it hadn't lost any rows on the fetch cursor.
* If the driver gets a response to FETCH_PRIOR, it should also have a correctly set startRowOffset, which the driver can use to position itself back where it left off before the connection broke.
* If FETCH_NEXT was used after a broken connection on the first fetch, and returned with an startRowOffset=0, then the client driver can't know if it's 0 because it's the older server version, or if it's genuinely 0. Better to call FETCH_PRIOR, as scrolling back may anyway be possibly required after a broken connection.
This way it is implemented in a backwards/forwards compatible way, and doesn't require bumping the protocol version. FETCH_ABSOLUTE might have been better, but that would require a bigger protocol change, as there is currently no field to specify the requested absolute offset.
### Does this PR introduce any user-facing change?
ODBC/JDBC drivers connecting to Thriftserver may now implement using the FETCH_PRIOR fetch order to scroll back in query results, and check TFetchResultsResp.results.startRowOffset if their cursor position is consistent after connection errors.
### How was this patch tested?
Added tests to HiveThriftServer2Suites
Closes#26014 from juliuszsompolski/SPARK-29349.
Authored-by: Juliusz Sompolski <julek@databricks.com>
Signed-off-by: Yuming Wang <wgyumg@gmail.com>
### What changes were proposed in this pull request?
A followup of https://github.com/apache/spark/pull/25295
This PR proposes a few code cleanups:
1. rename the special `getMapSizesByExecutorId` to `getMapSizesByMapIndex`
2. rename the parameter `mapId` to `mapIndex` as that's really a mapper index.
3. `BlockStoreShuffleReader` should take `blocksByAddress` directly instead of a map id.
4. rename `getMapReader` to `getReaderForOneMapper` to be more clearer.
### Why are the changes needed?
make code easier to understand
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
existing tests
Closes#26128 from cloud-fan/followup.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
Updating univocity-parsers version to 2.8.3, which adds support for multiple character delimiters
Moving univocity-parsers version to spark-parent pom dependencyManagement section
Adding new utility method to build multi-char delimiter string, which delegates to existing one
Adding tests for multiple character delimited CSV
### What changes were proposed in this pull request?
Adds support for parsing CSV data using multiple-character delimiters. Existing logic for converting the input delimiter string to characters was kept and invoked in a loop. Project dependencies were updated to remove redundant declaration of `univocity-parsers` version, and also to change that version to the latest.
### Why are the changes needed?
It is quite common for people to have delimited data, where the delimiter is not a single character, but rather a sequence of characters. Currently, it is difficult to handle such data in Spark (typically needs pre-processing).
### Does this PR introduce any user-facing change?
Yes. Specifying the "delimiter" option for the DataFrame read, and providing more than one character, will no longer result in an exception. Instead, it will be converted as before and passed to the underlying library (Univocity), which has accepted multiple character delimiters since 2.8.0.
### How was this patch tested?
The `CSVSuite` tests were confirmed passing (including new methods), and `sbt` tests for `sql` were executed.
Closes#26027 from jeff303/SPARK-24540.
Authored-by: Jeff Evans <jeffrey.wayne.evans@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request?
When inserting a value into a column with the different data type, Spark performs type coercion. Currently, we support 3 policies for the store assignment rules: ANSI, legacy and strict, which can be set via the option "spark.sql.storeAssignmentPolicy":
1. ANSI: Spark performs the type coercion as per ANSI SQL. In practice, the behavior is mostly the same as PostgreSQL. It disallows certain unreasonable type conversions such as converting `string` to `int` and `double` to `boolean`. It will throw a runtime exception if the value is out-of-range(overflow).
2. Legacy: Spark allows the type coercion as long as it is a valid `Cast`, which is very loose. E.g., converting either `string` to `int` or `double` to `boolean` is allowed. It is the current behavior in Spark 2.x for compatibility with Hive. When inserting an out-of-range value to a integral field, the low-order bits of the value is inserted(the same as Java/Scala numeric type casting). For example, if 257 is inserted to a field of Byte type, the result is 1.
3. Strict: Spark doesn't allow any possible precision loss or data truncation in store assignment, e.g., converting either `double` to `int` or `decimal` to `double` is allowed. The rules are originally for Dataset encoder. As far as I know, no mainstream DBMS is using this policy by default.
Currently, the V1 data source uses "Legacy" policy by default, while V2 uses "Strict". This proposal is to use "ANSI" policy by default for both V1 and V2 in Spark 3.0.
### Why are the changes needed?
Following the ANSI SQL standard is most reasonable among the 3 policies.
### Does this PR introduce any user-facing change?
Yes.
The default store assignment policy is ANSI for both V1 and V2 data sources.
### How was this patch tested?
Unit test
Closes#26107 from gengliangwang/ansiPolicyAsDefault.
Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
## What changes were proposed in this pull request?
Implement a rule in the new adaptive execution framework introduced in [SPARK-23128](https://issues.apache.org/jira/browse/SPARK-23128). This rule is used to optimize the shuffle reader to local shuffle reader when smj is converted to bhj in adaptive execution.
## How was this patch tested?
Existing tests
Closes#25295 from JkSelf/localShuffleOptimization.
Authored-by: jiake <ke.a.jia@intel.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
move the statement logical plans that were created for v2 commands to a new file `statements.scala`, under the same package of `v2Commands.scala`.
This PR also includes some minor cleanups:
1. remove `private[sql]` from `ParsedStatement` as it's in the private package.
2. remove unnecessary override of `output` and `children`.
3. add missing classdoc.
### Why are the changes needed?
Similar to https://github.com/apache/spark/pull/26111 , this is to better organize the logical plans of data source v2.
It's a bit weird to put the statements in the package `org.apache.spark.sql.catalyst.plans.logical.sql` as `sql` is not a good sub-package name in Spark SQL.
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
existing tests
Closes#26125 from cloud-fan/statement.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
There will be 2 times unsafeProjection convert operation When we read a Parquet data file use non-vectorized mode:
1. `ParquetGroupConverter` call unsafeProjection function to covert `SpecificInternalRow` to `UnsafeRow` every times when read Parquet data file use `ParquetRecordReader`.
2. `ParquetFileFormat` will call unsafeProjection function to covert this `UnsafeRow` to another `UnsafeRow` again when partitionSchema is not empty in DataSourceV1 branch, and `PartitionReaderWithPartitionValues` will always do this convert operation in DataSourceV2 branch.
In this pr, remove `unsafeProjection` convert operation in `ParquetGroupConverter` and change `ParquetRecordReader` to produce `SpecificInternalRow` instead of `UnsafeRow`.
### Why are the changes needed?
The first time convert in `ParquetGroupConverter` is redundant and `ParquetRecordReader` return a `InternalRow(SpecificInternalRow)` is enough.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Unit Test
Closes#26106 from LuciferYang/spark-parquet-unsafe-projection.
Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Refine the document of v2 session catalog config, to clearly explain what it is, when it should be used and how to implement it.
### Why are the changes needed?
Make this config more understandable
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Pass the Jenkins with the newly updated test cases.
Closes#26071 from cloud-fan/config.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR adds JSON serialization for Spark external Rows.
### Why are the changes needed?
This is to be used for observable metrics where the `StreamingQueryProgress` contains a map of observed metrics rows which needs to be serialized in some cases.
### Does this PR introduce any user-facing change?
Yes, a user can call `toJson` on rows returned when collecting a DataFrame to the driver.
### How was this patch tested?
Added a new test suite: `RowJsonSuite` that should test this.
Closes#26013 from hvanhovell/SPARK-29347.
Authored-by: herman <herman@databricks.com>
Signed-off-by: herman <herman@databricks.com>
### What changes were proposed in this pull request?
move the v2 command logical plans from `basicLogicalOperators.scala` to a new file `v2Commands.scala`
### Why are the changes needed?
As we keep adding v2 commands, the `basicLogicalOperators.scala` grows bigger and bigger. It's better to have a separated file for them.
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
not needed
Closes#26111 from cloud-fan/command.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR aims to fix the behavior of `mode("default")` to set `SaveMode.ErrorIfExists`. Also, this PR updates the exception message by adding `default` explicitly.
### Why are the changes needed?
This is reported during `GRAPH API` PR. This builder pattern should work like the documentation.
### Does this PR introduce any user-facing change?
Yes if the app has multiple `mode()` invocation including `mode("default")` and the `mode("default")` is the last invocation. This is really a corner case.
- Previously, the last invocation was handled as `No-Op`.
- After this bug fix, it will work like the documentation.
### How was this patch tested?
Pass the Jenkins with the newly added test case.
Closes#26094 from dongjoon-hyun/SPARK-29442.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
In the PR, I propose to move interval parsing to `CalendarInterval.fromCaseInsensitiveString()` which throws an `IllegalArgumentException` for invalid strings, and reuse it from `CalendarInterval.fromString()`. The former one handles `IllegalArgumentException` only and returns `NULL` for invalid interval strings. This will allow to support interval strings without the `interval` prefix in casting strings to intervals and in interval type constructor because they use `fromString()` for parsing string intervals.
For example:
```sql
spark-sql> select cast('1 year 10 days' as interval);
interval 1 years 1 weeks 3 days
spark-sql> SELECT INTERVAL '1 YEAR 10 DAYS';
interval 1 years 1 weeks 3 days
```
### Why are the changes needed?
To maintain feature parity with PostgreSQL which supports interval strings without prefix:
```sql
# select interval '2 months 1 microsecond';
interval
------------------------
2 mons 00:00:00.000001
```
and to improve Spark SQL UX.
### Does this PR introduce any user-facing change?
Yes, previously parsing of interval strings without `interval` gives `NULL`:
```sql
spark-sql> select interval '2 months 1 microsecond';
NULL
```
After:
```sql
spark-sql> select interval '2 months 1 microsecond';
interval 2 months 1 microseconds
```
### How was this patch tested?
- Added new tests to `CalendarIntervalSuite.java`
- A test for casting strings to intervals in `CastSuite`
- Test for interval type constructor from strings in `ExpressionParserSuite`
Closes#26079 from MaxGekk/interval-str-without-prefix.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Currently, `SHOW NAMESPACES` and `SHOW DATABASES` are separate code paths. This PR merges two implementations.
### Why are the changes needed?
To remove code/behavior duplication
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Added new unit tests.
Closes#26006 from imback82/combine_show.
Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR adds 2 changes regarding exception handling in `SQLQueryTestSuite` and `ThriftServerQueryTestSuite`
- fixes an expected output sorting issue in `ThriftServerQueryTestSuite` as if there is an exception then there is no need for sort
- introduces common exception handling in those 2 suites with a new `handleExceptions` method
### Why are the changes needed?
Currently `ThriftServerQueryTestSuite` passes on master, but it fails on one of my PRs (https://github.com/apache/spark/pull/23531) with this error (https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/111651/testReport/org.apache.spark.sql.hive.thriftserver/ThriftServerQueryTestSuite/sql_3/):
```
org.scalatest.exceptions.TestFailedException: Expected "
[Recursion level limit 100 reached but query has not exhausted, try increasing spark.sql.cte.recursion.level.limit
org.apache.spark.SparkException]
", but got "
[org.apache.spark.SparkException
Recursion level limit 100 reached but query has not exhausted, try increasing spark.sql.cte.recursion.level.limit]
" Result did not match for query #4 WITH RECURSIVE r(level) AS ( VALUES (0) UNION ALL SELECT level + 1 FROM r ) SELECT * FROM r
```
The unexpected reversed order of expected output (error message comes first, then the exception class) is due to this line: https://github.com/apache/spark/pull/26028/files#diff-b3ea3021602a88056e52bf83d8782de8L146. It should not sort the expected output if there was an error during execution.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Existing UTs.
Closes#26028 from peter-toth/SPARK-29359-better-exception-handling.
Authored-by: Peter Toth <peter.toth@gmail.com>
Signed-off-by: Yuming Wang <wgyumg@gmail.com>
### What changes were proposed in this pull request?
This PR is a very minor follow-up to become robust because `spark.sql.additionalRemoteRepositories` is a configuration which has a comma-separated value.
### Why are the changes needed?
This makes sure that `getHiveContribJar` will not fail on the configuration changes.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Manual. Change the default value with multiple repositories and run the following.
```
build/sbt -Phive "project hive" "test-only org.apache.spark.sql.hive.HiveSparkSubmitSuite"
```
Closes#26096 from dongjoon-hyun/SPARK-27831.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Yuming Wang <wgyumg@gmail.com>
### What changes were proposed in this pull request?
Revert this commit 18b7ad2fc5.
### Why are the changes needed?
See https://github.com/apache/spark/pull/16304#discussion_r92753590
### Does this PR introduce any user-facing change?
Yes
### How was this patch tested?
There is no test for that.
Closes#26101 from MaxGekk/revert-mean-seconds-per-month.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request?
Update ANSI mode related config names in comments as "spark.sql.ansi.enabled"
### Why are the changes needed?
The removed configuration `spark.sql.parser.ansi.enabled` and `spark.sql.failOnIntegralTypeOverflow` still exist in code comments.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Grep the whole code to ensure the remove config names no longer exist.
```
git grep "parser.ansi.enabled"
git grep failOnIntegralTypeOverflow
git grep decimalOperationsNullOnOverflow
git grep ANSI_SQL_PARSER
git grep FAIL_ON_INTEGRAL_TYPE_OVERFLOW
git grep DECIMAL_OPERATIONS_NULL_ON_OVERFLOW
```
Closes#26067 from gengliangwang/spark-28989-followup.
Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Use `.sameElements` to compare (non-nested) arrays, as `Arrays.deep` is removed in 2.13 and wasn't the best way to do this in the first place.
### Why are the changes needed?
To compile with 2.13.
### Does this PR introduce any user-facing change?
None.
### How was this patch tested?
Existing tests.
Closes#26073 from srowen/SPARK-29416.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Replace `Unit` with equivalent `()` where code refers to the `Unit` companion object.
### Why are the changes needed?
It doesn't compile otherwise in Scala 2.13.
- https://github.com/scala/scala/blob/v2.13.0/src/library/scala/Unit.scala#L30
### Does this PR introduce any user-facing change?
Should be no behavior change at all.
### How was this patch tested?
Existing tests.
Closes#26070 from srowen/SPARK-29411.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR adds an accumulator that computes a global aggregate over a number of rows. A user can define an arbitrary number of aggregate functions which can be computed at the same time.
The accumulator uses the standard technique for implementing (interpreted) aggregation in Spark. It uses projections and manual updates for each of the aggregation steps (initialize buffer, update buffer with new input row, merge two buffers and compute the final result on the buffer). Note that two of the steps (update and merge) use the aggregation buffer both as input and output.
Accumulators do not have an explicit point at which they get serialized. A somewhat surprising side effect is that the buffers of a `TypedImperativeAggregate` go over the wire as-is instead of serializing them. The merging logic for `TypedImperativeAggregate` assumes that the input buffer contains serialized buffers, this is violated by the accumulator's implicit serialization. In order to get around this I have added `mergeBuffersObjects` method that merges two unserialized buffers to `TypedImperativeAggregate`.
### Why are the changes needed?
This is the mechanism we are going to use to implement observable metrics.
### Does this PR introduce any user-facing change?
No, not yet.
### How was this patch tested?
Added `AggregatingAccumulator` test suite.
Closes#26012 from hvanhovell/SPARK-29346.
Authored-by: herman <herman@databricks.com>
Signed-off-by: herman <herman@databricks.com>
### What changes were proposed in this pull request?
DataSourceV2 Exec classes (ShowTablesExec, ShowNamespacesExec, etc.) all extend LeafExecNode. This results in running a job when executeCollect() is called. This breaks the previous behavior [SPARK-19650](https://issues.apache.org/jira/browse/SPARK-19650).
A new command physical operator will be introduced form which all V2 Exec classes derive to avoid running a job.
### Why are the changes needed?
It is a bug since the current behavior runs a spark job, which breaks the existing behavior: [SPARK-19650](https://issues.apache.org/jira/browse/SPARK-19650).
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Existing unit tests.
Closes#26048 from imback82/dsv2_command.
Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Invocations like `sc.parallelize(Array((1,2)))` cause a compile error in 2.13, like:
```
[ERROR] [Error] /Users/seanowen/Documents/spark_2.13/core/src/test/scala/org/apache/spark/ShuffleSuite.scala:47: overloaded method value apply with alternatives:
(x: Unit,xs: Unit*)Array[Unit] <and>
(x: Double,xs: Double*)Array[Double] <and>
(x: Float,xs: Float*)Array[Float] <and>
(x: Long,xs: Long*)Array[Long] <and>
(x: Int,xs: Int*)Array[Int] <and>
(x: Char,xs: Char*)Array[Char] <and>
(x: Short,xs: Short*)Array[Short] <and>
(x: Byte,xs: Byte*)Array[Byte] <and>
(x: Boolean,xs: Boolean*)Array[Boolean]
cannot be applied to ((Int, Int), (Int, Int), (Int, Int), (Int, Int))
```
Using a `Seq` instead appears to resolve it, and is effectively equivalent.
### Why are the changes needed?
To better cross-build for 2.13.
### Does this PR introduce any user-facing change?
None.
### How was this patch tested?
Existing tests.
Closes#26062 from srowen/SPARK-29401.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Syntax like `'foo` is deprecated in Scala 2.13. Replace usages with `Symbol("foo")`
### Why are the changes needed?
Avoids ~50 deprecation warnings when attempting to build with 2.13.
### Does this PR introduce any user-facing change?
None, should be no functional change at all.
### How was this patch tested?
Existing tests.
Closes#26061 from srowen/SPARK-29392.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Beautify comment
### Why are the changes needed?
The config name now is pretty weird.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
No test.
Closes#26054 from wangshisan/SPARK-29189.
Authored-by: gwang3 <gwang3@ebay.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Reimplement `org.apache.spark.sql.catalyst.util.QuantileSummaries#merge` and add a test-case showing the previous bug.
### Why are the changes needed?
The original Greenwald-Khanna paper, from which the algorithm behind `approxQuantile` was taken, does not cover how to merge the result of multiple parallel QuantileSummaries. The current implementation violates some invariants and therefore the effective error can be larger than the specified.
### Does this PR introduce any user-facing change?
Yes, for same cases, the results from `approxQuantile` (`percentile_approx` in SQL) will now be within the expected error margin. For example:
```scala
var values = (1 to 100).toArray
val all_quantiles = values.indices.map(i => (i+1).toDouble / values.length).toArray
for (n <- 0 until 5) {
var df = spark.sparkContext.makeRDD(values).toDF("value").repartition(5)
val all_answers = df.stat.approxQuantile("value", all_quantiles, 0.1)
val all_answered_ranks = all_answers.map(ans => values.indexOf(ans)).toArray
val error = all_answered_ranks.zipWithIndex.map({ case (answer, expected) => Math.abs(expected - answer) }).toArray
val max_error = error.max
print(max_error + "\n")
}
```
In the current build it returns:
```
16
12
10
11
17
```
I couldn't run the code with this patch applied to double check the implementation. Can someone please confirm it now outputs at most `10`, please?
### How was this patch tested?
A new unit test was added to uncover the previous bug.
Closes#26029 from sitegui/SPARK-29336.
Authored-by: Guilherme <sitegui@sitegui.com.br>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request?
Set the default value of the `spark.sql.legacy.sizeOfNull` config to `false`. That changes behavior of the `size()` function for `NULL`. The function will return `NULL` for `NULL` instead of `-1`.
### Why are the changes needed?
There is the agreement in the PR https://github.com/apache/spark/pull/21598#issuecomment-399695523 to change behavior in Spark 3.0.
### Does this PR introduce any user-facing change?
Yes.
Before:
```sql
spark-sql> select size(NULL);
-1
```
After:
```sql
spark-sql> select size(NULL);
NULL
```
### How was this patch tested?
By the `check outputs of expression examples` test in `SQLQuerySuite` which runs expression examples.
Closes#26051 from MaxGekk/sizeof-null-returns-null.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Add back the resolved logical plan for UPDATE TABLE. It was in https://github.com/apache/spark/pull/25626 before but was removed later.
### Why are the changes needed?
In https://github.com/apache/spark/pull/25626 , we decided to not add the update API in DS v2, but we still want to implement UPDATE for builtin source like JDBC. We should at least add the resolved logical plan.
### Does this PR introduce any user-facing change?
no, UPDATE is still not supported yet.
### How was this patch tested?
new tests.
Closes#26025 from cloud-fan/update.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
### What changes were proposed in this pull request?
This PR aims the followings.
- Refactor `TPCDSQueryBenchmark` to use main method to improve the usability.
- Reduce the number of iteration from 5 to 2 because it takes too long. (2 is okay because we have `Stdev` field now. If there is an irregular run, we can notice easily with that).
- Generate one result file for TPCDS scale factor 1. (Note that this test suite can be used for the other scale factors, too.)
- AWS EC2 `r3.xlarge` with `ami-06f2f779464715dc5 (ubuntu-bionic-18.04-amd64-server-20190722.1)` is used.
This PR adds a JDK8 result based on the TPCDS ScaleFactor 1G data generated by the following.
```
# `spark-tpcds-datagen` needs this. (JDK8)
$ git clone https://github.com/apache/spark.git -b branch-2.4 --depth 1 spark-2.4
$ export SPARK_HOME=$PWD
$ ./build/mvn clean package -DskipTests
# Generate data. (JDK8)
$ git clone gitgithub.com:maropu/spark-tpcds-datagen.git
$ cd spark-tpcds-datagen/
$ build/mvn clean package
$ mkdir -p /data/tpcds
$ ./bin/dsdgen --output-location /data/tpcds/s1 // This need `Spark 2.4`
```
### Why are the changes needed?
Although the generated TPCDS data is random, we can keep the record.
### Does this PR introduce any user-facing change?
No. (This is dev-only test benchmark).
### How was this patch tested?
Manually run the benchmark. Please note that you need to have TPCDS data.
```
SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.execution.benchmark.TPCDSQueryBenchmark --data-location /data/tpcds/s1"
```
Closes#26049 from dongjoon-hyun/SPARK-25668.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
In our PROD env, we have a pure Spark cluster, I think this is also pretty common, where computation is separated from storage layer. In such deploy mode, data locality is never reachable.
And there are some configurations in Spark scheduler to reduce waiting time for data locality(e.g. "spark.locality.wait"). While, problem is that, in listing file phase, the location informations of all the files, with all the blocks inside each file, are all fetched from the distributed file system. Actually, in a PROD environment, a table can be so huge that even fetching all these location informations need take tens of seconds.
To improve such scenario, Spark need provide an option, where data locality can be totally ignored, all we need in the listing file phase are the files locations, without any block location informations.
### Why are the changes needed?
And we made a benchmark in our PROD env, after ignore the block locations, we got a pretty huge improvement.
Table Size | Total File Number | Total Block Number | List File Duration(With Block Location) | List File Duration(Without Block Location)
-- | -- | -- | -- | --
22.6T | 30000 | 120000 | 16.841s | 1.730s
28.8 T | 42001 | 148964 | 10.099s | 2.858s
3.4 T | 20000 | 20000 | 5.833s | 4.881s
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Via ut.
Closes#25869 from wangshisan/SPARK-29189.
Authored-by: gwang3 <gwang3@ebay.com>
Signed-off-by: Imran Rashid <irashid@cloudera.com>
### What changes were proposed in this pull request?
In the PR, I propose to pass the `Pattern.CASE_INSENSITIVE` flag while compiling interval patterns in `CalendarInterval`. This makes casting string values to intervals case insensitive and tolerant to case of the `interval`, `year(s)`, `month(s)`, `week(s)`, `day(s)`, `hour(s)`, `minute(s)`, `second(s)`, `millisecond(s)` and `microsecond(s)`.
### Why are the changes needed?
There are at least 2 reasons:
- To maintain feature parity with PostgreSQL which is not sensitive to case:
```sql
# select cast('10 Days' as INTERVAL);
interval
----------
10 days
(1 row)
```
- Spark is tolerant to case of interval literals. Case insensitivity in casting should be convenient for Spark users.
```sql
spark-sql> SELECT INTERVAL 1 YEAR 1 WEEK;
interval 1 years 1 weeks
```
### Does this PR introduce any user-facing change?
Yes, current implementation produces `NULL` for `interval`, `year`, ... `microsecond` that are not in lower case.
Before:
```sql
spark-sql> SELECT CAST('INTERVAL 10 DAYS' as INTERVAL);
NULL
```
After:
```sql
spark-sql> SELECT CAST('INTERVAL 10 DAYS' as INTERVAL);
interval 1 weeks 3 days
```
### How was this patch tested?
- by new tests in `CalendarIntervalSuite.java`
- new test in `CastSuite`
Closes#26010 from MaxGekk/interval-case-insensitive.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
I introduced new constants `SECONDS_PER_MONTH` and `MILLIS_PER_MONTH`, and reused it in calculations of seconds/milliseconds per month. `SECONDS_PER_MONTH` is 2629746 because the average year of the Gregorian calendar is 365.2425 days long or 60 * 60 * 24 * 365.2425 = 31556952.0 = 12 * 2629746 seconds per year.
### Why are the changes needed?
Spark uses the proleptic Gregorian calendar (see https://issues.apache.org/jira/browse/SPARK-26651) in which the average year is 365.2425 days (see https://en.wikipedia.org/wiki/Gregorian_calendar) but existing implementation assumes 31 days per months or 12 * 31 = 372 days. That's far away from the the truth.
### Does this PR introduce any user-facing change?
Yes, the changes affect at least 3 methods in `GroupStateImpl`, `EventTimeWatermark` and `MonthsBetween`. For example, the `month_between()` function will return different result in some cases.
Before:
```sql
spark-sql> select months_between('2019-09-15', '1970-01-01');
596.4516129
```
After:
```sql
spark-sql> select months_between('2019-09-15', '1970-01-01');
596.45996838
```
### How was this patch tested?
By existing test suite `DateTimeUtilsSuite`, `DateFunctionsSuite` and `DateExpressionsSuite`.
Closes#25998 from MaxGekk/days-in-year.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request?
Added new rules to `TypeCoercion.DateTimeOperations` for the `Subtract` expression which is replaced by existing `TimestampDiff` expression if one of its parameter has the `DATE` type and another one is the `TIMESTAMP` type. The date argument is casted to timestamp.
### Why are the changes needed?
- To maintain feature parity with PostgreSQL which supports subtraction of a date from a timestamp and a timestamp from a date:
```sql
maxim=# select timestamp'now' - date'epoch';
?column?
----------------------------
18175 days 21:07:33.412875
(1 row)
maxim=# select date'2020-01-01' - timestamp'now';
?column?
-------------------------
86 days 02:52:00.945296
(1 row)
```
- To conform to the SQL standard which defines datetime subtraction as an interval.
### Does this PR introduce any user-facing change?
Yes, currently the queries bellow fails with the error:
```sql
spark-sql> select timestamp'now' - date'2019-10-01';
Error in query: cannot resolve '(TIMESTAMP('2019-10-06 21:05:07.234') - DATE '2019-10-01')' due to data type mismatch: differing types in '(TIMESTAMP('2019-10-06 21:05:07.234') - DATE '2019-10-01')' (timestamp and date).; line 1 pos 7;
'Project [unresolvedalias((1570385107234000 - 18170), None)]
+- OneRowRelation
```
after the changes:
```sql
spark-sql> select timestamp'now' - date'2019-10-01';
interval 5 days 21 hours 4 minutes 55 seconds 878 milliseconds
```
### How was this patch tested?
- Add new cases to the `rule for date/timestamp operations` test in `TypeCoercionSuite`
- by 2 new test in `datetime.sql`
Closes#26036 from MaxGekk/date-timestamp-subtract.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Change to use AtomicLong instead of a var in the test.
### Why are the changes needed?
Fix flaky test.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Existing UT.
Closes#26020 from xuanyuanking/SPARK-25159.
Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Added new expression `TimestampDiff` for timestamp subtractions. It accepts 2 timestamp expressions and returns another one of the `CalendarIntervalType`. While creating an instance of `CalendarInterval`, it initializes only the microsecond field by difference of the given timestamps in microseconds, and set the `months` field to zero. Also I added an rule for conversion `Subtract` to `TimestampDiff`, and enabled already ported test queries in `postgreSQL/timestamp.sql`.
### Why are the changes needed?
To maintain feature parity with PostgreSQL which allows to get timestamp difference:
```sql
# select timestamp'today' - timestamp'yesterday';
?column?
----------
1 day
(1 row)
```
### Does this PR introduce any user-facing change?
Yes, previously users got the following error from timestamp subtraction:
```sql
spark-sql> select timestamp'today' - timestamp'yesterday';
Error in query: cannot resolve '(TIMESTAMP('2019-10-04 00:00:00') - TIMESTAMP('2019-10-03 00:00:00'))' due to data type mismatch: '(TIMESTAMP('2019-10-04 00:00:00') - TIMESTAMP('2019-10-03 00:00:00'))' requires (numeric or interval) type, not timestamp; line 1 pos 7;
'Project [unresolvedalias((1570136400000000 - 1570050000000000), None)]
+- OneRowRelation
```
after the changes they should get an interval:
```sql
spark-sql> select timestamp'today' - timestamp'yesterday';
interval 1 days
```
### How was this patch tested?
- Added tests for `TimestampDiff` to `DateExpressionsSuite`
- By new test in `TypeCoercionSuite`.
- Enabled tests in `postgreSQL/timestamp.sql`.
Closes#26022 from MaxGekk/timestamp-diff.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Currently we deal with different `ParsedStatement` in many places and write duplicated catalog/table lookup logic. In general the lookup logic is
1. try look up the catalog by name. If no such catalog, and default catalog is not set, convert `ParsedStatement` to v1 command like `ShowDatabasesCommand`. Otherwise, convert `ParsedStatement` to v2 command like `ShowNamespaces`.
2. try look up the table by name. If no such table, fail. If the table is a `V1Table`, convert `ParsedStatement` to v1 command like `CreateTable`. Otherwise, convert `ParsedStatement` to v2 command like `CreateV2Table`.
However, since the code is duplicated we don't apply this lookup logic consistently. For example, we forget to consider the v2 session catalog in several places.
This PR centralizes the catalog/table lookup logic by 3 rules.
1. `ResolveCatalogs` (in catalyst). This rule resolves v2 catalog from the multipart identifier in SQL statements, and convert the statement to v2 command if the resolved catalog is not session catalog. If the command needs to resolve the table (e.g. ALTER TABLE), put an `UnresolvedV2Table` in the command.
2. `ResolveTables` (in catalyst). It resolves `UnresolvedV2Table` to `DataSourceV2Relation`.
3. `ResolveSessionCatalog` (in sql/core). This rule is only effective if the resolved catalog is session catalog. For commands that don't need to resolve the table, this rule converts the statement to v1 command directly. Otherwise, it converts the statement to v1 command if the resolved table is v1 table, and convert to v2 command if the resolved table is v2 table. Hopefully we can remove this rule eventually when v1 fallback is not needed anymore.
### Why are the changes needed?
Reduce duplicated code and make the catalog/table lookup logic consistent.
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
existing tests
Closes#25747 from cloud-fan/lookup.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Add access modifier `protected` for `sparkConf` in SQLQueryTestSuite, because in the parent trait SharedSparkSession, it is protected.
### Why are the changes needed?
Code consistency.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Existing UT.
Closes#26019 from xuanyuanking/SPARK-29203.
Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
1. With ANSI store assignment policy, an exception is thrown on casting failure
2. Introduce a new expression `AnsiCast` for the ANSI store assignment policy, so that the store assignment policy configuration won't affect the general `Cast`.
### Why are the changes needed?
As per ANSI SQL standard, ANSI store assignment policy should throw an exception on insertion failure, such as inserting out-of-range value to a numeric field.
### Does this PR introduce any user-facing change?
With ANSI store assignment policy, an exception is thrown on casting failure
### How was this patch tested?
Unit test
Closes#25997 from gengliangwang/newCast.
Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Dynamic partition pruning filters are added as an in-subquery containing a `BroadcastExchangeExec` in case of a broadcast hash join. This PR makes the `ReuseExchange` rule visit in-subquery nodes, to ensure the new `BroadcastExchangeExec` added by dynamic partition pruning can be reused.
### Why are the changes needed?
This initial dynamic partition pruning PR did not enable this reuse, which means a broadcast exchange would be executed twice, in the main query and in the DPP filter.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Added broadcast exchange reuse check in `DynamicPartitionPruningSuite`
Closes#26015 from maryannxue/exchange-reuse.
Authored-by: maryannxue <maryannxue@apache.org>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
### What changes were proposed in this pull request?
Add an overload for the higher order function `filter` that takes array index as its second argument to `org.apache.spark.sql.functions`.
### Why are the changes needed?
See: SPARK-28962 and SPARK-27297. Specifically ueshin pointing out the discrepency here: https://github.com/apache/spark/pull/24232#issuecomment-533288653
### Does this PR introduce any user-facing change?
### How was this patch tested?
Updated the these test suites:
`test.org.apache.spark.sql.JavaHigherOrderFunctionsSuite`
and
`org.apache.spark.sql.DataFrameFunctionsSuite`
Closes#26007 from nvander1/add_index_overload_for_filter.
Authored-by: Nik Vanderhoof <nikolasrvanderhoof@gmail.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
### What changes were proposed in this pull request?
This PR renames `object JSONBenchmark` to `object JsonBenchmark` and the benchmark result file `JSONBenchmark-results.txt` to `JsonBenchmark-results.txt`.
### Why are the changes needed?
Since the file name doesn't match with `object JSONBenchmark`, it makes a confusion when we run the benchmark. In addition, this makes the automation difficult.
```
$ find . -name JsonBenchmark.scala
./sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/json/JsonBenchmark.scala
```
```
$ build/sbt "sql/test:runMain org.apache.spark.sql.execution.datasources.json.JsonBenchmark"
[info] Running org.apache.spark.sql.execution.datasources.json.JsonBenchmark
[error] Error: Could not find or load main class org.apache.spark.sql.execution.datasources.json.JsonBenchmark
```
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
This is just renaming.
Closes#26008 from dongjoon-hyun/SPARK-RENAME-JSON.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR regenerates the `sql/core` benchmarks in JDK8/11 to compare the result. In general, we compare the ratio instead of the time. However, in this PR, the average time is compared. This PR should be considered as a rough comparison.
**A. EXPECTED CASES(JDK11 is faster in general)**
- [x] BloomFilterBenchmark (JDK11 is faster except one case)
- [x] BuiltInDataSourceWriteBenchmark (JDK11 is faster at CSV/ORC)
- [x] CSVBenchmark (JDK11 is faster except five cases)
- [x] ColumnarBatchBenchmark (JDK11 is faster at `boolean`/`string` and some cases in `int`/`array`)
- [x] DatasetBenchmark (JDK11 is faster with `string`, but is slower for `long` type)
- [x] ExternalAppendOnlyUnsafeRowArrayBenchmark (JDK11 is faster except two cases)
- [x] ExtractBenchmark (JDK11 is faster except HOUR/MINUTE/SECOND/MILLISECONDS/MICROSECONDS)
- [x] HashedRelationMetricsBenchmark (JDK11 is faster)
- [x] JSONBenchmark (JDK11 is much faster except eight cases)
- [x] JoinBenchmark (JDK11 is faster except five cases)
- [x] OrcNestedSchemaPruningBenchmark (JDK11 is faster in nine cases)
- [x] PrimitiveArrayBenchmark (JDK11 is faster)
- [x] SortBenchmark (JDK11 is faster except `Arrays.sort` case)
- [x] UDFBenchmark (N/A, values are too small)
- [x] UnsafeArrayDataBenchmark (JDK11 is faster except one case)
- [x] WideTableBenchmark (JDK11 is faster except two cases)
**B. CASES WE NEED TO INVESTIGATE MORE LATER**
- [x] AggregateBenchmark (JDK11 is slower in general)
- [x] CompressionSchemeBenchmark (JDK11 is slower in general except `string`)
- [x] DataSourceReadBenchmark (JDK11 is slower in general)
- [x] DateTimeBenchmark (JDK11 is slightly slower in general except `parsing`)
- [x] MakeDateTimeBenchmark (JDK11 is slower except two cases)
- [x] MiscBenchmark (JDK11 is slower except ten cases)
- [x] OrcV2NestedSchemaPruningBenchmark (JDK11 is slower)
- [x] ParquetNestedSchemaPruningBenchmark (JDK11 is slower except six cases)
- [x] RangeBenchmark (JDK11 is slower except one case)
`FilterPushdownBenchmark/InExpressionBenchmark/WideSchemaBenchmark` will be compared later because it took long timer.
### Why are the changes needed?
According to the result, there are some difference between JDK8/JDK11.
This will be a baseline for the future improvement and comparison. Also, as a reproducible environment, the following environment is used.
- Instance: `r3.xlarge`
- OS: `CentOS Linux release 7.5.1804 (Core)`
- JDK:
- `OpenJDK Runtime Environment (build 1.8.0_222-b10)`
- `OpenJDK Runtime Environment 18.9 (build 11.0.4+11-LTS)`
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
This is a test-only PR. We need to run benchmark.
Closes#26003 from dongjoon-hyun/SPARK-29320.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Scala 2.13 removes the parallel collections classes to a separate library, so first, this establishes a `scala-2.13` profile to bring it back, for future use.
However the library enables use of `.par` implicit conversions via a new class that is not in 2.12, which makes cross-building hard. This implements a suggested workaround from https://github.com/scala/scala-parallel-collections/issues/22 to avoid `.par` entirely.
### Why are the changes needed?
To compile for 2.13 and later to work with 2.13.
### Does this PR introduce any user-facing change?
Should not, no.
### How was this patch tested?
Existing tests.
Closes#25980 from srowen/SPARK-29296.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request?
LOAD DATA command resolves the partition column name as case sensitive manner,
where as in insert commandthe partition column name will be resolved using
the SQLConf resolver where the names will be resolved based on `spark.sql.caseSensitive` property. Same logic can be applied for resolving the partition column names in LOAD COMMAND.
### Why are the changes needed?
It's to handle the partition column name correctly according to the configuration.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Existing UT and manual testing.
Closes#24903 from sujith71955/master_paritionColName.
Lead-authored-by: s71955 <sujithchacko.2010@gmail.com>
Co-authored-by: sujith71955 <sujithchacko.2010@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR proposes to avoid abstract classes introduced at https://github.com/apache/spark/pull/24965 but instead uses trait and object.
- `abstract class BaseArrowPythonRunner` -> `trait PythonArrowOutput` to allow mix-in
**Before:**
```
BasePythonRunner
├── BaseArrowPythonRunner
│ ├── ArrowPythonRunner
│ └── CoGroupedArrowPythonRunner
├── PythonRunner
└── PythonUDFRunner
```
**After:**
```
└── BasePythonRunner
├── ArrowPythonRunner
├── CoGroupedArrowPythonRunner
├── PythonRunner
└── PythonUDFRunner
```
- `abstract class BasePandasGroupExec ` -> `object PandasGroupUtils` to decouple
**Before:**
```
└── BasePandasGroupExec
├── FlatMapGroupsInPandasExec
└── FlatMapCoGroupsInPandasExec
```
**After:**
```
├── FlatMapGroupsInPandasExec
└── FlatMapCoGroupsInPandasExec
```
### Why are the changes needed?
The problem is that R code path is being matched with Python side:
**Python:**
```
└── BasePythonRunner
├── ArrowPythonRunner
├── CoGroupedArrowPythonRunner
├── PythonRunner
└── PythonUDFRunner
```
**R:**
```
└── BaseRRunner
├── ArrowRRunner
└── RRunner
```
I would like to match the hierarchy and decouple other stuff for now if possible. Ideally we should deduplicate both code paths. Internal implementation is also similar intentionally.
`BasePandasGroupExec` case is similar as well. R (with Arrow optimization, in particular) has some duplicated codes with Pandas UDFs.
`FlatMapGroupsInRWithArrowExec` <> `FlatMapGroupsInPandasExec`
`MapPartitionsInRWithArrowExec` <> `ArrowEvalPythonExec`
In order to prepare deduplication here as well, it might better avoid changing hierarchy alone in Python side.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Locally tested existing tests. Jenkins tests should verify this too.
Closes#25989 from HyukjinKwon/SPARK-29317.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Lambda functions to array `filter` can now take as input the index as well as the element. This behavior matches array `transform`.
### Why are the changes needed?
See JIRA. It's generally useful, and particularly so if you're working with fixed length arrays.
### Does this PR introduce any user-facing change?
Previously filter lambdas had to look like
`filter(arr, el -> whatever)`
Now, lambdas can take an index argument as well
`filter(array, (el, idx) -> whatever)`
### How was this patch tested?
I added unit tests to `HigherOrderFunctionsSuite`.
Closes#25666 from henrydavidge/filter-idx.
Authored-by: Henry D <henrydavidge@gmail.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
### What changes were proposed in this pull request?
This PR exposes USE CATALOG/USE SQL commands as described in this [SPIP](https://docs.google.com/document/d/1jEcvomPiTc5GtB9F7d2RTVVpMY64Qy7INCA_rFEd9HQ/edit#)
It also exposes `currentCatalog` in `CatalogManager`.
Finally, it changes `SHOW NAMESPACES` and `SHOW TABLES` to use the current catalog if no catalog is specified (instead of default catalog).
### Why are the changes needed?
There is currently no mechanism to change current catalog/namespace thru SQL commands.
### Does this PR introduce any user-facing change?
Yes, you can perform the following:
```scala
// Sets the current catalog to 'testcat'
spark.sql("USE CATALOG testcat")
// Sets the current catalog to 'testcat' and current namespace to 'ns1.ns2'.
spark.sql("USE ns1.ns2 IN testcat")
// Now, the following will use 'testcat' as the current catalog and 'ns1.ns2' as the current namespace.
spark.sql("SHOW NAMESPACES")
```
### How was this patch tested?
Added new unit tests.
Closes#25771 from imback82/use_namespace.
Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
In the PR, I propose to specify the save mode explicitly while writing to the `noop` datasource in benchmarks. I set `Overwrite` mode in the following benchmarks:
- JsonBenchmark
- CSVBenchmark
- UDFBenchmark
- MakeDateTimeBenchmark
- ExtractBenchmark
- DateTimeBenchmark
- NestedSchemaPruningBenchmark
### Why are the changes needed?
Otherwise writing to `noop` fails with:
```
[error] Exception in thread "main" org.apache.spark.sql.AnalysisException: TableProvider implementation noop cannot be written with ErrorIfExists mode, please use Append or Overwrite modes instead.;
[error] at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:284)
```
most likely due to https://github.com/apache/spark/pull/25876
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
I generated results of `ExtractBenchmark` via the command:
```
SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.execution.benchmark.ExtractBenchmark"
```
Closes#25988 from MaxGekk/noop-overwrite-mode.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Added new expression `SecondWithFraction` which produces the `seconds` part of timestamps/dates with fractional part containing microseconds. This expression is used only in the `DatePart` expression. As the result, `date_part()` and `extract` return seconds and microseconds as the fractional part of the seconds part when `field` is `SECOND` (or synonyms).
### Why are the changes needed?
The `date_part()` and `extract` were added to maintain feature parity with PostgreSQL which has different behavior for the `SECOND` value of the `field` parameter. The fix is needed to behave in the same way. Here is PostgreSQL's output:
```sql
# SELECT date_part('SECONDS', timestamp'2019-10-01 00:00:01.000001');
date_part
-----------
1.000001
(1 row)
```
### Does this PR introduce any user-facing change?
Yes, type of `date_part('SECOND', ...)` is changed from `INT` to `DECIMAL(8, 6)`.
Before:
```sql
spark-sql> SELECT date_part('SECONDS', '2019-10-01 00:00:01.000001');
1
```
After:
```sql
spark-sql> SELECT date_part('SECONDS', '2019-10-01 00:00:01.000001');
1.000001
```
### How was this patch tested?
- Added new tests to `DateExpressionSuite` for the `SecondWithFraction` expression
- Regenerated results of `date_part.sql`, `extract.sql` and `timestamp.sql`
- Updated results of `ExtractBenchmark`
Closes#25986 from MaxGekk/extract-seconds-from-timestamp.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
For issue mentioned in [SPARK-29022](https://issues.apache.org/jira/browse/SPARK-29022)
Spark SQL CLI can't use class as serde class in jars add by SQL `ADD JAR`.
When we create table with `serde` class contains by jar added by SQL 'ADD JAR'.
We can create table with `serde` class construct success since we call `HiveClientImpl.createTable` under `withHiveState` method, it will add `clientLoader.classLoader` to `HiveClientImpl.state.getConf.classLoader`.
Jars added by SQL `ADD JAR` will be add to
1. `sparkSession.sharedState.jarClassLoader`.
2. 'HiveClientLoader.clientLoader.classLoader'
In Current spark-sql MODE, `HiveClientImpl.state` will use CliSessionState created when initialize
SparkSQLCliDriver, When we select data from table, it will check `serde` class, when call method `HiveTableScanExec#addColumnMetadataToConf()` to check for table desc serde class.
```
val deserializer = tableDesc.getDeserializerClass.getConstructor().newInstance()
deserializer.initialize(hiveConf, tableDesc.getProperties)
```
`getDeserializer` will use CliSessionState's hiveConf's classLoader in `Spark SQL CLI` mode.
But when we call `ADD JAR` in spark, the jar won't be added to `Classloader of CliSessionState' conf `, then `ClassNotFound` error happen.
So we reset `CliSessionState conf's classLoader ` to `sharedState.jarClassLoader` when `sharedState.jarClassLoader` has added jar passed by `HIVEAUXJARS`
Then when we use `ADD JAR ` to add jar, jar path will be added to CliSessionState's conf's ClassLoader
### Why are the changes needed?
Fix bug
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
ADD UT
Closes#25729 from AngersZhuuuu/SPARK-29015.
Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request?
This PR aims to remove `scalatest` deprecation warnings with the following changes.
- `org.scalatest.mockito.MockitoSugar` -> `org.scalatestplus.mockito.MockitoSugar`
- `org.scalatest.selenium.WebBrowser` -> `org.scalatestplus.selenium.WebBrowser`
- `org.scalatest.prop.Checkers` -> `org.scalatestplus.scalacheck.Checkers`
- `org.scalatest.prop.GeneratorDrivenPropertyChecks` -> `org.scalatestplus.scalacheck.ScalaCheckDrivenPropertyChecks`
### Why are the changes needed?
According to the Jenkins logs, there are 118 warnings about this.
```
grep "is deprecated" ~/consoleText | grep scalatest | wc -l
118
```
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
After Jenkins passes, we need to check the Jenkins log.
Closes#25982 from dongjoon-hyun/SPARK-29307.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Check schema fields to see if they contain the exact column name, add to error message in DataSet#resolve
Add test for extra error message piece
Adds an additional check in `DataSet#resolve`, in the else clause (i.e. column not resolved), that appends a suffix to the error message for the `AnalysisException` if that column name is literally found in the schema fields, to suggest to the user that it might need to be quoted via backticks.
### Why are the changes needed?
Forgetting to quote such column names is a common occurrence for new Spark users.
### Does this PR introduce any user-facing change?
No (other than the extra suffix on the error message).
### How was this patch tested?
`test` was run for `core` in `sbt`, and passed.
Closes#25807 from jeff303/SPARK-25153.
Authored-by: Jeff Evans <jeffrey.wayne.evans@gmail.com>
Signed-off-by: Holden Karau <hkarau@apple.com>
### What changes were proposed in this pull request?
This PR regenerate the benchmark results in `catalyst` and `avro` module in order to compare JDK8/JDK11 result.
### Why are the changes needed?
This PR aims to verify that there is no regression on JDK11.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
This is a test-only update. We need to run the benchmark manually.
Closes#25972 from dongjoon-hyun/SPARK-29300.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Scala 2.13 emits a deprecation warning for procedure-like declarations:
```
def foo() {
...
```
This is equivalent to the following, so should be changed to avoid a warning:
```
def foo(): Unit = {
...
```
### Why are the changes needed?
It will avoid about a thousand compiler warnings when we start to support Scala 2.13. I wanted to make the change in 3.0 as there are less likely to be back-ports from 3.0 to 2.4 than 3.1 to 3.0, for example, minimizing that downside to touching so many files.
Unfortunately, that makes this quite a big change.
### Does this PR introduce any user-facing change?
No behavior change at all.
### How was this patch tested?
Existing tests.
Closes#25968 from srowen/SPARK-29291.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
Follow up from https://github.com/apache/spark/pull/24981 incorporating some comments from HyukjinKwon.
Specifically:
- Adding `CoGroupedData` to `pyspark/sql/__init__.py __all__` so that documentation is generated.
- Added pydoc, including example, for the use case whereby the user supplies a cogrouping function including a key.
- Added the boilerplate for doctests to cogroup.py. Note that cogroup.py only contains the apply() function which has doctests disabled as per the other Pandas Udfs.
- Restricted the newly exposed RelationalGroupedDataset constructor parameters to access only by the sql package.
- Some minor formatting tweaks.
This was tested by running the appropriate unit tests. I'm unsure as to how to check that my change will cause the documentation to be generated correctly, but it someone can describe how I can do this I'd be happy to check.
Closes#25939 from d80tb7/SPARK-27463-fixes.
Authored-by: Chris Martin <chris@cmartinit.co.uk>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
Please refer [the link on dev. mailing list](https://lists.apache.org/thread.html/cc6489a19316e7382661d305fabd8c21915e5faf6a928b4869ac2b4a%3Cdev.spark.apache.org%3E) to see rationalization of this patch.
This patch adds the functionality to detect the possible correct issue on multiple stateful operations in single streaming query and logs warning message to inform end users.
This patch also documents some notes to inform caveats when using multiple stateful operations in single query, and provide one known alternative.
## How was this patch tested?
Added new UTs in UnsupportedOperationsSuite to test various combination of stateful operators on streaming query.
Closes#24890 from HeartSaVioR/SPARK-28074.
Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request?
This patch adds AliasIdentifier to the list of classes that should be converted to Json in TreeNode.shouldConvertToJson.
### Why are the changes needed?
When asking prettyJson of an analyzed query plan which contains SubqueryAlias. The field of name of SubqueryAlias is "null", like:
```
[ {
"class" : "org.apache.spark.sql.catalyst.plans.logical.SubqueryAlias",
"num-children" : 1,
"name" : null,
"child" : 0
}, {
"class" : "org.apache.spark.sql.catalyst.plans.logical.Project",
...
```
Seems the alias name was in the Json before SPARK-19602.
It is fixed by this patch:
```
[ {
"class" : "org.apache.spark.sql.catalyst.plans.logical.SubqueryAlias",
"num-children" : 1,
"name" : {
"product-class" : "org.apache.spark.sql.catalyst.AliasIdentifier",
"identifier" : "t1"
},
"child" : 0
}, {
"class" : "org.apache.spark.sql.catalyst.plans.logical.Project",
...
```
### Does this PR introduce any user-facing change?
Yes. This patch changes null value of name field of SubqueryAlias in Json string to the alias identifier.
### How was this patch tested?
Added unit test.
Closes#25959 from viirya/SPARK-29186.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <liangchi@uber.com>
### What changes were proposed in this pull request?
Some of the columns of JDBC/ODBC server tab in Web UI are hard to understand.
We have documented it at SPARK-28373 but I think it is better to have some tooltips in the SQL statistics table to explain the columns
![image](https://user-images.githubusercontent.com/12819544/64489775-38e48980-d257-11e9-868a-5f5f6a0f1e46.png)
The columns with new tooltips are finish time, close time, execution time and duration
![image](https://user-images.githubusercontent.com/12819544/64489858-1141f100-d258-11e9-9e4e-fae3299da465.png)
Improvements in UIUtils can be used in other tables in WebUI to add tooltips
### Why are the changes needed?
It is interesting to improve the undestanding of the WebUI
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Unit tests are added and manual test.
Closes#25723 from planga82/feature/SPARK-29019_tooltipjdbcServer.
Lead-authored-by: Unknown <soypab@gmail.com>
Co-authored-by: Pablo <soypab@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request?
HiveClientImpl may be log sensitive information. e.g. url, secret and token:
```scala
logDebug(
s"""
|Applying Hadoop/Hive/Spark and extra properties to Hive Conf:
|$k=${if (k.toLowerCase(Locale.ROOT).contains("password")) "xxx" else v}
""".stripMargin)
```
So redact it. Use SQLConf.get.redactOptions.
I add a new overloading function to fit this situation for one by one kv pair situation.
### Why are the changes needed?
Redact sensitive information when construct HiveClientImpl
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
MT
Run command
` /sbin/start-thriftserver.sh`
In log we can get
```
19/09/28 08:27:02 main DEBUG HiveClientImpl:
Applying Hadoop/Hive/Spark and extra properties to Hive Conf:
hive.druid.metadata.password=*********(redacted)
```
Closes#25954 from AngersZhuuuu/SPARK-29247.
Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Support the syntax of `ALTER (DATABASE|SCHEMA) database_name SET LOCATION` path. Please note that only Hive 3.x metastore support this syntax.
Ref:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDLhttps://issues.apache.org/jira/browse/HIVE-8472
### Why are the changes needed?
Support more syntax.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Unit test.
Closes#25883 from wangyum/SPARK-28476.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
### What changes were proposed in this pull request?
Added try exception
### Why are the changes needed?
The behaviors of run commands during exception handling are different depends on explain command. I think it should be unified.
[ >spark.sql("explain cost select * from hoge").show(false) ]
![cost](https://user-images.githubusercontent.com/55128575/65225389-09a80500-db00-11e9-9246-0f1a3a881595.png)
[ >spark.sql("explain extended select * from hoge").show(false) ]
![extemded](https://user-images.githubusercontent.com/55128575/65225430-188eb780-db00-11e9-99bf-ff550b2ffd12.png)
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
tested manually
Closes#25848 from TomokoKomiyama/fix-explain.
Authored-by: TomokoKomiyama <btkomiyamatm@oss.nttdata.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request?
This PR moves Hive test jars(`hive-contrib-*.jar` and `hive-hcatalog-core-*.jar`) from maven dependency to local file.
### Why are the changes needed?
`--jars` can't be tested since `hive-contrib-*.jar` and `hive-hcatalog-core-*.jar` are already in classpath.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
manual test
Closes#25690 from wangyum/SPARK-27831-revert.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Yuming Wang <wgyumg@gmail.com>
### What changes were proposed in this pull request?
The `SET` commands do not contain the `_FUNC_` pattern a priori. In the PR, I propose filter out such commands in the `using _FUNC_ instead of function names in examples` test.
### Why are the changes needed?
After the merge of https://github.com/apache/spark/pull/25942, examples will require particular settings. Currently, the whole expression example has to be ignored which is so much. It makes sense to ignore only `SET` commands in expression examples.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
By running the `using _FUNC_ instead of function names in examples` test.
Closes#25958 from MaxGekk/dont-check-_FUNC_-in-set.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This patch fixes examples of Like/RLike to test its origin intention correctly. The example doesn't consider the default value of spark.sql.parser.escapedStringLiterals: it's false by default.
Please take a look at current example of Like:
d72f39897b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala (L97-L106)
If spark.sql.parser.escapedStringLiterals=false, then it should fail as there's `\U` in pattern (spark.sql.parser.escapedStringLiterals=false by default) but it doesn't fail.
```
The escape character is '\'. If an escape character precedes a special symbol or another
escape character, the following character is matched literally. It is invalid to escape
any other character.
```
For the query
```
SET spark.sql.parser.escapedStringLiterals=false;
SELECT '%SystemDrive%\Users\John' like '\%SystemDrive\%\Users%';
```
SQL parser removes single `\` (not sure that is intended) so the expressions of Like are constructed as following (I've printed out expression of left and right for Like/RLike):
> LIKE - left `%SystemDrive%UsersJohn` / right `\%SystemDrive\%Users%`
which are no longer having origin intention (see left).
Below query tests the origin intention:
```
SET spark.sql.parser.escapedStringLiterals=false;
SELECT '%SystemDrive%\\Users\\John' like '\%SystemDrive\%\\\\Users%';
```
> LIKE - left `%SystemDrive%\Users\John` / right `\%SystemDrive\%\\Users%`
Note that `\\\\` is needed in pattern as `StringUtils.escapeLikeRegex` requires `\\` to represent normal character of `\`.
Same for RLIKE:
```
SET spark.sql.parser.escapedStringLiterals=true;
SELECT '%SystemDrive%\Users\John' rlike '%SystemDrive%\\Users.*';
```
> RLIKE - left `%SystemDrive%\Users\John` / right `%SystemDrive%\\Users.*`
which is OK, but
```
SET spark.sql.parser.escapedStringLiterals=false;
SELECT '%SystemDrive%\Users\John' rlike '%SystemDrive%\Users.*';
```
> RLIKE - left `%SystemDrive%UsersJohn` / right `%SystemDrive%Users.*`
which no longer haves origin intention.
Below query tests the origin intention:
```
SET spark.sql.parser.escapedStringLiterals=true;
SELECT '%SystemDrive%\\Users\\John' rlike '%SystemDrive%\\\\Users.*';
```
> RLIKE - left `%SystemDrive%\Users\John` / right `%SystemDrive%\\Users.*`
### Why are the changes needed?
Because the example doesn't test the origin intention. Spark is now running automated tests from these examples, so now it's not only documentation issue but also test issue.
### Does this PR introduce any user-facing change?
No, as it only corrects documentation.
### How was this patch tested?
Added debug log (like above) and ran queries from `spark-sql`.
Closes#25957 from HeartSaVioR/SPARK-29281.
Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
In the PR, I propose to clone Spark session per-each expression example. Examples can modify SQL settings, and can influence on each other if they run in the same Spark session in parallel.
### Why are the changes needed?
This should fix test failures like [this](https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-3.2-jdk-11/478/testReport/junit/org.apache.spark.sql/SQLQuerySuite/check_outputs_of_expression_examples/) checking of the `Like` example:
```
org.apache.spark.sql.AnalysisException: the pattern '\%SystemDrive\%\Users%' is invalid, the escape character is not allowed to precede 'U';
at org.apache.spark.sql.catalyst.util.StringUtils$.fail$1(StringUtils.scala:48)
at org.apache.spark.sql.catalyst.util.StringUtils$.escapeLikeRegex(StringUtils.scala:57)
at org.apache.spark.sql.catalyst.expressions.Like.escape(regexpExpressions.scala:108)
```
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
By running `check outputs of expression examples` in `org.apache.spark.sql.SQLQuerySuite`
Closes#25956 from MaxGekk/fix-expr-examples-checks.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
## What changes were proposed in this pull request?
SPARK-27210 enables ManifestFileCommitProtocol to clean up incomplete output files in task level if task is aborted.
This patch extends the area of cleaning up, proposes ManifestFileCommitProtocol to clean up complete but invalid output files in job level if job aborts. Please note that this works as 'best-effort', not kind of guarantee, as we have in HadoopMapReduceCommitProtocol.
## How was this patch tested?
Added UT.
Closes#24186 from HeartSaVioR/SPARK-27254.
Lead-authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com>
Co-authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Shixiong Zhu <zsxwing@gmail.com>
### What changes were proposed in this pull request?
Hive 2.3 will set a new UDFClassLoader to hiveConf.classLoader when initializing SessionState since HIVE-11878, and
1. ADDJarCommand will add jars to clientLoader.classLoader.
2. --jar passed jar will be added to clientLoader.classLoader
3. jar passed by hive conf `hive.aux.jars` [SPARK-28954](https://github.com/apache/spark/pull/25653) [SPARK-28840](https://github.com/apache/spark/pull/25542) will be added to clientLoader.classLoader too
For these reason we cannot load the jars added by ADDJarCommand because of class loader got changed. We reset it to clientLoader.ClassLoader here.
### Why are the changes needed?
support for jdk11
### Does this PR introduce any user-facing change?
NO
### How was this patch tested?
UT
```
export JAVA_HOME=/usr/lib/jdk-11.0.3
export PATH=$JAVA_HOME/bin:$PATH
build/sbt -Phive-thriftserver -Phadoop-3.2
hive/test-only *HiveSparkSubmitSuite -- -z "SPARK-8368: includes jars passed in through --jars"
hive-thriftserver/test-only *HiveThriftBinaryServerSuite -- -z "test add jar"
```
Closes#25775 from AngersZhuuuu/SPARK-29015-STS-JDK11.
Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request?
New test compares outputs of expression examples in comments with results of `hiveResultString()`. Also I fixed existing examples where actual and expected outputs are different.
### Why are the changes needed?
This prevents mistakes in expression examples, and fixes existing mistakes in comments.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Add new test to `SQLQuerySuite`.
Closes#25942 from MaxGekk/run-expr-examples.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Added a new config "spark.sql.additionalRemoteRepositories", a comma-delimited string config of the optional additional remote maven mirror.
### Why are the changes needed?
We need to connect the Maven repositories in IsolatedClientLoader for downloading Hive jars,
end-users can set this config if the default maven central repo is unreachable.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Existing UT.
Closes#25849 from xuanyuanking/SPARK-29175.
Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Call fs.exists only when necessary in InsertIntoHadoopFsRelationCommand.
### Why are the changes needed?
When saving a dataframe into Hadoop, spark first checks if the file exists before inspecting the SaveMode to determine if it should actually insert data. However, the pathExists variable is actually not used in the case of SaveMode.Append. In some file systems, the exists call can be expensive and hence this PR makes that call only when necessary.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Existing unit tests should cover it since this doesn't change the behavior.
Closes#25928 from rahij/rr/exists-upstream.
Authored-by: Rahij Ramsharan <rramsharan@palantir.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
After https://github.com/apache/spark/pull/25158 and https://github.com/apache/spark/pull/25458, SQL features of PostgreSQL are introduced into Spark. AFAIK, both features are implementation-defined behaviors, which are not specified in ANSI SQL.
In such a case, this proposal is to add a configuration `spark.sql.dialect` for choosing a database dialect.
After this PR, Spark supports two database dialects, `Spark` and `PostgreSQL`. With `PostgreSQL` dialect, Spark will:
1. perform integral division with the / operator if both sides are integral types;
2. accept "true", "yes", "1", "false", "no", "0", and unique prefixes as input and trim input for the boolean data type.
### Why are the changes needed?
Unify the external database dialect with one configuration, instead of small flags.
### Does this PR introduce any user-facing change?
A new configuration `spark.sql.dialect` for choosing a database dialect.
### How was this patch tested?
Existing tests.
Closes#25697 from gengliangwang/dialect.
Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Rename the package pgSQL to postgreSQL
### Why are the changes needed?
To address the comment in https://github.com/apache/spark/pull/25697#discussion_r328431070 . The official full name seems more reasonable.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Existing unit tests.
Closes#25936 from gengliangwang/renamePGSQL.
Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Yuming Wang <wgyumg@gmail.com>
### What changes were proposed in this pull request?
It is very confusing that the default save mode is different between the internal implementation of a Data source. The reason that we had to have saveModeForDSV2 was that there was no easy way to check the existence of a Table in DataSource v2. Now, we have catalogs for that. Therefore we should be able to remove the different save modes. We also have a plan forward for `save`, where we can't really check the existence of a table, and therefore create one. That will come in a future PR.
### Why are the changes needed?
Because it is confusing that the internal implementation of a data source (which is generally non-obvious to users) decides which default save mode is used within Spark.
### Does this PR introduce any user-facing change?
It changes the default save mode for V2 Tables in the DataFrameWriter APIs
### How was this patch tested?
Existing tests
Closes#25876 from brkyvz/removeSM.
Lead-authored-by: Burak Yavuz <brkyvz@gmail.com>
Co-authored-by: Burak Yavuz <burak@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This patch proposes to skip PlanExpression when doing subexpression elimination on executors.
### Why are the changes needed?
Subexpression elimination can possibly cause NPE when applying on execution subquery expression like ScalarSubquery on executors. It is because PlanExpression wraps query plan. To compare query plan on executor when eliminating subexpression, can cause unexpected error, like NPE when accessing transient fields.
The NPE looks like:
```
[info] - SPARK-29239: Subquery should not cause NPE when eliminating subexpression *** FAILED *** (175 milliseconds)
[info] org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1395.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1395.0 (TID 3447, 10.0.0.196, executor driver): java.lang.NullPointerException
[info] at org.apache.spark.sql.execution.LocalTableScanExec.stringArgs(LocalTableScanExec.scala:62)
[info] at org.apache.spark.sql.catalyst.trees.TreeNode.argString(TreeNode.scala:506)
[info] at org.apache.spark.sql.catalyst.trees.TreeNode.simpleString(TreeNode.scala:534)
[info] at org.apache.spark.sql.catalyst.plans.QueryPlan.simpleString(QueryPlan.scala:179)
[info] at org.apache.spark.sql.catalyst.plans.QueryPlan.verboseString(QueryPlan.scala:181)
[info] at org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:647)
[info] at org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:675)
[info] at org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:675)
[info] at org.apache.spark.sql.catalyst.trees.TreeNode.treeString(TreeNode.scala:569)
[info] at org.apache.spark.sql.catalyst.trees.TreeNode.treeString(TreeNode.scala:559)
[info] at org.apache.spark.sql.catalyst.trees.TreeNode.treeString(TreeNode.scala:551)
[info] at org.apache.spark.sql.catalyst.trees.TreeNode.toString(TreeNode.scala:548)
[info] at org.apache.spark.sql.catalyst.errors.package$TreeNodeException.<init>(package.scala:36)
[info] at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
[info] at org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:436)
[info] at org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:425)
[info] at org.apache.spark.sql.execution.SparkPlan.makeCopy(SparkPlan.scala:102)
[info] at org.apache.spark.sql.execution.SparkPlan.makeCopy(SparkPlan.scala:63)
[info] at org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:132)
[info] at org.apache.spark.sql.catalyst.plans.QueryPlan.doCanonicalize(QueryPlan.scala:261)
```
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Added unit test.
Closes#25925 from viirya/SPARK-29239.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Don't allow calling append, overwrite, or overwritePartitions after tableProperty is used in DataFrameWriterV2 because table properties are not set as part of operations on existing tables. Only tables that are created or replaced can set table properties.
### Why are the changes needed?
The properties are discarded otherwise, so this avoids confusing behavior.
### Does this PR introduce any user-facing change?
Yes, but to a new API, DataFrameWriterV2.
### How was this patch tested?
Removed test cases that used this method and the append, etc. methods because they no longer compile.
Closes#25931 from rdblue/fix-dfw-v2-table-properties.
Authored-by: Ryan Blue <blue@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
In the PR, I propose to replace function names in some expression examples by `_FUNC_`, and add a test to check that `_FUNC_` always present in all examples.
### Why are the changes needed?
Binding of a function name to an expression is performed in `FunctionRegistry` which is single source of truth. Expression examples should avoid using function name directly because this can make the examples invalid in the future.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Added new test to `SQLQuerySuite` which analyses expression example, and check presence of `_FUNC_`.
Closes#25924 from MaxGekk/fix-func-examples.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
when the current catalog is session catalog, get/set the current namespace from/to the `SessionCatalog`.
### Why are the changes needed?
It's super confusing that we don't have a single source of truth for the current namespace of the session catalog. It can be in `CatalogManager` or `SessionCatalog`.
Ideally, we should always track the current catalog/namespace in `CatalogManager`. However, there are many commands that do not support v2 catalog API. They ignore the current catalog in `CatalogManager` and blindly go to `SessionCatalog`. This means, we must keep track of the current namespace of session catalog even if the current catalog is not session catalog.
Thus, we can't use `CatalogManager` to track the current namespace of session catalog because it changes when the current catalog is changed. To keep single source of truth, we should only track the current namespace of session catalog in `SessionCatalog`.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Newly added and updated test cases.
Closes#25903 from cloud-fan/current.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
### What changes were proposed in this pull request?
Copy any "spark.hive.foo=bar" spark properties into hadoop conf as "hive.foo=bar"
### Why are the changes needed?
Providing spark side config entry for hive configurations.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
UT.
Closes#25661 from WeichenXu123/add_hive_conf.
Authored-by: WeichenXu <weichen.xu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Change the remote repo used in IsolatedClientLoader from datanucleus to google mirror.
### Why are the changes needed?
We need to connect the Maven repositories in IsolatedClientLoader for downloading Hive jars. The repository currently used is "http://www.datanucleus.org/downloads/maven2", which is [no longer maintained](http://www.datanucleus.org:15080/downloads/maven2/README.txt). This will cause downloading failure and make hive test cases flaky while Jenkins host is blocked by maven central repo.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Existing UT.
Closes#25915 from xuanyuanking/SPARK-29229.
Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Move the rule `RemoveAllHints` after the batch `Resolution`.
### Why are the changes needed?
User-defined hints can be resolved by the rules injected via `extendedResolutionRules` or `postHocResolutionRules`.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Added a test case
Closes#25746 from gatorsmile/moveRemoveAllHints.
Authored-by: Xiao Li <gatorsmile@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Added "array indices start at 1" in annotation to make it clear for the usage of function slice, in R Scala Python component
### Why are the changes needed?
It will throw exception if the value stare is 0, but array indices start at 0 most of times in other scenarios.
### Does this PR introduce any user-facing change?
Yes, more info provided to user.
### How was this patch tested?
No tests added, only doc change.
Closes#25704 from sheepstop/master.
Authored-by: sheepstop <yangting617@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR enable `ThriftServerQueryTestSuite` and fix previously flaky test by:
1. Start thriftserver in `beforeAll()`.
2. Disable `spark.sql.hive.thriftServer.async`.
### Why are the changes needed?
Improve test coverage.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
```shell
build/sbt "hive-thriftserver/test-only *.ThriftServerQueryTestSuite " -Phive-thriftserver
build/mvn -Dtest=none -DwildcardSuites=org.apache.spark.sql.hive.thriftserver.ThriftServerQueryTestSuite test -Phive-thriftserver
```
Closes#25868 from wangyum/SPARK-28527-enable.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Yuming Wang <wgyumg@gmail.com>
## What changes were proposed in this pull request?
partition path should be qualified to store in catalog.
There are some scenes:
1. ALTER TABLE t PARTITION(b=1) SET LOCATION '/path/x'
should be qualified: file:/path/x
**Hive 2.0.0 does not support for location without schema here.**
```
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. {0} is not absolute or has no scheme information. Please specify a complete absolute uri with scheme information.
```
2. ALTER TABLE t PARTITION(b=1) SET LOCATION 'x'
should be qualified: file:/tablelocation/x
**Hive 2.0.0 does not support for relative location here.**
3. ALTER TABLE t ADD PARTITION(b=1) LOCATION '/path/x'
should be qualified: file:/path/x
**the same with Hive 2.0.0**
4. ALTER TABLE t ADD PARTITION(b=1) LOCATION 'x'
should be qualified: file:/tablelocation/x
**the same with Hive 2.0.0**
Currently only ALTER TABLE t ADD PARTITION(b=1) LOCATION for hive serde table has the expected qualified path. we should make other scenes to be consist with it.
Another change is for alter table location.
## How was this patch tested?
add / modify existing TestCases
Closes#17254 from windpiger/qualifiedPartitionPath.
Authored-by: windpiger <songjun@outlook.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR reduce shuffle partitions from 200 to 4 in `SQLQueryTestSuite` to reduce testing time.
### Why are the changes needed?
Reduce testing time.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Manually tested in my local:
Before:
```
...
[info] - subquery/in-subquery/in-joins.sql (6 minutes, 19 seconds)
[info] - subquery/in-subquery/not-in-joins.sql (2 minutes, 17 seconds)
[info] - subquery/scalar-subquery/scalar-subquery-predicate.sql (45 seconds, 763 milliseconds)
...
Run completed in 1 hour, 22 minutes.
```
After:
```
...
[info] - subquery/in-subquery/in-joins.sql (1 minute, 12 seconds)
[info] - subquery/in-subquery/not-in-joins.sql (27 seconds, 541 milliseconds)
[info] - subquery/scalar-subquery/scalar-subquery-predicate.sql (17 seconds, 360 milliseconds)
...
Run completed in 47 minutes.
```
Closes#25891 from wangyum/SPARK-29203.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Yuming Wang <wgyumg@gmail.com>
### What changes were proposed in this pull request?
Discuss in https://github.com/apache/spark/pull/25611
If cancel() and close() is called very quickly after the query is started, then they may both call cleanup() before Spark Jobs are started. Then sqlContext.sparkContext.cancelJobGroup(statementId) does nothing.
But then the execute thread can start the jobs, and only then get interrupted and exit through here. But then it will exit here, and no-one will cancel these jobs and they will keep running even though this execution has exited.
So when execute() was interrupted by `cancel()`, when get into catch block, we should call canJobGroup again to make sure the job was canceled.
### Why are the changes needed?
### Does this PR introduce any user-facing change?
NO
### How was this patch tested?
MT
Closes#25743 from AngersZhuuuu/SPARK-29036.
Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Yuming Wang <wgyumg@gmail.com>
### What changes were proposed in this pull request?
This PR supports UPDATE in the parser and add the corresponding logical plan. The SQL syntax is a standard UPDATE statement:
```
UPDATE tableName tableAlias SET colName=value [, colName=value]+ WHERE predicate?
```
### Why are the changes needed?
With this change, we can start to implement UPDATE in builtin sources and think about how to design the update API in DS v2.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
New test cases added.
Closes#25626 from xianyinxin/SPARK-28892.
Authored-by: xy_xin <xianyin.xxy@alibaba-inc.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This pr proposes to check method bytecode size in `BenchmarkQueryTest`. This metric is critical for performance numbers.
### Why are the changes needed?
For performance checks
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
N/A
Closes#25788 from maropu/CheckMethodSize.
Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR add support sorting `Execution Time` and `Duration` columns for `ThriftServerSessionPage`.
### Why are the changes needed?
Previously, it's not sorted correctly.
### Does this PR introduce any user-facing change?
Yes.
### How was this patch tested?
Manually do the following and test sorting on those columns in the Spark Thrift Server Session Page.
```
$ sbin/start-thriftserver.sh
$ bin/beeline -u jdbc:hive2://localhost:10000
0: jdbc:hive2://localhost:10000> create table t(a int);
+---------+--+
| Result |
+---------+--+
+---------+--+
No rows selected (0.521 seconds)
0: jdbc:hive2://localhost:10000> select * from t;
+----+--+
| a |
+----+--+
+----+--+
No rows selected (0.772 seconds)
0: jdbc:hive2://localhost:10000> show databases;
+---------------+--+
| databaseName |
+---------------+--+
| default |
+---------------+--+
1 row selected (0.249 seconds)
```
**Sorted by `Execution Time` column**:
![image](https://user-images.githubusercontent.com/5399861/65387476-53038900-dd7a-11e9-885c-fca80287f550.png)
**Sorted by `Duration` column**:
![image](https://user-images.githubusercontent.com/5399861/65387481-6e6e9400-dd7a-11e9-9318-f917247efaa8.png)
Closes#25892 from wangyum/SPARK-28599.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR aims to add tag `ExtendedSQLTest` for `SQLQueryTestSuite`.
This doesn't affect our Jenkins test coverage.
Instead, this tag gives us an ability to parallelize them by splitting this test suite and the other suites.
### Why are the changes needed?
`SQLQueryTestSuite` takes 45 mins alone because it has many SQL scripts to run.
<img width="906" alt="time" src="https://user-images.githubusercontent.com/9700541/65353553-4af0f100-dba2-11e9-9f2f-386742d28f92.png">
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
```
build/sbt "sql/test-only *.SQLQueryTestSuite" -Dtest.exclude.tags=org.apache.spark.tags.ExtendedSQLTest
...
[info] SQLQueryTestSuite:
[info] ScalaTest
[info] Run completed in 3 seconds, 147 milliseconds.
[info] Total number of tests run: 0
[info] Suites: completed 1, aborted 0
[info] Tests: succeeded 0, failed 0, canceled 0, ignored 0, pending 0
[info] No tests were executed.
[info] Passed: Total 0, Failed 0, Errors 0, Passed 0
[success] Total time: 22 s, completed Sep 20, 2019 12:23:13 PM
```
Closes#25872 from dongjoon-hyun/SPARK-29191.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Rewrite
```
NOT isnull(x) -> isnotnull(x)
NOT isnotnull(x) -> isnull(x)
```
### Why are the changes needed?
Make LogicalPlan more readable and useful for query canonicalization. Make same condition equal when judge query canonicalization equal
### Does this PR introduce any user-facing change?
NO
### How was this patch tested?
Newly added UTs.
Closes#25878 from AngersZhuuuu/SPARK-29162.
Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Supported special string values for `DATE` type. They are simply notational shorthands that will be converted to ordinary date values when read. The following string values are supported:
- `epoch [zoneId]` - `1970-01-01`
- `today [zoneId]` - the current date in the time zone specified by `spark.sql.session.timeZone`.
- `yesterday [zoneId]` - the current date -1
- `tomorrow [zoneId]` - the current date + 1
- `now` - the date of running the current query. It has the same notion as `today`.
For example:
```sql
spark-sql> SELECT date 'tomorrow' - date 'yesterday';
2
```
### Why are the changes needed?
To maintain feature parity with PostgreSQL, see [8.5.1.4. Special Values](https://www.postgresql.org/docs/12/datatype-datetime.html)
### Does this PR introduce any user-facing change?
Previously, the parser fails on the special values with the error:
```sql
spark-sql> select date 'today';
Error in query:
Cannot parse the DATE value: today(line 1, pos 7)
```
After the changes, the special values are converted to appropriate dates:
```sql
spark-sql> select date 'today';
2019-09-06
```
### How was this patch tested?
- Added tests to `DateFormatterSuite` to check parsing special values from regular strings.
- Tests in `DateTimeUtilsSuite` check parsing those values from `UTF8String`
- Uncommented tests in `date.sql`
Closes#25708 from MaxGekk/datetime-special-values.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Refactoring of the `DateTimeUtils.getEpoch()` function by avoiding decimal operations that are pretty expensive, and converting the final result to the decimal type at the end.
### Why are the changes needed?
The changes improve performance of the `getEpoch()` method at least up to **20 times**.
Before:
```
Invoke extract for timestamp: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
cast to timestamp 256 277 33 39.0 25.6 1.0X
EPOCH of timestamp 23455 23550 131 0.4 2345.5 0.0X
```
After:
```
Invoke extract for timestamp: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
cast to timestamp 255 294 34 39.2 25.5 1.0X
EPOCH of timestamp 1049 1054 9 9.5 104.9 0.2X
```
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
By existing test from `DateExpressionSuite`.
Closes#25881 from MaxGekk/optimize-extract-epoch.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Changed the `DateTimeUtils.getMilliseconds()` by avoiding the decimal division, and replacing it by setting scale and precision while converting microseconds to the decimal type.
### Why are the changes needed?
This improves performance of `extract` and `date_part()` by more than **50 times**:
Before:
```
Invoke extract for timestamp: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative Invoke extract for timestamp: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
cast to timestamp 397 428 45 25.2 39.7 1.0X
MILLISECONDS of timestamp 36723 36761 63 0.3 3672.3 0.0X
```
After:
```
Invoke extract for timestamp: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
cast to timestamp 278 284 6 36.0 27.8 1.0X
MILLISECONDS of timestamp 592 606 13 16.9 59.2 0.5X
```
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
By existing test suite - `DateExpressionsSuite`
Closes#25871 from MaxGekk/optimize-epoch-millis.
Lead-authored-by: Maxim Gekk <max.gekk@gmail.com>
Co-authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This patch fixes the issue brought by [SPARK-21870](http://issues.apache.org/jira/browse/SPARK-21870): when generating code for parameter type, it doesn't consider array type in javaType. At least we have one, Spark should generate code for BinaryType as `byte[]`, but Spark create the code for BinaryType as `[B` and generated code fails compilation.
Below is the generated code which failed compilation (Line 380):
```
/* 380 */ private void agg_doAggregate_count_0([B agg_expr_1_1, boolean agg_exprIsNull_1_1, org.apache.spark.sql.catalyst.InternalRow agg_unsafeRowAggBuffer_1) throws java.io.IOException {
/* 381 */ // evaluate aggregate function for count
/* 382 */ boolean agg_isNull_26 = false;
/* 383 */ long agg_value_28 = -1L;
/* 384 */ if (!false && agg_exprIsNull_1_1) {
/* 385 */ long agg_value_31 = agg_unsafeRowAggBuffer_1.getLong(1);
/* 386 */ agg_isNull_26 = false;
/* 387 */ agg_value_28 = agg_value_31;
/* 388 */ } else {
/* 389 */ long agg_value_33 = agg_unsafeRowAggBuffer_1.getLong(1);
/* 390 */
/* 391 */ long agg_value_32 = -1L;
/* 392 */
/* 393 */ agg_value_32 = agg_value_33 + 1L;
/* 394 */ agg_isNull_26 = false;
/* 395 */ agg_value_28 = agg_value_32;
/* 396 */ }
/* 397 */ // update unsafe row buffer
/* 398 */ agg_unsafeRowAggBuffer_1.setLong(1, agg_value_28);
/* 399 */ }
```
There wasn't any test for HashAggregateExec specifically testing this, but randomized test in ObjectHashAggregateSuite could encounter this and that's why ObjectHashAggregateSuite is flaky.
### Why are the changes needed?
Without the fix, generated code from HashAggregateExec may fail compilation.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Added new UT. Without the fix, newly added UT fails.
Closes#25830 from HeartSaVioR/SPARK-29140.
Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
In the PR, I propose to change behavior of the `date_part()` function in handling `null` field, and make it the same as PostgreSQL has. If `field` parameter is `null`, the function should return `null` of the `double` type as PostgreSQL does:
```sql
# select date_part(null, date '2019-09-20');
date_part
-----------
(1 row)
# select pg_typeof(date_part(null, date '2019-09-20'));
pg_typeof
------------------
double precision
(1 row)
```
### Why are the changes needed?
The `date_part()` function was added to maintain feature parity with PostgreSQL but current behavior of the function is different in handling null as `field`.
### Does this PR introduce any user-facing change?
Yes.
Before:
```sql
spark-sql> select date_part(null, date'2019-09-20');
Error in query: null; line 1 pos 7
```
After:
```sql
spark-sql> select date_part(null, date'2019-09-20');
NULL
```
### How was this patch tested?
Add new tests to `DateFunctionsSuite for 2 cases:
- `field` = `null`, `source` = a date literal
- `field` = `null`, `source` = a date column
Closes#25865 from MaxGekk/date_part-null.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Modify the approach in `DataFrameNaFunctions.fillValue`, the new one uses `df.withColumns` which only address the columns need to be filled. After this change, there are no more ambiguous fileds detected for joined dataframe.
### Why are the changes needed?
Before this change, when you have a joined table that has the same field name from both original table, fillna will fail even if you specify a subset that does not include the 'ambiguous' fields.
```
scala> val df1 = Seq(("f1-1", "f2", null), ("f1-2", null, null), ("f1-3", "f2", "f3-1"), ("f1-4", "f2", "f3-1")).toDF("f1", "f2", "f3")
scala> val df2 = Seq(("f1-1", null, null), ("f1-2", "f2", null), ("f1-3", "f2", "f4-1")).toDF("f1", "f2", "f4")
scala> val df_join = df1.alias("df1").join(df2.alias("df2"), Seq("f1"), joinType="left_outer")
scala> df_join.na.fill("", cols=Seq("f4"))
org.apache.spark.sql.AnalysisException: Reference 'f2' is ambiguous, could be: df1.f2, df2.f2.;
```
### Does this PR introduce any user-facing change?
Yes, fillna operation will pass and give the right answer for a joined table.
### How was this patch tested?
Local test and newly added UT.
Closes#25768 from xuanyuanking/SPARK-29063.
Lead-authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Co-authored-by: Xiao Li <gatorsmile@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR allows Python toLocalIterator to prefetch the next partition while the first partition is being collected. The PR also adds a demo micro bench mark in the examples directory, we may wish to keep this or not.
### Why are the changes needed?
In https://issues.apache.org/jira/browse/SPARK-23961 / 5e79ae3b40 we changed PySpark to only pull one partition at a time. This is memory efficient, but if partitions take time to compute this can mean we're spending more time blocking.
### Does this PR introduce any user-facing change?
A new param is added to toLocalIterator
### How was this patch tested?
New unit test inside of `test_rdd.py` checks the time that the elements are evaluated at. Another test that the results remain the same are added to `test_dataframe.py`.
I also ran a micro benchmark in the examples directory `prefetch.py` which shows an improvement of ~40% in this specific use case.
>
> 19/08/16 17:11:36 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
> Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
> Setting default log level to "WARN".
> To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
> Running timers:
>
> [Stage 32:> (0 + 1) / 1]
> Results:
>
> Prefetch time:
>
> 100.228110831
>
>
> Regular time:
>
> 188.341721614
>
>
>
Closes#25515 from holdenk/SPARK-27659-allow-pyspark-tolocalitr-to-prefetch.
Authored-by: Holden Karau <hkarau@apple.com>
Signed-off-by: Holden Karau <hkarau@apple.com>
### What changes were proposed in this pull request?
Currently the checks in the Analyzer require that V2 Tables have BATCH_WRITE defined for all tables that have V1 Write fallbacks. This is confusing as these tables may not have the V2 writer interface implemented yet. This PR adds this table capability to these checks.
In addition, this allows V2 tables to leverage the V1 APIs for DataFrameWriter.save if they do extend the V1_BATCH_WRITE capability. This way, these tables can continue to receive partitioning information and also perform checks for the existence of tables, and support all SaveModes.
### Why are the changes needed?
Partitioned saves through DataFrame.write are otherwise broken for V2 tables that support the V1
write API.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
V1WriteFallbackSuite
Closes#25767 from brkyvz/bwcheck.
Authored-by: Burak Yavuz <brkyvz@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This pr is to propagate all the SQL configurations to executors in `SQLQueryTestSuite`. When the propagation enabled in the tests, a potential bug below becomes apparent;
```
CREATE TABLE num_data (id int, val decimal(38,10)) USING parquet;
....
select sum(udf(CAST(null AS Decimal(38,0)))) from range(1,4): QueryOutput(select sum(udf(CAST(null AS Decimal(38,0)))) from range(1,4),struct<>,java.lang.IllegalArgumentException
[info] requirement failed: MutableProjection cannot use UnsafeRow for output data types: decimal(38,0)) (SQLQueryTestSuite.scala:380)
```
The root culprit is that `InterpretedMutableProjection` has incorrect validation in the interpreter mode: `validExprs.forall { case (e, _) => UnsafeRow.isFixedLength(e.dataType) }`. This validation should be the same with the condition (`isMutable`) in `HashAggregate.supportsAggregate`: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala#L1126
### Why are the changes needed?
Bug fixes.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Added tests in `AggregationQuerySuite`
Closes#25831 from maropu/SPARK-29122.
Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
This is a follow-up of the [review comment](https://github.com/apache/spark/pull/25706#discussion_r321923311).
This patch unifies the default wait time to be 10 seconds as it would fit most of UTs (as they have smaller timeouts) and doesn't bring additional latency since it will return if the condition is met.
This patch doesn't touch the one which waits 100000 milliseconds (100 seconds), to not break anything unintentionally, though I'd rather questionable that we really need to wait for 100 seconds.
### Why are the changes needed?
It simplifies the test code and get rid of various heuristic values on timeout.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
CI build will test the patch, as it would be the best environment to test the patch (builds are running there).
Closes#25837 from HeartSaVioR/MINOR-unify-default-wait-time-for-wait-until-empty.
Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR aims to use `tryWithResource` for ORC file.
### Why are the changes needed?
This is a follow-up to address https://github.com/apache/spark/pull/25006#discussion_r298788206 .
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Pass the Jenkins with the existing tests.
Closes#25842 from dongjoon-hyun/SPARK-28208.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
## What changes were proposed in this pull request?
This adds a new write API as proposed in the [SPIP to standardize logical plans](https://issues.apache.org/jira/browse/SPARK-23521). This new API:
* Uses clear verbs to execute writes, like `append`, `overwrite`, `create`, and `replace` that correspond to the new logical plans.
* Only creates v2 logical plans so the behavior is always consistent.
* Does not allow table configuration options for operations that cannot change table configuration. For example, `partitionedBy` can only be called when the writer executes `create` or `replace`.
Here are a few example uses of the new API:
```scala
df.writeTo("catalog.db.table").append()
df.writeTo("catalog.db.table").overwrite($"date" === "2019-06-01")
df.writeTo("catalog.db.table").overwritePartitions()
df.writeTo("catalog.db.table").asParquet.create()
df.writeTo("catalog.db.table").partitionedBy(days($"ts")).createOrReplace()
df.writeTo("catalog.db.table").using("abc").replace()
```
## How was this patch tested?
Added `DataFrameWriterV2Suite` that tests the new write API. Existing tests for v2 plans.
Closes#25681 from rdblue/SPARK-28612-add-data-frame-writer-v2.
Authored-by: Ryan Blue <blue@apache.org>
Signed-off-by: Burak Yavuz <brkyvz@gmail.com>
### What changes were proposed in this pull request?
This patch proposes to change the log level of logging generated code in case of compile error being occurred in UT. This would help to investigate compilation issue of generated code easier, as currently we got exception message of line number but there's no generated code being logged actually (as in most cases of UT the threshold of log level is at least WARN).
### Why are the changes needed?
This would help investigating issue on compilation error for generated code in UT.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
N/A
Closes#25835 from HeartSaVioR/MINOR-always-log-generated-code-on-fail-to-compile-in-unit-testing.
Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This addresses about 15 miscellaneous warnings that appear in the current build.
### Why are the changes needed?
No functional changes, it just slightly reduces the amount of extra warning output.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Existing tests, run manually.
Closes#25852 from srowen/BuildWarnings.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Currently, there are new configurations for compatibility with ANSI SQL:
* `spark.sql.parser.ansi.enabled`
* `spark.sql.decimalOperations.nullOnOverflow`
* `spark.sql.failOnIntegralTypeOverflow`
This PR is to add new configuration `spark.sql.ansi.enabled` and remove the 3 options above. When the configuration is true, Spark tries to conform to the ANSI SQL specification. It will be disabled by default.
### Why are the changes needed?
Make it simple and straightforward.
### Does this PR introduce any user-facing change?
The new features for ANSI compatibility will be set via one configuration `spark.sql.ansi.enabled`.
### How was this patch tested?
Existing unit tests.
Closes#25693 from gengliangwang/ansiEnabled.
Lead-authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Co-authored-by: Xiao Li <gatorsmile@gmail.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
### What changes were proposed in this pull request?
Refactored SQL-related benchmark and made them depend on `SqlBasedBenchmark`. In particular, creation of Spark session are moved into `override def getSparkSession: SparkSession`.
### Why are the changes needed?
This should simplify maintenance of SQL-based benchmarks by reducing the number of dependencies. In the future, it should be easier to refactor & extend all SQL benchmarks by changing only one trait. Finally, all SQL-based benchmarks will look uniformly.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
By running the modified benchmarks.
Closes#25828 from MaxGekk/sql-benchmarks-refactoring.
Lead-authored-by: Maxim Gekk <max.gekk@gmail.com>
Co-authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
## What changes were proposed in this pull request?
This PR upgrade Scala to **2.12.10**.
Release notes:
- Fix regression in large string interpolations with non-String typed splices
- Revert "Generate shallower ASTs in pattern translation"
- Fix regression in classpath when JARs have 'a.b' entries beside 'a/b'
- Faster compiler: 5–10% faster since 2.12.8
- Improved compatibility with JDK 11, 12, and 13
- Experimental support for build pipelining and outline type checking
More details:
https://github.com/scala/scala/releases/tag/v2.12.10https://github.com/scala/scala/releases/tag/v2.12.9
## How was this patch tested?
Existing tests
Closes#25404 from wangyum/SPARK-28683.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Previous comment was true for Apache Spark 2.3.0. The 2.4.0 release brought multiple watermark policy and therefore stating that the 'min' is always chosen is misleading.
This PR updates the comments about multiple watermark policy. They aren't true anymore since in case of multiple watermarks, we can configure which one will be applied to the query. This change was brought with Apache Spark 2.4.0 release.
### Why are the changes needed?
It introduces some confusion about the real execution of the commented code.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
The tests weren't added because the change is only about the documentation level. I affirm that the contribution is my original work and that I license the work to the project under the project's open source license.
Closes#25832 from bartosz25/fix_comments_multiple_watermark_policy.
Authored-by: bartosz25 <bartkonieczny@yahoo.fr>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
## What changes were proposed in this pull request?
It upgrades ORC from 1.5.5 to 1.5.6 and adds closes the ORC readers when they aren't used to
create RecordReaders.
## How was this patch tested?
The changed unit tests were run.
Closes#25006 from omalley/spark-28208.
Lead-authored-by: Owen O'Malley <omalley@apache.org>
Co-authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
## What changes were proposed in this pull request?
Simplify the return type for `lookupV2Relation` which makes the 3 callers more straightforward.
## How was this patch tested?
Existing unit tests.
Closes#25735 from jzhuge/lookupv2relation.
Authored-by: John Zhuge <jzhuge@apache.org>
Signed-off-by: Burak Yavuz <brkyvz@gmail.com>
### What changes were proposed in this pull request?
#DataSet
fruit,color,price,quantity
apple,red,1,3
banana,yellow,2,4
orange,orange,3,5
xxx
This PR aims to fix the below
```
scala> spark.conf.set("spark.sql.csv.parser.columnPruning.enabled", false)
scala> spark.read.option("header", "true").option("mode", "DROPMALFORMED").csv("fruit.csv").count
res1: Long = 4
```
This is caused by the issue [SPARK-24645](https://issues.apache.org/jira/browse/SPARK-24645).
SPARK-24645 issue can also be solved by [SPARK-25387](https://issues.apache.org/jira/browse/SPARK-25387)
### Why are the changes needed?
SPARK-24645 caused this regression, so reverted the code as it can also be solved by SPARK-25387
### Does this PR introduce any user-facing change?
No,
### How was this patch tested?
Added UT, and also tested the bug SPARK-24645
**SPARK-24645 regression**
![image](https://user-images.githubusercontent.com/35216143/65067957-4c08ff00-d9a5-11e9-8d43-a4a23a61e8b8.png)
Closes#25820 from sandeep-katta/SPARK-29101.
Authored-by: sandeep katta <sandeep.katta2007@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Supported special string values for `TIMESTAMP` type. They are simply notational shorthands that will be converted to ordinary timestamp values when read. The following string values are supported:
- `epoch [zoneId]` - `1970-01-01 00:00:00+00 (Unix system time zero)`
- `today [zoneId]` - midnight today.
- `yesterday [zoneId]` -midnight yesterday
- `tomorrow [zoneId]` - midnight tomorrow
- `now` - current query start time.
For example:
```sql
spark-sql> SELECT timestamp 'tomorrow';
2019-09-07 00:00:00
```
### Why are the changes needed?
To maintain feature parity with PostgreSQL, see [8.5.1.4. Special Values](https://www.postgresql.org/docs/12/datatype-datetime.html)
### Does this PR introduce any user-facing change?
Previously, the parser fails on the special values with the error:
```sql
spark-sql> select timestamp 'today';
Error in query:
Cannot parse the TIMESTAMP value: today(line 1, pos 7)
```
After the changes, the special values are converted to appropriate dates:
```sql
spark-sql> select timestamp 'today';
2019-09-06 00:00:00
```
### How was this patch tested?
- Added tests to `TimestampFormatterSuite` to check parsing special values from regular strings.
- Tests in `DateTimeUtilsSuite` check parsing those values from `UTF8String`
- Uncommented tests in `timestamp.sql`
Closes#25716 from MaxGekk/timestamp-special-values.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
1. After https://github.com/apache/spark/pull/21599, if the option "spark.sql.failOnIntegralTypeOverflow" is enabled, all the Binary Arithmetic operator will used the exact version function.
However, only `Add`/`Substract`/`Multiply` has a corresponding exact function in java.lang.Math . When the option "spark.sql.failOnIntegralTypeOverflow" is enabled, a runtime exception "BinaryArithmetics must override either exactMathMethod or genCode" is thrown if the other Binary Arithmetic operators are used, such as "Divide", "Remainder".
The exact math method should be called only when there is a corresponding function in `java.lang.Math`
2. Revise the log output of casting to `Int`/`Short`
3. Enable `spark.sql.failOnIntegralTypeOverflow` for pgSQL tests in `SQLQueryTestSuite`.
### Why are the changes needed?
1. Fix the bugs of https://github.com/apache/spark/pull/21599
2. The test case of pgSQL intends to check the overflow of integer/long type. We should enable `spark.sql.failOnIntegralTypeOverflow`.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Unit test.
Closes#25804 from gengliangwang/enableIntegerOverflowInSQLTest.
Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
In this PR, I fix some annotation errors and remove meaningless annotations in project.
### Why are the changes needed?
There are some annotation errors and meaningless annotations in project.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Verified manually.
Closes#25809 from turboFei/SPARK-29113.
Authored-by: turbofei <fwang12@ebay.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
**What changes were proposed in this pull request?**
Issue 1 : modifications not required as these are different formats for the same info. In the case of a Spark DataFrame, null is correct.
Issue 2 mentioned in JIRA Spark SQL "desc formatted tablename" is not showing the header # col_name,data_type,comment , seems to be the header has been removed knowingly as part of SPARK-20954.
Issue 3:
Corrected the Last Access time, the value shall display 'UNKNOWN' as currently system wont support the last access time evaluation, since hive was setting Last access time as '0' in metastore even though spark CatalogTable last access time value set as -1. this will make the validation logic of LasAccessTime where spark sets 'UNKNOWN' value if last access time value set as -1 (means not evaluated).
**Does this PR introduce any user-facing change?**
No
**How was this patch tested?**
Locally and corrected a ut.
Attaching the test report below
![SPARK-28930](https://user-images.githubusercontent.com/12999161/64484908-83a1d980-d236-11e9-8062-9facf3003e5e.PNG)
Closes#25720 from sujith71955/master_describe_info.
Authored-by: s71955 <sujithchacko.2010@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Added new benchmarks for `make_date()` and `make_timestamp()` to detect performance issues, and figure out functions speed on foldable arguments.
- `make_date()` is benchmarked on fully foldable arguments.
- `make_timestamp()` is benchmarked on corner case `60.0`, foldable time fields and foldable date.
### Why are the changes needed?
To find out inputs where `make_date()` and `make_timestamp()` have performance problems. This should be useful in the future optimizations of the functions and users apps.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
By running the benchmark and manually checking generated dates/timestamps.
Closes#25813 from MaxGekk/make_datetime-benchmark.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This pr refines the code of DELETE, including, 1, make `whereClause` to be optional, in which case DELETE will delete all of the data of a table; 2, add more test cases; 3, some other refines.
This is a following-up of SPARK-28351.
### Why are the changes needed?
An optional where clause in DELETE respects the SQL standard.
### Does this PR introduce any user-facing change?
Yes. But since this is a non-released feature, this change does not have any end-user affects.
### How was this patch tested?
New case is added.
Closes#25652 from xianyinxin/SPARK-28950.
Authored-by: xy_xin <xianyin.xxy@alibaba-inc.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
In the PR, I propose to create an instance of `TimestampFormatter` only once at the initialization, and reuse it inside of `nullSafeEval()` and `doGenCode()` in the case when the `fmt` parameter is foldable.
### Why are the changes needed?
The changes improve performance of the `date_format()` function.
Before:
```
format date: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
format date wholestage off 7180 / 7181 1.4 718.0 1.0X
format date wholestage on 7051 / 7194 1.4 705.1 1.0X
```
After:
```
format date: Best/Avg Time(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------
format date wholestage off 4787 / 4839 2.1 478.7 1.0X
format date wholestage on 4736 / 4802 2.1 473.6 1.0X
```
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
By existing test suites `DateExpressionsSuite` and `DateFunctionsSuite`.
Closes#25782 from MaxGekk/date_format-foldable.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This patch proposes to add new tests to test the username of HiveClient to prevent changing the semantic unintentionally. The owner of Hive table has been changed back-and-forth, principal -> username -> principal, and looks like the change is not intentional. (Please refer [SPARK-28996](https://issues.apache.org/jira/browse/SPARK-28996) for more details.) This patch intends to prevent this.
This patch also renames previous HiveClientSuite(s) to HivePartitionFilteringSuite(s) as it was commented as TODO, as well as previous tests are too narrowed to test only partition filtering.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Newly added UTs.
Closes#25696 from HeartSaVioR/SPARK-28996.
Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
When InSet generates Java switch-based code, if the input set is empty, we don't generate switch condition, but a simple expression that is default case of original switch.
### Why are the changes needed?
SPARK-26205 adds an optimization to InSet that generates Java switch condition for certain cases. When the given set is empty, it is possibly that codegen causes compilation error:
```
[info] - SPARK-29100: InSet with empty input set *** FAILED *** (58 milliseconds)
[info] Code generation of input[0, int, true] INSET () failed:
[info] org.codehaus.janino.InternalCompilerException: failed to compile: org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass" in "generated.java": Compiling "apply(java.lang.Object _i)"; apply(java.lang.Object _i): Operand stack inconsistent at offset 45: Previous size 0, now 1
[info] org.codehaus.janino.InternalCompilerException: failed to compile: org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass" in "generated.java": Compiling "apply(java.lang.Object _i)"; apply(java.lang.Object _i): Operand stack inconsistent at offset 45: Previous size 0, now 1
[info] at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1308)
[info] at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1386)
[info] at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1383)
```
### Does this PR introduce any user-facing change?
Yes. Previously, when users have InSet against an empty set, generated code causes compilation error. This patch fixed it.
### How was this patch tested?
Unit test added.
Closes#25806 from viirya/SPARK-29100.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This pr proposes to define an individual method for each common subexpression in HashAggregateExec. In the current master, the common subexpr elimination code in HashAggregateExec is expanded in a single method; 4664a082c2/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala (L397)
The method size can be too big for JIT compilation, so I believe splitting it is beneficial for performance. For example, in a query `SELECT SUM(a + b), AVG(a + b + c) FROM VALUES (1, 1, 1) t(a, b, c)`,
the current master generates;
```
/* 098 */ private void agg_doConsume_0(InternalRow localtablescan_row_0, int agg_expr_0_0, int agg_expr_1_0, int agg_expr_2_0) throws java.io.IOException {
/* 099 */ // do aggregate
/* 100 */ // common sub-expressions
/* 101 */ int agg_value_6 = -1;
/* 102 */
/* 103 */ agg_value_6 = agg_expr_0_0 + agg_expr_1_0;
/* 104 */
/* 105 */ int agg_value_5 = -1;
/* 106 */
/* 107 */ agg_value_5 = agg_value_6 + agg_expr_2_0;
/* 108 */ boolean agg_isNull_4 = false;
/* 109 */ long agg_value_4 = -1L;
/* 110 */ if (!false) {
/* 111 */ agg_value_4 = (long) agg_value_5;
/* 112 */ }
/* 113 */ int agg_value_10 = -1;
/* 114 */
/* 115 */ agg_value_10 = agg_expr_0_0 + agg_expr_1_0;
/* 116 */ // evaluate aggregate functions and update aggregation buffers
/* 117 */ agg_doAggregate_sum_0(agg_value_10);
/* 118 */ agg_doAggregate_avg_0(agg_value_4, agg_isNull_4);
/* 119 */
/* 120 */ }
```
On the other hand, this pr generates;
```
/* 121 */ private void agg_doConsume_0(InternalRow localtablescan_row_0, int agg_expr_0_0, int agg_expr_1_0, int agg_expr_2_0) throws java.io.IOException {
/* 122 */ // do aggregate
/* 123 */ // common sub-expressions
/* 124 */ long agg_subExprValue_0 = agg_subExpr_0(agg_expr_2_0, agg_expr_0_0, agg_expr_1_0);
/* 125 */ int agg_subExprValue_1 = agg_subExpr_1(agg_expr_0_0, agg_expr_1_0);
/* 126 */ // evaluate aggregate functions and update aggregation buffers
/* 127 */ agg_doAggregate_sum_0(agg_subExprValue_1);
/* 128 */ agg_doAggregate_avg_0(agg_subExprValue_0);
/* 129 */
/* 130 */ }
```
I run some micro benchmarks for this pr;
```
(base) maropu~:$system_profiler SPHardwareDataType
Hardware:
Hardware Overview:
Processor Name: Intel Core i5
Processor Speed: 2 GHz
Number of Processors: 1
Total Number of Cores: 2
L2 Cache (per Core): 256 KB
L3 Cache: 4 MB
Memory: 8 GB
(base) maropu~:$java -version
java version "1.8.0_181"
Java(TM) SE Runtime Environment (build 1.8.0_181-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.181-b13, mixed mode)
(base) maropu~:$ /bin/spark-shell --master=local[1] --conf spark.driver.memory=8g --conf spark.sql.shurtitions=1 -v
val numCols = 40
val colExprs = "id AS key" +: (0 until numCols).map { i => s"id AS _c$i" }
spark.range(3000000).selectExpr(colExprs: _*).createOrReplaceTempView("t")
val aggExprs = (2 until numCols).map { i =>
(0 until i).map(d => s"_c$d")
.mkString("AVG(", " + ", ")")
}
// Drops the time of a first run then pick that of a second run
timer { sql(s"SELECT ${aggExprs.mkString(", ")} FROM t").write.format("noop").save() }
// the master
maxCodeGen: 12957
Elapsed time: 36.309858661s
// this pr
maxCodeGen=4184
Elapsed time: 2.399490285s
```
### Why are the changes needed?
To avoid the too-long-function issue in JVMs.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Added tests in `WholeStageCodegenSuite`
Closes#25710 from maropu/SplitSubexpr.
Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
…create table through spark-sql or beeline
## What changes were proposed in this pull request?
fix table owner use user instead of principal when create table through spark-sql
private val userName = conf.getUser will get ugi's userName which is principal info, and i copy the source code into HiveClientImpl, and use ugi.getShortUserName() instead of ugi.getUserName(). The owner display correctly.
## How was this patch tested?
1. create a table in kerberos cluster
2. use "desc formatted tbName" check owner
Please review http://spark.apache.org/contributing.html before opening a pull request.
Closes#23952 from hddong/SPARK-26929-fix-table-owner.
Lead-authored-by: hongdd <jn_hdd@163.com>
Co-authored-by: hongdongdong <hongdongdong@cmss.chinamobile.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
### What changes were proposed in this pull request?
This pr proposes to print bytecode statistics (max class bytecode size, max method bytecode size, max constant pool size, and # of inner classes) for generated classes in debug prints, `debugCodegen`. Since these metrics are critical for codegen framework developments, I think its worth printing there. This pr intends to enable `debugCodegen` to print these metrics as following;
```
scala> sql("SELECT sum(v) FROM VALUES(1) t(v)").debugCodegen
Found 2 WholeStageCodegen subtrees.
== Subtree 1 / 2 (maxClassCodeSize:2693; maxMethodCodeSize:124; maxConstantPoolSize:130(0.20% used); numInnerClasses:0) ==
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
*(1) HashAggregate(keys=[], functions=[partial_sum(cast(v#0 as bigint))], output=[sum#5L])
+- *(1) LocalTableScan [v#0]
Generated code:
/* 001 */ public Object generate(Object[] references) {
/* 002 */ return new GeneratedIteratorForCodegenStage1(references);
/* 003 */ }
...
```
### Why are the changes needed?
For efficient developments
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Manually tested
Closes#25766 from maropu/PrintBytecodeStats.
Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
It's more clear to only do table lookup in `ResolveTables` rule (for v2 tables) and `ResolveRelations` rule (for v1 tables). `ResolveInsertInto` should only resolve the `InsertIntoStatement` with resolved relations.
### Why are the changes needed?
to make the code simpler
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
existing tests
Closes#25774 from cloud-fan/simplify.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR add the `namespaces` keyword to `TableIdentifierParserSuite`.
### Why are the changes needed?
Improve the test.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
N/A
Closes#25758 from highmoutain/3.0bugfix.
Authored-by: changchun.wang <251922566@qq.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This patch fixes the bug regarding NPE in SQLConf.get, which is only possible when SparkContext._dagScheduler is null due to stopping SparkContext. The logic doesn't seem to consider active SparkContext could be in progress of stopping.
Note that it can't be encountered easily as SparkContext.stop() blocks the main thread, but there're many cases which SQLConf.get is accessed concurrently while SparkContext.stop() is executing - users run another threads, or listener is accessing SQLConf.get after dagScheduler is set to null (this is the case what I encountered.)
### Why are the changes needed?
The bug brings NPE.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Previous patch #25753 was tested with new UT, and due to disruption with other tests in concurrent test run, the test is excluded in this patch.
Closes#25790 from HeartSaVioR/SPARK-29046-v2.
Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
In the PR, I propose to fix comments of date-time expressions, and replace the `yyyy` pattern by `uuuu` when the implementation supposes the former one.
### Why are the changes needed?
To make comments consistent to implementations.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
By running Scala Style checker.
Closes#25796 from MaxGekk/year-pattern-uuuu-followup.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
ThriftServerSessionPage displays timestamp 0 (1970/01/01) instead of nothing if query finish time and close time are not set.
![image](https://user-images.githubusercontent.com/25019163/64711118-6d578000-d4b9-11e9-9b11-2e3616319a98.png)
Change it to display nothing, like ThriftServerPage.
### Why are the changes needed?
Obvious bug.
### Does this PR introduce any user-facing change?
Finish time and Close time will be displayed correctly on ThriftServerSessionPage in JDBC/ODBC Spark UI.
### How was this patch tested?
Manual test.
Closes#25762 from juliuszsompolski/SPARK-29056.
Authored-by: Juliusz Sompolski <julek@databricks.com>
Signed-off-by: Yuming Wang <wgyumg@gmail.com>
### What changes were proposed in this pull request?
The `Column.isInCollection()` with a large size collection will generate an expression with large size children expressions. This make analyzer and optimizer take a long time to run.
In this PR, in `isInCollection()` function, directly generate `InSet` expression, avoid generating too many children expressions.
### Why are the changes needed?
`Column.isInCollection()` with a large size collection sometimes become a bottleneck when running sql.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Manually benchmark it in spark-shell:
```
def testExplainTime(collectionSize: Int) = {
val df = spark.range(10).withColumn("id2", col("id") + 1)
val list = Range(0, collectionSize).toList
val startTime = System.currentTimeMillis()
df.where(col("id").isInCollection(list)).where(col("id2").isInCollection(list)).explain()
val elapsedTime = System.currentTimeMillis() - startTime
println(s"cost time: ${elapsedTime}ms")
}
```
Then test on collection size 5, 10, 100, 1000, 10000, test result is:
collection size | explain time (before) | explain time (after)
------ | ------ | ------
5 | 26ms | 29ms
10 | 30ms | 48ms
100 | 104ms | 50ms
1000 | 1202ms | 58ms
10000 | 10012ms | 523ms
Closes#25754 from WeichenXu123/improve_in_collection.
Lead-authored-by: WeichenXu <weichen.xu@databricks.com>
Co-authored-by: Xiao Li <gatorsmile@gmail.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
### What changes were proposed in this pull request?
This PR adds a utility class `AdaptiveSparkPlanHelper` which provides methods related to tree traversal of an `AdaptiveSparkPlanExec` plan. Unlike their counterparts in `TreeNode` or
`QueryPlan`, these methods traverse down leaf nodes of adaptive plans, i.e., `AdaptiveSparkPlanExec` and `QueryStageExec`.
### Why are the changes needed?
This utility class can greatly simplify tree traversal code for adaptive spark plans.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Refined `AdaptiveQueryExecSuite` with the help of the new utility methods.
Closes#25764 from maryannxue/aqe-utils.
Authored-by: maryannxue <maryannxue@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
In the PR, I propose to extend `ExtractBenchmark` and add new ones for:
- `EXTRACT` and `DATE` as input column
- the `DATE_PART` function and `DATE`/`TIMESTAMP` input column
### Why are the changes needed?
The `EXTRACT` expression is rebased on the `DATE_PART` expression by the PR https://github.com/apache/spark/pull/25410 where some of sub-expressions take `DATE` column as the input (`Millennium`, `Year` and etc.) but others require `TIMESTAMP` column (`Hour`, `Minute`). Separate benchmarks for `DATE` should exclude overhead of implicit conversions `DATE` <-> `TIMESTAMP`.
### Does this PR introduce any user-facing change?
No, it doesn't.
### How was this patch tested?
- Regenerated results of `ExtractBenchmark`
Closes#25772 from MaxGekk/date_part-benchmark.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
reorganize the packages of DS v2 interfaces/classes:
1. `org.spark.sql.connector.catalog`: put `TableCatalog`, `Table` and other related interfaces/classes
2. `org.spark.sql.connector.expression`: put `Expression`, `Transform` and other related interfaces/classes
3. `org.spark.sql.connector.read`: put `ScanBuilder`, `Scan` and other related interfaces/classes
4. `org.spark.sql.connector.write`: put `WriteBuilder`, `BatchWrite` and other related interfaces/classes
### Why are the changes needed?
Data Source V2 has evolved a lot. It's a bit weird that `Expression` is in `org.spark.sql.catalog.v2` and `Table` is in `org.spark.sql.sources.v2`.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
existing tests
Closes#25700 from cloud-fan/package.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
In method `SQLMetricsTestUtils.testMetricsDynamicPartition()`, there is a CREATE TABLE sentence without `withTable` block. It causes test failure if use same table name in other unit tests.
### Why are the changes needed?
To avoid "table already exists" in tests.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Exist UT
Closes#25752 from LantaoJin/SPARK-29045.
Authored-by: LantaoJin <jinlantao@gmail.com>
Signed-off-by: Yuming Wang <wgyumg@gmail.com>
# What changes were proposed in this pull request?
This patch fixes the bug regarding NPE in SQLConf.get, which is only possible when SparkContext._dagScheduler is null due to stopping SparkContext. The logic doesn't seem to consider active SparkContext could be in progress of stopping.
Note that it can't be encountered easily as `SparkContext.stop()` blocks the main thread, but there're many cases which SQLConf.get is accessed concurrently while SparkContext.stop() is executing - users run another threads, or listener is accessing SQLConf.get after dagScheduler is set to null (this is the case what I encountered.)
### Why are the changes needed?
The bug brings NPE.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Added new UT to verify NPE doesn't occur. Without patch, the test fails with throwing NPE.
Closes#25753 from HeartSaVioR/SPARK-29046.
Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
JIRA :https://issues.apache.org/jira/browse/SPARK-29050
'a hdfs' change into 'an hdfs'
'an unique' change into 'a unique'
'an url' change into 'a url'
'a error' change into 'an error'
Closes#25756 from dengziming/feature_fix_typos.
Authored-by: dengziming <dengziming@growingio.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Remove `InsertIntoTable` and replace it's usage by `InsertIntoStatement`
### Why are the changes needed?
`InsertIntoTable` and `InsertIntoStatement` are almost identical (except some namings). It doesn't make sense to keep 2 identical plans. After the removal of `InsertIntoTable`, the analysis process becomes:
1. parser creates `InsertIntoStatement`
2. v2 rule `ResolveInsertInto` converts `InsertIntoStatement` to v2 commands.
3. v1 rules like `DataSourceAnalysis` and `HiveAnalysis` convert `InsertIntoStatement` to v1 commands.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
existing tests
Closes#25763 from cloud-fan/remove.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
- For trait without companion object constructor, currently the method to get constructor parameters `constructParams` in `ScalaReflection` will throw exception.
```
scala.ScalaReflectionException: <none> is not a term
at scala.reflect.api.Symbols$SymbolApi.asTerm(Symbols.scala:211)
at scala.reflect.api.Symbols$SymbolApi.asTerm$(Symbols.scala:211)
at scala.reflect.internal.Symbols$SymbolContextApiImpl.asTerm(Symbols.scala:106)
at org.apache.spark.sql.catalyst.ScalaReflection.getCompanionConstructor(ScalaReflection.scala:909)
at org.apache.spark.sql.catalyst.ScalaReflection.constructParams(ScalaReflection.scala:914)
at org.apache.spark.sql.catalyst.ScalaReflection.constructParams$(ScalaReflection.scala:912)
at org.apache.spark.sql.catalyst.ScalaReflection$.constructParams(ScalaReflection.scala:47)
at org.apache.spark.sql.catalyst.ScalaReflection.getConstructorParameters(ScalaReflection.scala:890)
at org.apache.spark.sql.catalyst.ScalaReflection.getConstructorParameters$(ScalaReflection.scala:886)
at org.apache.spark.sql.catalyst.ScalaReflection$.getConstructorParameters(ScalaReflection.scala:47)
```
- Instead this PR would throw exception:
```
Unable to find constructor for type [XXX]. This could happen if [XXX] is an interface or a trait without companion object constructor
UnsupportedOperationException:
```
In the normal usage of ExpressionEncoder, this can happen if the type is interface extending `scala.Product`. Also, since this is a protected method, this could have been other arbitrary types without constructor.
### Why are the changes needed?
- The error message `<none> is not a term` isn't helpful for users to understand the problem.
### Does this PR introduce any user-facing change?
- The exception would be thrown instead of runtime exception from the `scala.ScalaReflectionException`.
### How was this patch tested?
- Added a unit test to illustrate the `type` where expression encoder will fail and trigger the proposed error message.
Closes#25736 from mickjermsurawong-stripe/SPARK-29026.
Authored-by: Mick Jermsurawong <mickjermsurawong@stripe.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Current Spark Thrift Server return TypeInfo includes
1. INTERVAL_YEAR_MONTH
2. INTERVAL_DAY_TIME
3. UNION
4. USER_DEFINED
Spark doesn't support INTERVAL_YEAR_MONTH, INTERVAL_YEAR_MONTH, UNION
and won't return USER)DEFINED type.
This PR overwrite GetTypeInfoOperation with SparkGetTypeInfoOperation to exclude types which we don't need.
In hive-1.2.1 Type class is `org.apache.hive.service.cli.Type`
In hive-2.3.x Type class is `org.apache.hadoop.hive.serde2.thrift.Type`
Use ThrifrserverShimUtils to fit version problem and exclude types we don't need
### Why are the changes needed?
We should return type info of Spark's own type info
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Manuel test & Added UT
Closes#25694 from AngersZhuuuu/SPARK-28982.
Lead-authored-by: angerszhu <angers.zhu@gmail.com>
Co-authored-by: AngersZhuuuu <angers.zhu@gmail.com>
Signed-off-by: Yuming Wang <wgyumg@gmail.com>
### What changes were proposed in this pull request?
Implement the SHOW DATABASES logical and physical plans for data source v2 tables.
### Why are the changes needed?
To support `SHOW DATABASES` SQL commands for v2 tables.
### Does this PR introduce any user-facing change?
`spark.sql("SHOW DATABASES")` will return namespaces if the default catalog is set:
```
+---------------+
| namespace|
+---------------+
| ns1|
| ns1.ns1_1|
|ns1.ns1_1.ns1_2|
+---------------+
```
### How was this patch tested?
Added unit tests to `DataSourceV2SQLSuite`.
Closes#25601 from imback82/show_databases.
Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Removes useless `Properties` created according to hvanhovell 's suggestion.
### Why are the changes needed?
Avoid useless code.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
existing UTs
Closes#25742 from mgaido91/SPARK-28939_followup.
Authored-by: Marco Gaido <marcogaido91@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
The intent to use the --hiveconf/--hivevar parameter is just an initialization value, so setting it once in ```SparkSQLSessionManager#openSession``` is sufficient, and each time the ```SparkExecuteStatementOperation``` setting causes the variable to not be modified.
### Why are the changes needed?
It is wrong to set the --hivevar/--hiveconf variable in every ```SparkExecuteStatementOperation```, which prevents variable updates.
### Does this PR introduce any user-facing change?
```
cat <<EOF > test.sql
select '\${a}', '\${b}';
set b=bvalue_MOD_VALUE;
set b;
EOF
beeline -u jdbc:hive2://localhost:10000 --hiveconf a=avalue --hivevar b=bvalue -f test.sql
```
current result:
```
+-----------------+-----------------+--+
| avalue | bvalue |
+-----------------+-----------------+--+
| avalue | bvalue |
+-----------------+-----------------+--+
+-----------------+-----------------+--+
| key | value |
+-----------------+-----------------+--+
| b | bvalue |
+-----------------+-----------------+--+
1 row selected (0.022 seconds)
```
after modification:
```
+-----------------+-----------------+--+
| avalue | bvalue |
+-----------------+-----------------+--+
| avalue | bvalue |
+-----------------+-----------------+--+
+-----------------+-----------------+--+
| key | value |
+-----------------+-----------------+--+
| b | bvalue_MOD_VALUE|
+-----------------+-----------------+--+
1 row selected (0.022 seconds)
```
### How was this patch tested?
modified the existing unit test
Closes#25722 from cxzl25/fix_SPARK-26598.
Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: Yuming Wang <wgyumg@gmail.com>
### What changes were proposed in this pull request?
This PR aims to recover the JDK11 compilation with a workaround.
For now, the master branch is broken like the following due to a [Scala bug](https://github.com/scala/bug/issues/10418) which is fixed in `2.13.0-RC2`.
```
[ERROR] [Error] /spark/sql/core/src/main/scala/org/apache/spark/sql/execution/SQLExecutionRDD.scala:42: ambiguous reference to overloaded definition,
both method putAll in class Properties of type (x$1: java.util.Map[_, _])Unit
and method putAll in class Hashtable of type (x$1: java.util.Map[_ <: Object, _ <: Object])Unit
match argument types (java.util.Map[String,String])
```
- https://github.com/apache/spark/actions (JDK11 build monitoring)
### Why are the changes needed?
This workaround recovers JDK11 compilation.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Manual build with JDK11 because this is JDK11 compilation fix.
- Jenkins builds with JDK8 and tests with JDK11.
- GitHub action will verify this after merging.
Closes#25738 from dongjoon-hyun/SPARK-28939.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Remove the project node if the streaming scan is columnar
### Why are the changes needed?
This is a followup of https://github.com/apache/spark/pull/25586. Batch and streaming share the same DS v2 read API so both can support columnar reads. We should apply #25586 to streaming scan as well.
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
existing tests
Closes#25727 from cloud-fan/follow.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?
This is a ANSI SQL and feature id is `T312`
```
<binary overlay function> ::=
OVERLAY <left paren> <binary value expression> PLACING <binary value expression>
FROM <start position> [ FOR <string length> ] <right paren>
```
This PR related to https://github.com/apache/spark/pull/24918 and support treat byte array.
ref: https://www.postgresql.org/docs/11/functions-binarystring.html
## How was this patch tested?
new UT.
There are some show of the PR on my production environment.
```
spark-sql> select overlay(encode('Spark SQL', 'utf-8') PLACING encode('_', 'utf-8') FROM 6);
Spark_SQL
Time taken: 0.285 s
spark-sql> select overlay(encode('Spark SQL', 'utf-8') PLACING encode('CORE', 'utf-8') FROM 7);
Spark CORE
Time taken: 0.202 s
spark-sql> select overlay(encode('Spark SQL', 'utf-8') PLACING encode('ANSI ', 'utf-8') FROM 7 FOR 0);
Spark ANSI SQL
Time taken: 0.165 s
spark-sql> select overlay(encode('Spark SQL', 'utf-8') PLACING encode('tructured', 'utf-8') FROM 2 FOR 4);
Structured SQL
Time taken: 0.141 s
```
Closes#25172 from beliefer/ansi-overlay-byte-array.
Lead-authored-by: gengjiaan <gengjiaan@360.cn>
Co-authored-by: Jiaan Geng <beliefer@163.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
- Remove SQLContext.createExternalTable and Catalog.createExternalTable, deprecated in favor of createTable since 2.2.0, plus tests of deprecated methods
- Remove HiveContext, deprecated in 2.0.0, in favor of `SparkSession.builder.enableHiveSupport`
- Remove deprecated KinesisUtils.createStream methods, plus tests of deprecated methods, deprecate in 2.2.0
- Remove deprecated MLlib (not Spark ML) linear method support, mostly utility constructors and 'train' methods, and associated docs. This includes methods in LinearRegression, LogisticRegression, Lasso, RidgeRegression. These have been deprecated since 2.0.0
- Remove deprecated Pyspark MLlib linear method support, including LogisticRegressionWithSGD, LinearRegressionWithSGD, LassoWithSGD
- Remove 'runs' argument in KMeans.train() method, which has been a no-op since 2.0.0
- Remove deprecated ChiSqSelector isSorted protected method
- Remove deprecated 'yarn-cluster' and 'yarn-client' master argument in favor of 'yarn' and deploy mode 'cluster', etc
Notes:
- I was not able to remove deprecated DataFrameReader.json(RDD) in favor of DataFrameReader.json(Dataset); the former was deprecated in 2.2.0, but, it is still needed to support Pyspark's .json() method, which can't use a Dataset.
- Looks like SQLContext.createExternalTable was not actually deprecated in Pyspark, but, almost certainly was meant to be? Catalog.createExternalTable was.
- I afterwards noted that the toDegrees, toRadians functions were almost removed fully in SPARK-25908, but Felix suggested keeping just the R version as they hadn't been technically deprecated. I'd like to revisit that. Do we really want the inconsistency? I'm not against reverting it again, but then that implies leaving SQLContext.createExternalTable just in Pyspark too, which seems weird.
- I *kept* LogisticRegressionWithSGD, LinearRegressionWithSGD, LassoWithSGD, RidgeRegressionWithSGD in Pyspark, though deprecated, as it is hard to remove them (still used by StreamingLogisticRegressionWithSGD?) and they are not fully removed in Scala. Maybe should not have been deprecated.
### Why are the changes needed?
Deprecated items are easiest to remove in a major release, so we should do so as much as possible for Spark 3. This does not target items deprecated 'recently' as of Spark 2.3, which is still 18 months old.
### Does this PR introduce any user-facing change?
Yes, in that deprecated items are removed from some public APIs.
### How was this patch tested?
Existing tests.
Closes#25684 from srowen/SPARK-28980.
Lead-authored-by: Sean Owen <sean.owen@databricks.com>
Co-authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request?
The PR proposes to create a custom `RDD` which enables to propagate `SQLConf` also in cases not tracked by SQL execution, as it happens when a `Dataset` is converted to and RDD either using `.rdd` or `.queryExecution.toRdd` and then the returned RDD is used to invoke actions on it.
In this way, SQL configs are effective also in these cases, while earlier they were ignored.
### Why are the changes needed?
Without this patch, all the times `.rdd` or `.queryExecution.toRdd` are used, all the SQL configs set are ignored. An example of a reproducer can be:
```
withSQLConf(SQLConf.SUBEXPRESSION_ELIMINATION_ENABLED.key, "false") {
val df = spark.range(2).selectExpr((0 to 5000).map(i => s"id as field_$i"): _*)
df.createOrReplaceTempView("spark64kb")
val data = spark.sql("select * from spark64kb limit 10")
// Subexpression elimination is used here, despite it should have been disabled
data.describe()
}
```
### Does this PR introduce any user-facing change?
When a user calls `.queryExecution.toRdd`, a `SQLExecutionRDD` is returned wrapping the `RDD` of the execute. When `.rdd` is used, an additional `SQLExecutionRDD` is present in the hierarchy.
### How was this patch tested?
added UT
Closes#25643 from mgaido91/SPARK-28939.
Authored-by: Marco Gaido <marcogaido91@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?
The `V2SessionCatalog` has 2 functionalities:
1. work as an adapter: provide v2 APIs and translate calls to the `SessionCatalog`.
2. allow users to extend it, so that they can add hooks to apply custom logic before calling methods of the builtin catalog (session catalog).
To leverage the second functionality, users must extend `V2SessionCatalog` which is an internal class. There is no doc to explain this usage.
This PR does 2 things:
1. refine the document of the config `spark.sql.catalog.session`.
2. add a public abstract class `CatalogExtension` for users to write implementations.
TODOs for followup PRs:
1. discuss if we should allow users to completely overwrite the v2 session catalog with a new one.
2. discuss to change the name of session catalog, so that it's less likely to conflict with existing namespace names.
## How was this patch tested?
existing tests
Closes#25104 from cloud-fan/session-catalog.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?
`bin/spark-shell` support query interval value:
```scala
scala> spark.sql("SELECT interval 3 months 1 hours AS i").show(false)
+-------------------------+
|i |
+-------------------------+
|interval 3 months 1 hours|
+-------------------------+
```
But `sbin/start-thriftserver.sh` can't support query interval value:
```sql
0: jdbc:hive2://localhost:10000/default> SELECT interval 3 months 1 hours AS i;
Error: java.lang.IllegalArgumentException: Unrecognized type name: interval (state=,code=0)
```
This PR maps `CalendarIntervalType` to `StringType` for `TableSchema` to make Thriftserver support query interval value because we do not support `INTERVAL_YEAR_MONTH` type and `INTERVAL_DAY_TIME`:
02c33694c8/sql/hive-thriftserver/v1.2.1/src/main/java/org/apache/hive/service/cli/Type.java (L73-L78)
[SPARK-27791](https://issues.apache.org/jira/browse/SPARK-27791): Support SQL year-month INTERVAL type
[SPARK-27793](https://issues.apache.org/jira/browse/SPARK-27793): Support SQL day-time INTERVAL type
## How was this patch tested?
unit tests
Closes#25277 from wangyum/Thriftserver-support-interval-type.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
### What changes were proposed in this pull request?
When we set spark.sql.decimalOperations.allowPrecisionLoss to false.
For the sql below, the result will overflow and return null.
Case a:
`select case when 1=2 then 1 else 1.000000000000000000000001 end * 1`
Similar with the division operation.
This sql below will lost precision.
Case b:
`select case when 1=2 then 1 else 1.000000000000000000000001 end / 1`
Let us check the code of TypeCoercion.scala.
a75467432e/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala (L864-L875).
For binaryOperator, if the two operands have differnt datatype, rule ImplicitTypeCasts will find a common type and cast both operands to common type.
So, for these cases menthioned, their left operand is Decimal(34, 24) and right operand is Literal.
Their common type is Decimal(34,24), and Literal(1) will be casted to Decimal(34,24).
Then both operands are decimal type and they will be processed by decimalAndDecimal method of DecimalPrecision class.
Let's check the relative code.
a75467432e/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/DecimalPrecision.scala (L123-L153)
When we don't allow precision loss, the result type of multiply operation in case a is Decimal(38, 38), and that of division operation in case b is Decimal(38, 20).
Then the multi operation in case a will overflow and division operation in case b will lost precision.
In this PR, we skip to handle the binaryOperator if DecimalType operands are involved and rule `DecimalPrecision` will handle it.
### Why are the changes needed?
Data will corrupt without this change.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Unit test.
Closes#25701 from turboFei/SPARK-29000.
Authored-by: turbofei <fwang12@ebay.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
The PR proposes to split the code for subexpression elimination before inlining the function calls all in the apply method for `Generate[Mutable|Unsafe]Projection`.
### Why are the changes needed?
Before this PR, code generation can fail due to the 64KB code size limit if a lot of subexpression elimination functions are generated. The added UT is a reproducer for the issue (thanks to the JIRA reporter and HyukjinKwon for it).
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
added UT
Closes#25642 from mgaido91/SPARK-28916.
Authored-by: Marco Gaido <marcogaido91@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This pr cleans up string template formats for generated code in HashAggregateExec. This changes comes from rednaxelafx comment: https://github.com/apache/spark/pull/20965#discussion_r316418729
### Why are the changes needed?
To improve code-readability.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
N/A
Closes#25714 from maropu/SPARK-21870-FOLLOWUP.
Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
This PR aims to avoid AQE regressions by avoiding changing a sort merge join to a broadcast hash join when the expected build plan has a high ratio of empty partitions, in which case sort merge join can actually perform faster. This PR achieves this by adding an internal join hint in order to let the planner know which side has this high ratio of empty partitions and it should avoid planning it as a build plan of a BHJ. Still, it won't affect the other side if the other side qualifies for a build plan of a BHJ.
### Why are the changes needed?
It is a performance improvement for AQE.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Added UT.
Closes#25703 from maryannxue/aqe-demote-bhj.
Authored-by: maryannxue <maryannxue@apache.org>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
## What changes were proposed in this pull request?
In the PR, I propose new function `date_part()`. The function is modeled on the traditional Ingres equivalent to the SQL-standard function `extract`:
```
date_part('field', source)
```
and added for feature parity with PostgreSQL (https://www.postgresql.org/docs/11/functions-datetime.html#FUNCTIONS-DATETIME-EXTRACT).
The `source` can have `DATE` or `TIMESTAMP` type. Supported string values of `'field'` are:
- `millennium` - the current millennium for given date (or a timestamp implicitly casted to a date). For example, years in the 1900s are in the second millennium. The third millennium started _January 1, 2001_.
- `century` - the current millennium for given date (or timestamp). The first century starts at 0001-01-01 AD.
- `decade` - the current decade for given date (or timestamp). Actually, this is the year field divided by 10.
- isoyear` - the ISO 8601 week-numbering year that the date falls in. Each ISO 8601 week-numbering year begins with the Monday of the week containing the 4th of January.
- `year`, `month`, `day`, `hour`, `minute`, `second`
- `week` - the number of the ISO 8601 week-numbering week of the year. By definition, ISO weeks start on Mondays and the first week of a year contains January 4 of that year.
- `quarter` - the quarter of the year (1 - 4)
- `dayofweek` - the day of the week for date/timestamp (1 = Sunday, 2 = Monday, ..., 7 = Saturday)
- `dow` - the day of the week as Sunday (0) to Saturday (6)
- `isodow` - the day of the week as Monday (1) to Sunday (7)
- `doy` - the day of the year (1 - 365/366)
- `milliseconds` - the seconds field including fractional parts multiplied by 1,000.
- `microseconds` - the seconds field including fractional parts multiplied by 1,000,000.
- `epoch` - the number of seconds since 1970-01-01 00:00:00 local time in microsecond precision.
Here are examples:
```sql
spark-sql> select date_part('year', timestamp'2019-08-12 01:00:00.123456');
2019
spark-sql> select date_part('week', timestamp'2019-08-12 01:00:00.123456');
33
spark-sql> select date_part('doy', timestamp'2019-08-12 01:00:00.123456');
224
```
I changed implementation of `extract` to re-use `date_part()` internally.
## How was this patch tested?
Added `date_part.sql` and regenerated results of `extract.sql`.
Closes#25410 from MaxGekk/date_part.
Lead-authored-by: Maxim Gekk <max.gekk@gmail.com>
Co-authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
Add a listListeners() method to StreamingQueryManager that lists all StreamingQueryListeners that have been added to that manager.
### Why are the changes needed?
While it's best practice to keep handles on all listeners added, it's still nice to have an API to be able to list what listeners have been added to a StreamingQueryManager.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Modified existing unit tests to use the new API instead of using reflection.
Closes#25518 from mukulmurthy/26046-listener.
Authored-by: Mukul Murthy <mukul.murthy@gmail.com>
Signed-off-by: Jose Torres <torres.joseph.f+github@gmail.com>
## What changes were proposed in this pull request?
This PR disables schema verification and allows schema auto-creation in the Derby database, in case the config for the Metastore is set otherwise.
## How was this patch tested?
NA
Closes#25663 from bogdanghit/hive-schema.
Authored-by: Bogdan Ghit <bogdan.ghit@databricks.com>
Signed-off-by: Yuming Wang <wgyumg@gmail.com>
### What changes were proposed in this pull request?
merge the `V2WriteSupportCheck` and `V2StreamingScanSupportCheck` to one rule: `TableCapabilityCheck`.
### Why are the changes needed?
It's a little confusing to have 2 rules to check DS v2 table capability, while one rule says it checks write and another rule says it checks streaming scan. We can clearly tell it from the rule names that the batch scan check is missing.
It's better to have a centralized place for this check, with a name that clearly says it checks table capability.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
existing tests
Closes#25679 from cloud-fan/dsv2-check.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR proposes to port `pgSQL/aggregates_part3.sql` into UDF test base.
<details><summary>Diff comparing to 'pgSQL/aggregates_part3.sql'</summary>
<p>
```diff
diff --git a/sql/core/src/test/resources/sql-tests/results/pgSQL/aggregates_part3.sql.out b/sql/core/src/test/resources/sql-tests/results/udf/pgSQL/udf-aggregates_part3.sql.out
index f102383cb4d..eff33f280cf 100644
--- a/sql/core/src/test/resources/sql-tests/results/pgSQL/aggregates_part3.sql.out
+++ b/sql/core/src/test/resources/sql-tests/results/udf/pgSQL/udf-aggregates_part3.sql.out
-3,7 +3,7
-- !query 0
-select max(min(unique1)) from tenk1
+select udf(max(min(unique1))) from tenk1
-- !query 0 schema
struct<>
-- !query 0 output
-12,11 +12,11 It is not allowed to use an aggregate function in the argument of another aggreg
-- !query 1
-select (select count(*)
- from (values (1)) t0(inner_c))
+select udf((select udf(count(*))
+ from (values (1)) t0(inner_c))) as col
from (values (2),(3)) t1(outer_c)
-- !query 1 schema
-struct<scalarsubquery():bigint>
+struct<col:bigint>
-- !query 1 output
1
1
```
</p>
</details>
### Why are the changes needed?
To improve test coverage in UDFs.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Manually tested via:
```bash
build/sbt "sql/test-only *SQLQueryTestSuite -- -z udf/pgSQL/udf-aggregates_part3.sql"
```
as guided in https://issues.apache.org/jira/browse/SPARK-27921Closes#25676 from HyukjinKwon/SPARK-28272.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR proposes to port `pgSQL/aggregates_part4.sql` into UDF test base.
<details><summary>Diff comparing to 'pgSQL/aggregates_part3.sql'</summary>
<p>
```diff
```
</p>
</details>
### Why are the changes needed?
To improve test coverage in UDFs.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Manually tested via:
```bash
build/sbt "sql/test-only *SQLQueryTestSuite -- -z udf/pgSQL/udf-aggregates_part4.sql"
```
as guided in https://issues.apache.org/jira/browse/SPARK-27921Closes#25677 from HyukjinKwon/SPARK-28971.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
`DataFrameReader.json()` accepts a partition column that is of numeric, date or timestamp type, according to the implementation in `JDBCRelation.scala`. Update the scaladoc accordingly, to match the documentation in `sql-data-sources-jdbc.md` too.
### Why are the changes needed?
scaladoc is incorrect.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
N/A
Closes#25687 from srowen/SPARK-28977.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Support generator in aggregate expressions.
In this PR, I check the aggregate logical plan, if its aggregateExpressions include generator, then convert this agg plan into "normal agg plan + generator plan + projection plan". I.e:
```
aggregate(with generator)
|--child_plan
```
===>
```
project
|--generator(resolved)
|--aggregate
|--child_plan
```
### Why are the changes needed?
We should support sql like:
```
select explode(array(min(a), max(a))) from t
```
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Unit test added.
Closes#25512 from WeichenXu123/explode_bug.
Authored-by: WeichenXu <weichen.xu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Remove unnecessary physical projection added to ensure rows are `UnsafeRow` when the DSv2 scan is columnar. This is not needed because conversions are automatically added to convert from columnar operators to `UnsafeRow` when the next operator does not support columnar execution.
### Why are the changes needed?
Removes an extra projection and copy.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes#25586 from rdblue/SPARK-28878-remove-dsv2-project-with-columnar.
Authored-by: Ryan Blue <blue@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Adds the provider information to the table properties in saveAsTable.
### Why are the changes needed?
Otherwise, catalog implementations don't know what kind of Table definition to create.
### Does this PR introduce any user-facing change?
nope
### How was this patch tested?
Existing unit tests check the existence of the provider now.
Closes#25669 from brkyvz/provider.
Authored-by: Burak Yavuz <brkyvz@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Rename `UnresolvedTable` to `V1Table` because it is not unresolved.
### Why are the changes needed?
The class name is inaccurate. This should be fixed before it is in a release.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes#25683 from rdblue/SPARK-28979-rename-unresolved-table.
Authored-by: Ryan Blue <blue@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Replaces some incorrect usage of `new Configuration()` as it will load default configs defined in Hadoop
### Why are the changes needed?
Unexpected config could be accessed instead of the expected config, see SPARK-28203 for example
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Existed tests.
Closes#25616 from advancedxy/remove_invalid_configuration.
Authored-by: Xianjin YE <advancedxy@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request?
This patch implements dynamic partition pruning by adding a dynamic-partition-pruning filter if there is a partitioned table and a filter on the dimension table. The filter is then planned using a heuristic approach:
1. As a broadcast relation if it is a broadcast hash join. The broadcast relation will then be transformed into a reused broadcast exchange by the `ReuseExchange` rule; or
2. As a subquery duplicate if the estimated benefit of partition table scan being saved is greater than the estimated cost of the extra scan of the duplicated subquery; otherwise
3. As a bypassed condition (`true`).
### Why are the changes needed?
This is an important performance feature.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Added UT
- Testing DPP by enabling / disabling the reuse broadcast results feature and / or the subquery duplication feature.
- Testing DPP with reused broadcast results.
- Testing the key iterators on different HashedRelation types.
- Testing the packing and unpacking of the broadcast keys in a LongType.
Closes#25600 from maryannxue/dpp.
Authored-by: maryannxue <maryannxue@apache.org>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
## What changes were proposed in this pull request?
This adds namespace support to V2SessionCatalog.
## How was this patch tested?
WIP: will add tests for v2 session catalog namespace methods.
Closes#25363 from rdblue/SPARK-28628-support-namespaces-in-v2-session-catalog.
Authored-by: Ryan Blue <blue@apache.org>
Signed-off-by: Burak Yavuz <brkyvz@gmail.com>
## What changes were proposed in this pull request?
Datasource table now supports partition tables long ago. This commit adds the ability to translate
the InsertIntoTable(HiveTableRelation) to datasource table insertion.
## How was this patch tested?
Existing tests with some modification
Closes#25306 from advancedxy/SPARK-28573.
Authored-by: Xianjin YE <advancedxy@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?
drop the table after the test `query builtin functions don't call the external catalog` executed
This is required for [SPARK-25464](https://github.com/apache/spark/pull/22466)
## How was this patch tested?
existing UT
Closes#25427 from sandeep-katta/cleanuptable.
Authored-by: sandeep katta <sandeep.katta2007@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
The Experimental and Evolving annotations are both (like Unstable) used to express that a an API may change. However there are many things in the code that have been marked that way since even Spark 1.x. Per the dev thread, anything introduced at or before Spark 2.3.0 is pretty much 'stable' in that it would not change without a deprecation cycle. Therefore I'd like to remove most of these annotations. And, remove the `:: Experimental ::` scaladoc tag too. And likewise for Python, R.
The changes below can be summarized as:
- Generally, anything introduced at or before Spark 2.3.0 has been unmarked as neither Evolving nor Experimental
- Obviously experimental items like DSv2, Barrier mode, ExperimentalMethods are untouched
- I _did_ unmark a few MLlib classes introduced in 2.4, as I am quite confident they're not going to change (e.g. KolmogorovSmirnovTest, PowerIterationClustering)
It's a big change to review, so I'd suggest scanning the list of _files_ changed to see if any area seems like it should remain partly experimental and examine those.
### Why are the changes needed?
Many of these annotations are incorrect; the APIs are de facto stable. Leaving them also makes legitimate usages of the annotations less meaningful.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes#25558 from srowen/SPARK-28855.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
## What changes were proposed in this pull request?
This adds a new write API as proposed in the [SPIP to standardize logical plans](https://issues.apache.org/jira/browse/SPARK-23521). This new API:
* Uses clear verbs to execute writes, like `append`, `overwrite`, `create`, and `replace` that correspond to the new logical plans.
* Only creates v2 logical plans so the behavior is always consistent.
* Does not allow table configuration options for operations that cannot change table configuration. For example, `partitionedBy` can only be called when the writer executes `create` or `replace`.
Here are a few example uses of the new API:
```scala
df.writeTo("catalog.db.table").append()
df.writeTo("catalog.db.table").overwrite($"date" === "2019-06-01")
df.writeTo("catalog.db.table").overwritePartitions()
df.writeTo("catalog.db.table").asParquet.create()
df.writeTo("catalog.db.table").partitionedBy(days($"ts")).createOrReplace()
df.writeTo("catalog.db.table").using("abc").replace()
```
## How was this patch tested?
Added `DataFrameWriterV2Suite` that tests the new write API. Existing tests for v2 plans.
Closes#25354 from rdblue/SPARK-28612-add-data-frame-writer-v2.
Authored-by: Ryan Blue <blue@apache.org>
Signed-off-by: Burak Yavuz <brkyvz@gmail.com>
### What changes were proposed in this pull request?
See https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/109834/testReport/junit/org.apache.spark.sql/SQLQueryTestSuite/
![Screen Shot 2019-08-28 at 4 08 58 PM](https://user-images.githubusercontent.com/6477701/63833484-2a23ea00-c9ae-11e9-91a1-0859cb183fea.png)
```xml
<?xml version="1.0" encoding="UTF-8"?>
<testsuite hostname="C02Y52ZLJGH5" name="org.apache.spark.sql.SQLQueryTestSuite" tests="3" errors="0" failures="0" skipped="0" time="14.475">
...
<testcase classname="org.apache.spark.sql.SQLQueryTestSuite" name="sql - Scala UDF" time="6.703">
</testcase>
<testcase classname="org.apache.spark.sql.SQLQueryTestSuite" name="sql - Regular Python UDF" time="4.442">
</testcase>
<testcase classname="org.apache.spark.sql.SQLQueryTestSuite" name="sql - Scalar Pandas UDF" time="3.33">
</testcase>
<system-out/>
<system-err/>
</testsuite>
```
Root cause seems a bug in SBT - it truncates the test name based on the last dot.
https://github.com/sbt/sbt/issues/2949https://github.com/sbt/sbt/blob/v0.13.18/testing/src/main/scala/sbt/JUnitXmlTestsListener.scala#L71-L79
I tried to find a better way but couldn't find. Therefore, this PR proposes a workaround by appending the test file name into the assert log:
```diff
[info] - inner-join.sql *** FAILED *** (4 seconds, 306 milliseconds)
+ [info] inner-join.sql
[info] Expected "1 a
[info] 1 a
[info] 1 b
[info] 1[]", but got "1 a
[info] 1 a
[info] 1 b
[info] 1[ b]" Result did not match for query #6
[info] SELECT tb.* FROM ta INNER JOIN tb ON ta.a = tb.a AND ta.tag = tb.tag (SQLQueryTestSuite.scala:377)
[info] org.scalatest.exceptions.TestFailedException:
[info] at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:528)
```
It will at least prevent us to search full logs to identify which test file is failed by clicking filed test.
Note that this PR does not fully fix the issue but only fix the logs on its failed tests.
### Why are the changes needed?
To debug Jenkins logs easier. Otherwise, we should open full logs and search which test was failed.
### Does this PR introduce any user-facing change?
It will print out the file name of failed tests in Jenkins' test reports.
### How was this patch tested?
Manually tested but Jenkins tests are required in this PR.
Now it at least shows which file it is:
![Screen Shot 2019-08-30 at 10 16 32 PM](https://user-images.githubusercontent.com/6477701/64023705-de22a200-cb73-11e9-8806-2e98ad35adef.png)
Closes#25630 from HyukjinKwon/SPARK-28894-1.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
## What changes were proposed in this pull request?
This PR aims to add "true", "yes", "1", "false", "no", "0", and unique prefixes as input for the boolean data type and ignore input whitespace. Please see the following what string representations are using for the boolean type in other databases.
https://www.postgresql.org/docs/devel/datatype-boolean.htmlhttps://docs.aws.amazon.com/redshift/latest/dg/r_Boolean_type.html
## How was this patch tested?
Added new tests to CastSuite.
Closes#25458 from younggyuchun/SPARK-27931.
Authored-by: younggyu chun <younggyuchun@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Adds support for the V2SessionCatalog for ALTER TABLE statements.
Implementation changes are ~50 loc. The rest is just test refactoring.
### Why are the changes needed?
To allow V2 DataSources to plug in through a configurable plugin interface without requiring the explicit use of catalog identifiers, and leverage ALTER TABLE statements.
### How was this patch tested?
By re-using existing tests in DataSourceV2SQLSuite.
Closes#25502 from brkyvz/alterV3.
Authored-by: Burak Yavuz <brkyvz@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
There are 2 in-memory `TableCatalog` and `Table` implementations for testing, in sql/catalyst and sql/core. This PR merges them.
After merging, there are 3 classes:
1. `InMemoryTable`
2. `InMemoryTableCatalog`
3. `StagingInMemoryTableCatalog`
For better maintainability, these 3 classes are put in 3 different files.
### Why are the changes needed?
reduce duplicated code
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
N/A
Closes#25610 from cloud-fan/dsv2-test.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Ryan Blue <blue@apache.org>
### What changes were proposed in this pull request?
Disallow conversions between `timestamp` type and `long` type in table insertion with ANSI store assignment policy.
### Why are the changes needed?
In the PR https://github.com/apache/spark/pull/25581, timestamp type is allowed to be converted to long type, since timestamp type is represented by long type internally, and both legacy mode and strict mode allows the conversion.
After reconsideration, I think we should disallow it. As per ANSI SQL section "4.4.2 Characteristics of numbers":
> A number is assignable only to sites of numeric type.
In PostgreSQL, the conversion between timestamp and long is also disallowed.
### Does this PR introduce any user-facing change?
Conversion between timestamp and long is disallowed in table insertion with ANSI store assignment policy.
### How was this patch tested?
Unit test
Closes#25615 from gengliangwang/disallowTimeStampToLong.
Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR replaces the hard-coded non-nullability of the array elements returned by `freqItems()` with a nullability that reflects the original schema. Essentially [the functional change](https://github.com/apache/spark/pull/25575/files#diff-bf59bb9f3dc351f5bf6624e5edd2dcf4R122) to the schema generation is:
```
StructField(name + "_freqItems", ArrayType(dataType, false))
```
Becomes:
```
StructField(name + "_freqItems", ArrayType(dataType, originalField.nullable))
```
Respecting the original nullability prevents issues when Spark depends on `ArrayType`'s `containsNull` being accurate. The example that uncovered this is calling `collect()` on the dataframe (see [ticket](https://issues.apache.org/jira/browse/SPARK-28818) for full repro). Though it's likely that there a several places where this could cause a problem.
I've also refactored a small amount of the surrounding code to remove some unnecessary steps and group together related operations.
### Why are the changes needed?
I think it's pretty clear why this change is needed. It fixes a bug that currently prevents users from calling `df.freqItems.collect()` along with potentially causing other, as yet unknown, issues.
### Does this PR introduce any user-facing change?
Nullability of columns when calling freqItems on them is now respected after the change.
### How was this patch tested?
I added a test that specifically tests the carry-through of the nullability as well as explicitly calling `collect()` to catch the exact regression that was observed. I also ran the test against the old version of the code and it fails as expected.
Closes#25575 from MGHawes/mhawes/SPARK-28818.
Authored-by: Matt Hawes <mhawes@palantir.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Hive 3.1.2 has been released. This PR upgrades the Hive Metastore Client to 3.1.2 for Hive 3.1.
Hive 3.1.2 release notes:
https://issues.apache.org/jira/secure/ReleaseNote.jspa?version=12344397&styleName=Html&projectId=12310843
### Why are the changes needed?
This is an improvement to support a newly release 3.1.2. Otherwise, it will throws `UnsupportedOperationException` if user `set spark.sql.hive.metastore.version=3.1.2`:
```scala
Exception in thread "main" java.lang.UnsupportedOperationException: Unsupported Hive Metastore version (3.1.2). Please set spark.sql.hive.metastore.version with a valid version.
at org.apache.spark.sql.hive.client.IsolatedClientLoader$.hiveVersion(IsolatedClientLoader.scala:109)
```
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Existing UT
Closes#25604 from wangyum/SPARK-28890.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Revise the documentation of SQL option `spark.sql.storeAssignmentPolicy`.
### Why are the changes needed?
1. Need to point out the ANSI mode is mostly the same with PostgreSQL
2. Need to point out Legacy mode allows type coercion as long as it is valid casting
3. Better examples.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Uni test
Closes#25605 from gengliangwang/reviseDoc.
Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
**What changes were proposed in this pull request?**
Moving the call for checkColumnNameDuplication out of generateViewProperties. This way we can choose ifcheckColumnNameDuplication will be performed on analyzed or aliased plan without having to pass an additional argument(aliasedPlan) to generateViewProperties.
Before the pr column name duplication was performed on the query output of below sql(c1, c1) and the pr makes it perform check on the user provided schema of view definition(c1, c2)
**Why are the changes needed?**
Changes are to fix SPARK-23519 bug. Below queries would cause an exception. This pr fixes them and also added a test case.
`CREATE TABLE t23519 AS SELECT 1 AS c1
CREATE VIEW v23519 (c1, c2) AS SELECT c1, c1 FROM t23519`
**Does this PR introduce any user-facing change?**
No
**How was this patch tested?**
new unit test added in SQLViewSuite
Closes#25570 from hem1891/SPARK-23519.
Lead-authored-by: hemanth meka <hmeka@tibco.com>
Co-authored-by: hem1891 <hem1891@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
This is caused by 2 PRs that were merged at the same time:
cb06209fc92b24a71fecCloses#25597 from cloud-fan/hot-fix.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Introduce ANSI store assignment policy for table insertion.
With ANSI policy, Spark performs the type coercion of table insertion as per ANSI SQL.
### Why are the changes needed?
In Spark version 2.4 and earlier, when inserting into a table, Spark will cast the data type of input query to the data type of target table by coercion. This can be super confusing, e.g. users make a mistake and write string values to an int column.
In data source V2, by default, only upcasting is allowed when inserting data into a table. E.g. int -> long and int -> string are allowed, while decimal -> double or long -> int are not allowed. The rules of UpCast was originally created for Dataset type coercion. They are quite strict and different from the behavior of all existing popular DBMS. This is breaking change. It is possible that existing queries are broken after 3.0 releases.
Following ANSI SQL standard makes Spark consistent with the table insertion behaviors of popular DBMS like PostgreSQL/Oracle/Mysql.
### Does this PR introduce any user-facing change?
A new optional mode for table insertion.
### How was this patch tested?
Unit test
Closes#25581 from gengliangwang/ANSImode.
Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Make `spark.sql.crossJoin.enabled` default value true
### Why are the changes needed?
For implicit cross join, we can set up a watchdog to cancel it if running for a long time.
When "spark.sql.crossJoin.enabled" is false, because `CheckCartesianProducts` is implemented in logical plan stage, it may generate some mismatching error which may confuse end user:
* it's done in logical phase, so we may fail queries that can be executed via broadcast join, which is very fast.
* if we move the check to the physical phase, then a query may success at the beginning, and begin to fail when the table size gets larger (other people insert data to the table). This can be quite confusing.
* the CROSS JOIN syntax doesn't work well if join reorder happens.
* some non-equi-join will generate plan using cartesian product, but `CheckCartesianProducts` do not detect it and raise error.
So that in order to address this in simpler way, we can turn off showing this cross-join error by default.
For reference, I list some cases raising mismatching error here:
Providing:
```
spark.range(2).createOrReplaceTempView("sm1") // can be broadcast
spark.range(50000000).createOrReplaceTempView("bg1") // cannot be broadcast
spark.range(60000000).createOrReplaceTempView("bg2") // cannot be broadcast
```
1) Some join could be convert to broadcast nested loop join, but CheckCartesianProducts raise error. e.g.
```
select sm1.id, bg1.id from bg1 join sm1 where sm1.id < bg1.id
```
2) Some join will run by CartesianJoin but CheckCartesianProducts DO NOT raise error. e.g.
```
select bg1.id, bg2.id from bg1 join bg2 where bg1.id < bg2.id
```
### Does this PR introduce any user-facing change?
### How was this patch tested?
Closes#25520 from WeichenXu123/SPARK-28621.
Authored-by: WeichenXu <weichen.xu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR makes `spark.sql.statistics.fallBackToHdfs` not support Hive partitioned tables.
### Why are the changes needed?
The current implementation is incorrect for external partitions and it is expensive to support partitioned table with external partitions.
### Does this PR introduce any user-facing change?
Yes. But I think it will not change the join strategy because partitioned table usually very large.
### How was this patch tested?
unit test
Closes#25584 from wangyum/SPARK-28876.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR add test for set the partitioned bucketed data source table SerDe correctly.
### Why are the changes needed?
Improve test.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
N/A
Closes#25591 from wangyum/SPARK-27592-f1.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?
Currently we have 2 configs to specify which v2 sources should fallback to v1 code path. One config for read path, and one config for write path.
However, I found it's awkward to work with these 2 configs:
1. for `CREATE TABLE USING format`, should this be read path or write path?
2. for `V2SessionCatalog.loadTable`, we need to return `UnresolvedTable` if it's a DS v1 or we need to fallback to v1 code path. However, at that time, we don't know if the returned table will be used for read or write.
We don't have any new features or perf improvement in file source v2. The fallback API is just a safeguard if we have bugs in v2 implementations. There are not many benefits to support falling back to v1 for read and write path separately.
This PR proposes to merge these 2 configs into one.
## How was this patch tested?
existing tests
Closes#25465 from cloud-fan/merge-conf.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR ignores Thrift server `ThriftServerQueryTestSuite`.
### Why are the changes needed?
This ThriftServerQueryTestSuite test case led to frequent Jenkins build failure.
### Does this PR introduce any user-facing change?
Yes.
### How was this patch tested?
N/A
Closes#25592 from wangyum/SPARK-28527-f1.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR adds support for INSERT INTO through both the SQL and DataFrameWriter APIs through the V2SessionCatalog.
### Why are the changes needed?
This will allow V2 tables to be plugged in through the V2SessionCatalog, and be used seamlessly with existing APIs.
### Does this PR introduce any user-facing change?
No behavior changes.
### How was this patch tested?
Pulled out a lot of tests so that they can be shared across the DataFrameWriter and SQL code paths.
Closes#25507 from brkyvz/insertSesh.
Lead-authored-by: Burak Yavuz <brkyvz@gmail.com>
Co-authored-by: Burak Yavuz <burak@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?
This PR aims at improving the way physical plans are explained in spark.
Currently, the explain output for physical plan may look very cluttered and each operator's
string representation can be very wide and wraps around in the display making it little
hard to follow. This especially happens when explaining a query 1) Operating on wide tables
2) Has complex expressions etc.
This PR attempts to split the output into two sections. In the header section, we display
the basic operator tree with a number associated with each operator. In this section, we strictly
control what we output for each operator. In the footer section, each operator is verbosely
displayed. Based on the feedback from Maryann, the uncorrelated subqueries (SubqueryExecs) are not included in the main plan. They are printed separately after the main plan and can be
correlated by the originating expression id from its parent plan.
To illustrate, here is a simple plan displayed in old vs new way.
Example query1 :
```
EXPLAIN SELECT key, Max(val) FROM explain_temp1 WHERE key > 0 GROUP BY key HAVING max(val) > 0
```
Old :
```
*(2) Project [key#2, max(val)#15]
+- *(2) Filter (isnotnull(max(val#3)#18) AND (max(val#3)#18 > 0))
+- *(2) HashAggregate(keys=[key#2], functions=[max(val#3)], output=[key#2, max(val)#15, max(val#3)#18])
+- Exchange hashpartitioning(key#2, 200)
+- *(1) HashAggregate(keys=[key#2], functions=[partial_max(val#3)], output=[key#2, max#21])
+- *(1) Project [key#2, val#3]
+- *(1) Filter (isnotnull(key#2) AND (key#2 > 0))
+- *(1) FileScan parquet default.explain_temp1[key#2,val#3] Batched: true, DataFilters: [isnotnull(key#2), (key#2 > 0)], Format: Parquet, Location: InMemoryFileIndex[file:/user/hive/warehouse/explain_temp1], PartitionFilters: [], PushedFilters: [IsNotNull(key), GreaterThan(key,0)], ReadSchema: struct<key:int,val:int>
```
New :
```
Project (8)
+- Filter (7)
+- HashAggregate (6)
+- Exchange (5)
+- HashAggregate (4)
+- Project (3)
+- Filter (2)
+- Scan parquet default.explain_temp1 (1)
(1) Scan parquet default.explain_temp1 [codegen id : 1]
Output: [key#2, val#3]
(2) Filter [codegen id : 1]
Input : [key#2, val#3]
Condition : (isnotnull(key#2) AND (key#2 > 0))
(3) Project [codegen id : 1]
Output : [key#2, val#3]
Input : [key#2, val#3]
(4) HashAggregate [codegen id : 1]
Input: [key#2, val#3]
(5) Exchange
Input: [key#2, max#11]
(6) HashAggregate [codegen id : 2]
Input: [key#2, max#11]
(7) Filter [codegen id : 2]
Input : [key#2, max(val)#5, max(val#3)#8]
Condition : (isnotnull(max(val#3)#8) AND (max(val#3)#8 > 0))
(8) Project [codegen id : 2]
Output : [key#2, max(val)#5]
Input : [key#2, max(val)#5, max(val#3)#8]
```
Example Query2 (subquery):
```
SELECT * FROM explain_temp1 WHERE KEY = (SELECT Max(KEY) FROM explain_temp2 WHERE KEY = (SELECT Max(KEY) FROM explain_temp3 WHERE val > 0) AND val = 2) AND val > 3
```
Old:
```
*(1) Project [key#2, val#3]
+- *(1) Filter (((isnotnull(KEY#2) AND isnotnull(val#3)) AND (KEY#2 = Subquery scalar-subquery#39)) AND (val#3 > 3))
: +- Subquery scalar-subquery#39
: +- *(2) HashAggregate(keys=[], functions=[max(KEY#26)], output=[max(KEY)#45])
: +- Exchange SinglePartition
: +- *(1) HashAggregate(keys=[], functions=[partial_max(KEY#26)], output=[max#47])
: +- *(1) Project [key#26]
: +- *(1) Filter (((isnotnull(KEY#26) AND isnotnull(val#27)) AND (KEY#26 = Subquery scalar-subquery#38)) AND (val#27 = 2))
: : +- Subquery scalar-subquery#38
: : +- *(2) HashAggregate(keys=[], functions=[max(KEY#28)], output=[max(KEY)#43])
: : +- Exchange SinglePartition
: : +- *(1) HashAggregate(keys=[], functions=[partial_max(KEY#28)], output=[max#49])
: : +- *(1) Project [key#28]
: : +- *(1) Filter (isnotnull(val#29) AND (val#29 > 0))
: : +- *(1) FileScan parquet default.explain_temp3[key#28,val#29] Batched: true, DataFilters: [isnotnull(val#29), (val#29 > 0)], Format: Parquet, Location: InMemoryFileIndex[file:/user/hive/warehouse/explain_temp3], PartitionFilters: [], PushedFilters: [IsNotNull(val), GreaterThan(val,0)], ReadSchema: struct<key:int,val:int>
: +- *(1) FileScan parquet default.explain_temp2[key#26,val#27] Batched: true, DataFilters: [isnotnull(key#26), isnotnull(val#27), (val#27 = 2)], Format: Parquet, Location: InMemoryFileIndex[file:/user/hive/warehouse/explain_temp2], PartitionFilters: [], PushedFilters: [IsNotNull(key), IsNotNull(val), EqualTo(val,2)], ReadSchema: struct<key:int,val:int>
+- *(1) FileScan parquet default.explain_temp1[key#2,val#3] Batched: true, DataFilters: [isnotnull(key#2), isnotnull(val#3), (val#3 > 3)], Format: Parquet, Location: InMemoryFileIndex[file:/user/hive/warehouse/explain_temp1], PartitionFilters: [], PushedFilters: [IsNotNull(key), IsNotNull(val), GreaterThan(val,3)], ReadSchema: struct<key:int,val:int>
```
New:
```
Project (3)
+- Filter (2)
+- Scan parquet default.explain_temp1 (1)
(1) Scan parquet default.explain_temp1 [codegen id : 1]
Output: [key#2, val#3]
(2) Filter [codegen id : 1]
Input : [key#2, val#3]
Condition : (((isnotnull(KEY#2) AND isnotnull(val#3)) AND (KEY#2 = Subquery scalar-subquery#23)) AND (val#3 > 3))
(3) Project [codegen id : 1]
Output : [key#2, val#3]
Input : [key#2, val#3]
===== Subqueries =====
Subquery:1 Hosting operator id = 2 Hosting Expression = Subquery scalar-subquery#23
HashAggregate (9)
+- Exchange (8)
+- HashAggregate (7)
+- Project (6)
+- Filter (5)
+- Scan parquet default.explain_temp2 (4)
(4) Scan parquet default.explain_temp2 [codegen id : 1]
Output: [key#26, val#27]
(5) Filter [codegen id : 1]
Input : [key#26, val#27]
Condition : (((isnotnull(KEY#26) AND isnotnull(val#27)) AND (KEY#26 = Subquery scalar-subquery#22)) AND (val#27 = 2))
(6) Project [codegen id : 1]
Output : [key#26]
Input : [key#26, val#27]
(7) HashAggregate [codegen id : 1]
Input: [key#26]
(8) Exchange
Input: [max#35]
(9) HashAggregate [codegen id : 2]
Input: [max#35]
Subquery:2 Hosting operator id = 5 Hosting Expression = Subquery scalar-subquery#22
HashAggregate (15)
+- Exchange (14)
+- HashAggregate (13)
+- Project (12)
+- Filter (11)
+- Scan parquet default.explain_temp3 (10)
(10) Scan parquet default.explain_temp3 [codegen id : 1]
Output: [key#28, val#29]
(11) Filter [codegen id : 1]
Input : [key#28, val#29]
Condition : (isnotnull(val#29) AND (val#29 > 0))
(12) Project [codegen id : 1]
Output : [key#28]
Input : [key#28, val#29]
(13) HashAggregate [codegen id : 1]
Input: [key#28]
(14) Exchange
Input: [max#37]
(15) HashAggregate [codegen id : 2]
Input: [max#37]
```
Note:
I opened this PR as a WIP to start getting feedback. I will be on vacation starting tomorrow
would not be able to immediately incorporate the feedback. I will start to
work on them as soon as i can. Also, currently this PR provides a basic infrastructure
for explain enhancement. The details about individual operators will be implemented
in follow-up prs
## How was this patch tested?
Added a new test `explain.sql` that tests basic scenarios. Need to add more tests.
Closes#24759 from dilipbiswal/explain_feature.
Authored-by: Dilip Biswal <dbiswal@us.ibm.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Test `spark.sql.redaction.options.regex` with and without default values.
### Why are the changes needed?
Normally, we do not rely on the default value of `spark.sql.redaction.options.regex`.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
N/A
Closes#25579 from wangyum/SPARK-28642-f1.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
### What changes were proposed in this pull request?
This PR implements `SparkGetCatalogsOperation` for Thrift Server metadata completeness.
### Why are the changes needed?
Thrift Server metadata completeness.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Unit test
Closes#25555 from wangyum/SPARK-28852.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
### What changes were proposed in this pull request?
1. Fix the physical plan (`DescribeTableExec`) to have the same output attributes as the corresponding logical plan.
2. Remove `output` in statements since they are unresolved plans.
### Why are the changes needed?
Correctness of how output attributes should work.
### Does this PR introduce any user-facing change?
NO
### How was this patch tested?
Existing tests
Closes#25568 from imback82/describe_table.
Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
This reverts commit 485ae6d181.
Closes#25563 from gatorsmile/revert.
Authored-by: Xiao Li <gatorsmile@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
## What changes were proposed in this pull request?
To follow ANSI SQL, we should support a configurable mode that throws exceptions when casting to integers causes overflow.
The behavior is similar to https://issues.apache.org/jira/browse/SPARK-26218, which throws exceptions on arithmetical operation overflow.
To unify it, the configuration is renamed from "spark.sql.arithmeticOperations.failOnOverFlow" to "spark.sql.failOnIntegerOverFlow"
## How was this patch tested?
Unit test
Closes#25461 from gengliangwang/AnsiCastIntegral.
Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?
Add id to Exchange and Subquery's stringArgs method for easier identifying their reuses in query plans, for example:
```
ReusedExchange d_date_sk#827, BroadcastExchange HashedRelationBroadcastMode(List(cast(input[0, int, true] as bigint))) [id=#2710]
```
Where `2710` is the id of the reused exchange.
## How was this patch tested?
Passes existing tests
Closes#25434 from dbaliafroozeh/ImplementStringArgsExchangeSubqueryExec.
Authored-by: Ali Afroozeh <ali.afroozeh@databricks.com>
Signed-off-by: herman <herman@databricks.com>
### What changes were proposed in this pull request?
This PR removes the `canonicalize(attrs: AttributeSeq)` from `PlanExpression` and taking care of normalizing expressions in `QueryPlan`.
### Why are the changes needed?
`Expression` has already a `canonicalized` method and having the `canonicalize` method in `PlanExpression` is confusing.
### Does this PR introduce any user-facing change?
Removes the `canonicalize` plan from `PlanExpression`. Also renames the `normalizeExprId` to `normalizeExpressions` in query plan.
### How was this patch tested?
This PR is a refactoring and passes the existing tests
Closes#25534 from dbaliafroozeh/ImproveCanonicalizeAPI.
Authored-by: Ali Afroozeh <ali.afroozeh@databricks.com>
Signed-off-by: herman <herman@databricks.com>
## What changes were proposed in this pull request?
Implements the SHOW TABLES logical and physical plans for data source v2 tables.
## How was this patch tested?
Added unit tests to `DataSourceV2SQLSuite`.
Closes#25247 from imback82/dsv2_show_tables.
Lead-authored-by: terryk <yuminkim@gmail.com>
Co-authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR extracts the schema information of TPCDS tables into a separate class called `TPCDSSchema` which can be reused for other testing purposes
### How was this patch tested?
This PR is only a refactoring for tests and passes existing tests
Closes#25535 from dbaliafroozeh/IntroduceTPCDSSchema.
Authored-by: Ali Afroozeh <ali.afroozeh@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR fixes the leak of crc files from CheckpointFileManager when FileContextBasedCheckpointFileManager is being used.
Spark hits the Hadoop bug, [HADOOP-16255](https://issues.apache.org/jira/browse/HADOOP-16255) which seems to be a long-standing issue.
This is there're two `renameInternal` methods:
```
public void renameInternal(Path src, Path dst)
public void renameInternal(final Path src, final Path dst, boolean overwrite)
```
which should be overridden to handle all cases but ChecksumFs only overrides method with 2 params, so when latter is called FilterFs.renameInternal(...) is called instead, and it will do rename with RawLocalFs as underlying filesystem.
The bug is related to FileContext, so FileSystemBasedCheckpointFileManager is not affected.
[SPARK-17475](https://issues.apache.org/jira/browse/SPARK-17475) took a workaround for this bug, but [SPARK-23966](https://issues.apache.org/jira/browse/SPARK-23966) seemed to bring regression.
This PR deletes crc file as "best-effort" when renaming, as failing to delete crc file is not that critical to fail the task.
### Why are the changes needed?
This PR prevents crc files not being cleaned up even purging batches. Too many files in same directory often hurts performance, as well as each crc file occupies more space than its own size so possible to occupy nontrivial amount of space when batches go up to 100000+.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Some unit tests are modified to check leakage of crc files.
Closes#25488 from HeartSaVioR/SPARK-28025.
Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com>
Signed-off-by: Shixiong Zhu <zsxwing@gmail.com>
## What changes were proposed in this pull request?
After all the discussions in the dev list: http://apache-spark-developers-list.1001551.n3.nabble.com/Discuss-Follow-ANSI-SQL-on-table-insertion-td27531.html#a27562.
Here I propose that we can make the store assignment rules in the analyzer configurable, and the behavior of V1 and V2 should be consistent.
When inserting a value into a column with a different data type, Spark will perform type coercion. After this PR, we support 2 policies for the type coercion rules:
legacy and strict.
1. With legacy policy, Spark allows casting any value to any data type. The legacy policy is the only behavior in Spark 2.x and it is compatible with Hive.
2. With strict policy, Spark doesn't allow any possible precision loss or data truncation in type coercion, e.g. `int` and `long`, `float` -> `double` are not allowed.
Eventually, the "legacy" mode will be removed, so it is disallowed in data source V2.
To ensure backward compatibility with existing queries, the default store assignment policy for data source V1 is "legacy".
## How was this patch tested?
Unit test
Closes#25453 from gengliangwang/tableInsertRule.
Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Added proper message instead of NPE for invalid Dataset operations (e.g. calling actions inside of transformations) similar to SPARK-5063 for RDD
### Why are the changes needed?
To report the user about the exact issue instead of NPE
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Manually tested
```scala
test code snap
"import spark.implicits._
val ds1 = spark.sparkContext.parallelize(1 to 100, 100).toDS()
val ds2 = spark.sparkContext.parallelize(1 to 100, 100).toDS()
ds1.map(x => {
// scalastyle:off
println(ds2.count + x)
x
}).collect()"
```
Closes#25503 from shivusondur/jira28702.
Authored-by: shivusondur <shivusondur@gmail.com>
Signed-off-by: Josh Rosen <rosenville@gmail.com>
### What changes were proposed in this pull request?
This PR aims to annotate `HiveExternalCatalogVersionsSuite` with `ExtendedHiveTest`.
### Why are the changes needed?
`HiveExternalCatalogVersionsSuite` is an outstanding test in terms of testing time. This PR aims to allow skipping this test suite when we use `ExtendedHiveTest`.
![time](https://user-images.githubusercontent.com/9700541/63489184-4c75af00-c466-11e9-9e12-d250d4a23292.png)
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Since Jenkins doesn't exclude `ExtendedHiveTest`, there is no difference in Jenkins testing.
This PR should be tested by manually by the following.
**BEFORE**
```
$ cd sql/hive
$ mvn package -Dtest=none -DwildcardSuites=org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite -Dtest.exclude.tags=org.apache.spark.tags.ExtendedHiveTest
...
Run starting. Expected test count is: 1
HiveExternalCatalogVersionsSuite:
22:32:16.218 WARN org.apache.hadoop.util.NativeCodeLoader: Unable to load ...
```
**AFTER**
```
$ cd sql/hive
$ mvn package -Dtest=none -DwildcardSuites=org.apache.spark.sql.hive.HiveExternalCatalogVersionsSuite -Dtest.exclude.tags=org.apache.spark.tags.ExtendedHiveTest
...
Run starting. Expected test count is: 0
HiveExternalCatalogVersionsSuite:
Run completed in 772 milliseconds.
Total number of tests run: 0
Suites: completed 2, aborted 0
Tests: succeeded 0, failed 0, canceled 0, ignored 0, pending 0
No tests were executed.
...
```
Closes#25550 from dongjoon-hyun/SPARK-28847.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Fix minor typo in SQLConf.
`FILE_COMRESSION_FACTOR` -> `FILE_COMPRESSION_FACTOR`
### Why are the changes needed?
Make conf more understandable.
### Does this PR introduce any user-facing change?
No. (`spark.sql.sources.fileCompressionFactor` is unchanged.)
### How was this patch tested?
Pass the Jenkins with the existing tests.
Closes#25538 from triplesheep/TYPO-FIX.
Authored-by: triplesheep <triplesheep0419@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR adds a simple cost model and a mechanism to compare the costs of the before and after plans of each re-optimization in Adaptive Query Execution. Now the workflow of AQE re-optimization is changed to: If the cost of the plan after re-optimization is lower than or equal to that of the plan before re-optimization and the plan has been changed after re-optimization (if equal), the current physical plan will be updated to the plan after re-optimization, otherwise it will remain unchanged until the next re-optimization.
### Why are the changes needed?
This new mechanism is to prevent regressions in Adaptive Query Execution caused by change of the plan introducing extra cost, in this PR specifically, change of SMJ to BHJ leading to extra `ShuffleExchangeExec`s.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Added UT.
Closes#25456 from maryannxue/aqe-cost.
Authored-by: maryannxue <maryannxue@apache.org>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
<!--
Thanks for sending a pull request! Here are some tips for you:
1. If this is your first time, please read our contributor guidelines: https://spark.apache.org/contributing.html
2. Ensure you have added or run the appropriate tests for your PR: https://spark.apache.org/developer-tools.html
3. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP][SPARK-XXXX] Your PR title ...'.
4. Be sure to keep the PR description updated to reflect all changes.
5. Please write your PR title to summarize what this PR proposes.
6. If possible, provide a concise example to reproduce the issue for a faster review.
-->
### What changes were proposed in this pull request?
<!--
Please clarify what changes you are proposing. The purpose of this section is to outline the changes and how this PR fixes the issue.
If possible, please consider writing useful notes for better and faster reviews in your PR. See the examples below.
1. If you refactor some codes with changing classes, showing the class hierarchy will help reviewers.
2. If you fix some SQL features, you can provide some references of other DBMSes.
3. If there is design documentation, please add the link.
4. If there is a discussion in the mailing list, please add the link.
-->
When running CTAS/RTAS, use the nullable schema of the input query to create the table.
### Why are the changes needed?
<!--
Please clarify why the changes are needed. For instance,
1. If you propose a new API, clarify the use case for a new API.
2. If you fix a bug, you can clarify why it is a bug.
-->
It's very likely to run CTAS/RTAS with non-nullable input query, e.g. `CREATE TABLE t AS SELECT 1`. However, it's surprising to users if they can't write null to this table later. Non-nullable is kind of a constraint of the column and should be specified by users explicitly.
For reference, Postgres also use nullable schema for CTAS:
```
> create table t1(i int not null);
> insert into t1 values (1);
> create table t2 as select i from t1;
> \d+ t1;
Column | Type | Collation | Nullable | Default | Storage | Stats target | Description
--------+---------+-----------+----------+---------+---------+--------------+-------------
i | integer | | not null | | plain | |
> \d+ t2;
Column | Type | Collation | Nullable | Default | Storage | Stats target | Description
--------+---------+-----------+----------+---------+---------+--------------+-------------
i | integer | | | | plain | |
```
File source V1 has the same behavior.
### Does this PR introduce any user-facing change?
<!--
If yes, please clarify the previous behavior and the change this PR proposes - provide the console output, description and/or an example to show the behavior difference if possible.
If no, write 'No'.
-->
Yes, after this PR CTAS/RTAS creates tables with nullable schema, then users can insert null values later.
### How was this patch tested?
<!--
If tests were added, say they were added here. Please make sure to add some test cases that check the changes thoroughly including negative and positive cases if possible.
If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future.
If tests were not added, please describe why they were not added and/or why it was difficult to add.
-->
new test
Closes#25536 from cloud-fan/ctas.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
<!--
Thanks for sending a pull request! Here are some tips for you:
1. If this is your first time, please read our contributor guidelines: https://spark.apache.org/contributing.html
2. Ensure you have added or run the appropriate tests for your PR: https://spark.apache.org/developer-tools.html
3. If the PR is unfinished, add '[WIP]' in your PR title, e.g., '[WIP][SPARK-XXXX] Your PR title ...'.
4. Be sure to keep the PR description updated to reflect all changes.
5. Please write your PR title to summarize what this PR proposes.
6. If possible, provide a concise example to reproduce the issue for a faster review.
-->
### What changes were proposed in this pull request?
<!--
Please clarify what changes you are proposing. The purpose of this section is to outline the changes and how this PR fixes the issue.
If possible, please consider writing useful notes for better and faster reviews in your PR. See the examples below.
1. If you refactor some codes with changing classes, showing the class hierarchy will help reviewers.
2. If you fix some SQL features, you can provide some references of other DBMSes.
3. If there is design documentation, please add the link.
4. If there is a discussion in the mailing list, please add the link.
-->
The current namespace/catalog should be set to None at the beginning, so that we can read the new configs when reporting currennt namespace/catalog later.
### Why are the changes needed?
<!--
Please clarify why the changes are needed. For instance,
1. If you propose a new API, clarify the use case for a new API.
2. If you fix a bug, you can clarify why it is a bug.
-->
Fix a bug in CatalogManager, to reflect the change of default catalog config when reporting current catalog.
### Does this PR introduce any user-facing change?
<!--
If yes, please clarify the previous behavior and the change this PR proposes - provide the console output, description and/or an example to show the behavior difference if possible.
If no, write 'No'.
-->
No. The current namespace/catalog stuff is still internal right now.
### How was this patch tested?
<!--
If tests were added, say they were added here. Please make sure to add some test cases that check the changes thoroughly including negative and positive cases if possible.
If it was tested in a way different from regular unit tests, please clarify how you tested step by step, ideally copy and paste-able, so that other reviewers can test and check, and descendants can verify in the future.
If tests were not added, please describe why they were not added and/or why it was difficult to add.
-->
a new test suite
Closes#25521 from cloud-fan/fix.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Burak Yavuz <brkyvz@gmail.com>
## What changes were proposed in this pull request?
Disable using radix sort in ShuffleExchangeExec when we do repartition.
In #20393, we fixed the indeterministic result in the shuffle repartition case by performing a local sort before repartitioning.
But for the newly added sort operation, we use radix sort which is wrong because binary data can't be compared by only the prefix. This makes the sort unstable and fails to solve the indeterminate shuffle output problem.
### Why are the changes needed?
Fix the correctness bug caused by repartition after a shuffle.
### Does this PR introduce any user-facing change?
Yes, user will get the right result in the case of repartition stage rerun.
## How was this patch tested?
Test with `local-cluster[5, 2, 5120]`, use the integrated test below, it can return a right answer 100000000.
```
import scala.sys.process._
import org.apache.spark.TaskContext
val res = spark.range(0, 10000 * 10000, 1).map{ x => (x % 1000, x)}
// kill an executor in the stage that performs repartition(239)
val df = res.repartition(113).map{ x => (x._1 + 1, x._2)}.repartition(239).map { x =>
if (TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId < 1 && TaskContext.get.stageAttemptNumber == 0) {
throw new Exception("pkill -f -n java".!!)
}
x
}
val r2 = df.distinct.count()
```
Closes#25491 from xuanyuanking/SPARK-28699-fix.
Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
## What changes were proposed in this pull request?
Introduces the collectInPlanAndSubqueries and subqueriesAll methods in QueryPlan that consider all the plans in the query plan, including the ones in nested subqueries.
## How was this patch tested?
Unit test added
Closes#25433 from dbaliafroozeh/IntroduceCollectInPlanAndSubqueries.
Authored-by: Ali Afroozeh <ali.afroozeh@databricks.com>
Signed-off-by: herman <herman@databricks.com>
### What changes were proposed in this pull request?
The rule ReuseExchange optimization rule will look for instances of Exchange that have the same plan and convert dedupe them to them to a ReuseExchangeExec instance. In the current Spark codebase all Exchange instances are row based, but if we use the spark.sql.extensions config to put in our own columnar based exchange implementation reuse will throw an exception saying that there was a columnar mismatch.
### Why are the changes needed?
Without it Reused Columnar Exchanges throw an exception
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
I tested this patch by running it against a query that was showing this exact issue and it fixed it.
I also added a very simple unit test that shows the issue.
Closes#25499 from revans2/reused-columnar-exchange.
Authored-by: Robert (Bobby) Evans <bobby@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?
This PR adds a V1 fallback interface for writing to V2 Tables using V1 Writer interfaces. The only supported SaveMode that will be called on the target table will be an Append. The target table must use V2 interfaces such as `SupportsOverwrite` or `SupportsTruncate` to support Overwrite operations. It is up to the target DataSource implementation if this operation can be atomic or not.
We do not support dynamicPartitionOverwrite, as we cannot call a `commit` method that actually cleans up the data in the partitions that were touched through this fallback.
## How was this patch tested?
Will add tests and example implementation after comments + feedback. This is a proposal at this point.
Closes#25348 from brkyvz/v1WriteFallback.
Lead-authored-by: Burak Yavuz <brkyvz@gmail.com>
Co-authored-by: Burak Yavuz <burak@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
## What changes were proposed in this pull request?
The expression `IntegralDivide`, which corresponds to the `div` operator, support only integral type. Postgres, though, allows it to work also with decimals.
The PR adds the support to decimal operands for this operation in order to have feature parity with postgres.
## How was this patch tested?
added UTs
Closes#25136 from mgaido91/SPARK-28322.
Authored-by: Marco Gaido <marcogaido91@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
This PR changes subquery reuse in Adaptive Query Execution from compile-time static reuse to execution-time dynamic reuse. This PR adds a `ReuseAdaptiveSubquery` rule that applies to a query stage after it is created and before it is executed. The new dynamic reuse enables subqueries to be reused across all different subquery levels.
### Why are the changes needed?
This is an improvement to the current subquery reuse in Adaptive Query Execution, which allows subquery reuse to happen in a lazy fashion as well as at different subquery levels.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Passed existing tests.
Closes#25471 from maryannxue/aqe-dynamic-sub-reuse.
Authored-by: maryannxue <maryannxue@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>