### What changes were proposed in this pull request?
This PR is to clean up the markdown file in SHOW COLUMNS page.
- remove the unneeded embedded inline HTML markup by using the basic markdown syntax.
- use the ``` sql for highlighting the SQL syntax.
### Why are the changes needed?
Make the doc cleaner and easily editable by MD editors.
### Does this PR introduce _any_ user-facing change?
NO
### How was this patch tested?
**Before**
![Screen Shot 2020-04-29 at 5 20 11 PM](https://user-images.githubusercontent.com/11567269/80661963-fa4d4a80-8a44-11ea-9dea-c43cda6de010.png)
**After**
![Screen Shot 2020-04-29 at 6 03 50 PM](https://user-images.githubusercontent.com/11567269/80661940-f15c7900-8a44-11ea-9943-a83e8d8618fb.png)
Closes#28414 from gatorsmile/cleanupShowColumns.
Lead-authored-by: Xiao Li <gatorsmile@gmail.com>
Co-authored-by: gatorsmile <gatorsmile@gmail.com>
Signed-off-by: gatorsmile <gatorsmile@gmail.com>
### What changes were proposed in this pull request?
In the PR, I propose to fix two legacy timestamp formatter `LegacySimpleTimestampFormatter` and `LegacyFastTimestampFormatter` to perform micros rebasing in parsing/formatting from/to strings.
### Why are the changes needed?
Legacy timestamps formatters operate on the hybrid calendar (Julian + Gregorian), so, the input micros should be rebased to have the same date-time fields as in Proleptic Gregorian calendar used by Spark SQL, see SPARK-26651.
### Does this PR introduce any user-facing change?
Yes
### How was this patch tested?
Added tests to `TimestampFormatterSuite`
Closes#28408 from MaxGekk/fix-rebasing-in-legacy-timestamp-formatter.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
In `CTESubstitution`, resolve CTE relations first, then traverse the main plan only once to substitute CTE relations.
### Why are the changes needed?
Currently we will traverse the main query many times (if there are many CTE relations), which can be pretty slow if the main query is large.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
local perf test
```
scala> :pa
// Entering paste mode (ctrl-D to finish)
def test(i: Int): Unit = 1.to(i).foreach { _ =>
spark.sql("""
with
t1 as (select 1),
t2 as (select 1),
t3 as (select 1),
t4 as (select 1),
t5 as (select 1),
t6 as (select 1),
t7 as (select 1),
t8 as (select 1),
t9 as (select 1)
select * from t1, t2, t3, t4, t5, t6, t7, t8, t9""").queryExecution.assertAnalyzed()
}
// Exiting paste mode, now interpreting.
test: (i: Int)Unit
scala> test(10000)
scala> println(org.apache.spark.sql.catalyst.rules.RuleExecutor.dumpTimeSpent)
```
The result before this patch
```
Rule Effective Time / Total Time Effective Runs / Total Runs
CTESubstitution 3328796344 / 3924576425 10000 / 20000
```
The result after this patch
```
Rule Effective Time / Total Time Effective Runs / Total Runs
CTESubstitution 1503085936 / 2091992092 10000 / 20000
```
About 2 times faster.
Closes#28407 from cloud-fan/cte.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
- Rephrase the API doc for `Column.as`
- Simplify the UTs
### Why are the changes needed?
Address comments in https://github.com/apache/spark/pull/28326
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
New UT added.
Closes#28390 from xuanyuanking/SPARK-27340-follow.
Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Although SPARK-30184 Implement a helper method for aliasing functions, developers always forget to using this improvement.
We need to add more powerful guarantees so that aliases outputed by built-in functions are correct.
This PR extracts the SQL from the example of expressions, and output the SQL and its schema into one golden file.
By checking the golden file, we can find the expressions whose aliases are not displayed correctly, and then fix them.
### Why are the changes needed?
Ensure that the output alias is correct
### Does this PR introduce any user-facing change?
'No'.
### How was this patch tested?
Jenkins test.
Closes#28194 from beliefer/check-expression-schema.
Lead-authored-by: beliefer <beliefer@163.com>
Co-authored-by: gengjiaan <gengjiaan@360.cn>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR aims to fix `spark.kubernetes.executor.podNamePrefix` to work.
### Why are the changes needed?
Currently, the configuration is broken like the following.
```
bin/spark-submit \
--master k8s://$K8S_MASTER \
--deploy-mode cluster \
--name spark-pi \
--class org.apache.spark.examples.SparkPi \
-c spark.kubernetes.container.image=spark:pr \
-c spark.kubernetes.driver.pod.name=mypod \
-c spark.kubernetes.executor.podNamePrefix=mypod \
local:///opt/spark/examples/jars/spark-examples_2.12-3.1.0-SNAPSHOT.jar
```
**BEFORE SPARK-31601**
```
pod/mypod 1/1 Running 0 9s
pod/spark-pi-7469dd71c499fafb-exec-1 1/1 Running 0 4s
pod/spark-pi-7469dd71c499fafb-exec-2 1/1 Running 0 4s
```
**AFTER SPARK-31601**
```
pod/mypod 1/1 Running 0 8s
pod/mypod-exec-1 1/1 Running 0 3s
pod/mypod-exec-2 1/1 Running 0 3s
```
### Does this PR introduce any user-facing change?
Yes. This is a bug fix. The conf will work as described in the documentation.
### How was this patch tested?
Pass the Jenkins and run the above comment manually.
Closes#28401 from dongjoon-hyun/SPARK-31601.
Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Prashant Sharma <prashsh1@in.ibm.com>
### What changes were proposed in this pull request?
Because of ebc8fa50d0 and beec8d535f, the SQL output strings for date/timestamp - interval operation will have a malformed format, such as `struct<dateval:date,dateval + (- INTERVAL '2 years 2 months').....`
This PR restore this behavior by adding one `RuntimeReplaceable `implementation for both of the operations to have their pretty SQL strings back.
### Why are the changes needed?
restore the SQL string for datetime operations
### Does this PR introduce any user-facing change?
NO, we are restoring here
### How was this patch tested?
added unit tests
Closes#28402 from yaooqinn/SPARK-31586-F.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
- Add tests for different element types of collections that could be passed to `isInCollection`. Added tests for types that can pass the check `In`.`checkInputDataTypes()`.
- Test different switch thresholds in the `isInCollection: Scala Collection` test.
### Why are the changes needed?
To prevent regressions like introduced by https://github.com/apache/spark/pull/25754 and reverted by https://github.com/apache/spark/pull/28388
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
By existing and new tests in `ColumnExpressionSuite`
Closes#28405 from MaxGekk/test-isInCollection.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR proposes to add a guide to clarify the Spark version when describing "Does this PR introduce any user-facing change?".
### Why are the changes needed?
It seems confusing to write when the user facing changes happen within unreleased branches.
### Does this PR introduce _any_ user-facing change?
No, dev-only.
### How was this patch tested?
Manually tested in Github and it renders find as intended.
Closes#28403 from HyukjinKwon/minor-more-guide.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
We are adding a new Spark Yarn configuration, `spark.yarn.populateHadoopClasspath` to not populate Hadoop classpath from `yarn.application.classpath` and `mapreduce.application.classpath`.
### Why are the changes needed?
Spark Yarn client populates extra Hadoop classpath from `yarn.application.classpath` and `mapreduce.application.classpath` when a job is submitted to a Yarn Hadoop cluster.
However, for `with-hadoop` Spark build that embeds Hadoop runtime, it can cause jar conflicts because Spark distribution can contain different version of Hadoop jars.
One case we have is when a user uses an Apache Spark distribution with its-own embedded hadoop, and submits a job to Cloudera or Hortonworks Yarn clusters, because of two different incompatible Hadoop jars in the classpath, it runs into errors.
By not populating the Hadoop classpath from the clusters can address this issue.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
An UT is added, but very hard to add a new integration test since this requires using different incompatible versions of Hadoop.
We also manually tested this PR, and we are able to submit a Spark job using Spark distribution built with Apache Hadoop 2.10 to CDH 5.6 without populating CDH classpath.
Closes#28376 from dbtsai/yarn-classpath.
Authored-by: DB Tsai <d_tsai@apple.com>
Signed-off-by: DB Tsai <d_tsai@apple.com>
### What changes were proposed in this pull request?
`requireNamespace1` was used to get `SparkR` on CRAN while Suggesting `arrow` while `arrow` was not yet available on CRAN.
### Why are the changes needed?
Now `arrow` is on CRAN, we can properly use `requireNamespace` without triggering CRAN failures.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
AppVeyor will test, and CRAN check in Jenkins build.
Closes#28387 from MichaelChirico/r-require-arrow.
Authored-by: Michael Chirico <michael.chirico@grabtaxi.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
- Check all available legacy formats in the tests added by https://github.com/apache/spark/pull/28345
- Check dates rebasing in legacy parsers for only one direction either days -> string or string -> days.
### Why are the changes needed?
Round trip tests can hide issues in dates rebasing. For example, if we remove rebasing from legacy parsers (from `parse()` and `format()`) the tests will pass.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
By running `DateFormatterSuite`.
Closes#28398 from MaxGekk/test-rebasing-in-legacy-date-formatter.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR intends to update `sql` in `Rand`/`Randn` with no argument to make a column name deterministic.
Before this PR (a column name changes run-by-run):
```
scala> sql("select rand()").show()
+-------------------------+
|rand(7986133828002692830)|
+-------------------------+
| 0.9524061403696937|
+-------------------------+
```
After this PR (a column name fixed):
```
scala> sql("select rand()").show()
+------------------+
| rand()|
+------------------+
|0.7137935639522275|
+------------------+
// If a seed given, it is still shown in a column name
// (the same with the current behaviour)
scala> sql("select rand(1)").show()
+------------------+
| rand(1)|
+------------------+
|0.6363787615254752|
+------------------+
// We can still check a seed in explain output:
scala> sql("select rand()").explain()
== Physical Plan ==
*(1) Project [rand(-2282124938778456838) AS rand()#0]
+- *(1) Scan OneRowRelation[]
```
Note: This fix comes from #28194; the ongoing PR tests the output schema of expressions, so their schemas must be deterministic for the tests.
### Why are the changes needed?
To make output schema deterministic.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Added unit tests.
Closes#28392 from maropu/SPARK-31594.
Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This PR addresses two things:
- `SHOW TBLPROPERTIES` should supports view (a regression introduced by #26921)
- `SHOW TBLPROPERTIES` on a temporary view should return empty result (2.4 behavior instead of throwing `AnalysisException`.
### Why are the changes needed?
It's a bug.
### Does this PR introduce any user-facing change?
Yes, now `SHOW TBLPROPERTIES` works on views:
```
scala> sql("CREATE VIEW view TBLPROPERTIES('p1'='v1', 'p2'='v2') AS SELECT 1 AS c1")
scala> sql("SHOW TBLPROPERTIES view").show(truncate=false)
+---------------------------------+-------------+
|key |value |
+---------------------------------+-------------+
|view.catalogAndNamespace.numParts|2 |
|view.query.out.col.0 |c1 |
|view.query.out.numCols |1 |
|p2 |v2 |
|view.catalogAndNamespace.part.0 |spark_catalog|
|p1 |v1 |
|view.catalogAndNamespace.part.1 |default |
+---------------------------------+-------------+
```
And for a temporary view:
```
scala> sql("CREATE TEMPORARY VIEW tview TBLPROPERTIES('p1'='v1', 'p2'='v2') AS SELECT 1 AS c1")
scala> sql("SHOW TBLPROPERTIES tview").show(truncate=false)
+---+-----+
|key|value|
+---+-----+
+---+-----+
```
### How was this patch tested?
Added tests.
Closes#28375 from imback82/show_tblproperties_followup.
Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
With suggestion from cloud-fan https://github.com/apache/spark/pull/28222#issuecomment-620586933
I Checked with both Presto and PostgresSQL, one is implemented intervals with ANSI style year-month/day-time, and the other is mixed and Non-ANSI. They both add the exceeded days in interval time part to the total days of the operation which extracts day from interval values.
```sql
presto> SELECT EXTRACT(DAY FROM (cast('2020-01-15 00:00:00' as timestamp) - cast('2020-01-01 00:00:00' as timestamp)));
_col0
-------
14
(1 row)
Query 20200428_135239_00000_ahn7x, FINISHED, 1 node
Splits: 17 total, 17 done (100.00%)
0:01 [0 rows, 0B] [0 rows/s, 0B/s]
presto> SELECT EXTRACT(DAY FROM (cast('2020-01-15 00:00:00' as timestamp) - cast('2020-01-01 00:00:01' as timestamp)));
_col0
-------
13
(1 row)
Query 20200428_135246_00001_ahn7x, FINISHED, 1 node
Splits: 17 total, 17 done (100.00%)
0:00 [0 rows, 0B] [0 rows/s, 0B/s]
presto>
```
```sql
postgres=# SELECT EXTRACT(DAY FROM (cast('2020-01-15 00:00:00' as timestamp) - cast('2020-01-01 00:00:00' as timestamp)));
date_part
-----------
14
(1 row)
postgres=# SELECT EXTRACT(DAY FROM (cast('2020-01-15 00:00:00' as timestamp) - cast('2020-01-01 00:00:01' as timestamp)));
date_part
-----------
13
```
```
spark-sql> SELECT EXTRACT(DAY FROM (cast('2020-01-15 00:00:00' as timestamp) - cast('2020-01-01 00:00:01' as timestamp)));
0
spark-sql> SELECT EXTRACT(DAY FROM (cast('2020-01-15 00:00:00' as timestamp) - cast('2020-01-01 00:00:00' as timestamp)));
0
```
In ANSI standard, the day is exact 24 hours, so we don't need to worry about the conceptual day for interval extraction. The meaning of the conceptual day only takes effect when we add it to a zoned timestamp value.
### Why are the changes needed?
Both satisfy the ANSI standard and common use cases in modern SQL platforms
### Does this PR introduce any user-facing change?
No, it new in 3.0
### How was this patch tested?
add more uts
Closes#28396 from yaooqinn/SPARK-31597.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Run sql in spark thrift server, each session 's thrift server about method will be called in one thread, but when running query statement, we have two mode:
1. sync
2. async
5a482e7209/sql/hive-thriftserver/src/main/scala/org/apache/spark/sql/hive/thriftserver/SparkExecuteStatementOperation.scala (L205-L238)
In sync mode, we just submit query in current session's corresponding thread and wait Spark to running query and return result, and the query method will always wait for query return.
In async mode, in SparkExecuteStatementOperation, we will submit query in a backend thread pool, and update operation state, after submitted to backend thread poll, ExecuteStatement method will return a OperationHandle to client side, and client side will request operation status continuously. after backend thread running sql and return , it will update corresponding operation status, when client got operation status is final status, it will got error or start fetching result of this operation.
When we use pyhive connect to SparkThriftServer, it will run statement in sync mode.
When we query data of hive table , it will check serde class in HiveTableScanExec#addColumnMetadataToConf
5a482e7209/sql/hive/src/main/scala/org/apache/spark/sql/hive/execution/HiveTableScanExec.scala (L123)
```
public Class<? extends Deserializer> getDeserializerClass() {
try {
return Class.forName(this.getSerdeClassName(), true, Utilities.getSessionSpecifiedClassLoader());
} catch (ClassNotFoundException var2) {
throw new RuntimeException(var2);
}
}
public static ClassLoader getSessionSpecifiedClassLoader() {
SessionState state = SessionState.get();
if (state != null && state.getConf() != null) {
ClassLoader sessionCL = state.getConf().getClassLoader();
if (sessionCL != null) {
if (LOG.isTraceEnabled()) {
LOG.trace("Use session specified class loader");
}
return sessionCL;
} else {
if (LOG.isDebugEnabled()) {
LOG.debug("Session specified class loader not found, use thread based class loader");
}
return JavaUtils.getClassLoader();
}
} else {
if (LOG.isDebugEnabled()) {
LOG.debug("Hive Conf not found or Session not initiated, use thread based class loader instead");
}
return JavaUtils.getClassLoader();
}
}
```
Since we run statement in sync mode, it will use HiveSession's SessionState, and use it's conf's classLoader. then error happened.
```
Current operation state RUNNING_STATE,
java.lang.RuntimeException: java.lang.ClassNotFoundException:
xxx.xxx.xxxJsonSerDe
at org.apache.hadoop.hive.ql.plan.TableDesc.getDeserializerClass(TableDesc.java:74)
at org.apache.spark.sql.hive.execution.HiveTableScanExec.addColumnMetadataToConf(HiveTableScanExec.scala:123)
at org.apache.spark.sql.hive.execution.HiveTableScanExec.hadoopConf$lzycompute(HiveTableScanExec.scala:101)
at org.apache.spark.sql.hive.execution.HiveTableScanExec.hadoopConf(HiveTableScanExec.scala:98)
at org.apache.spark.sql.hive.execution.HiveTableScanExec.org$apache$spark$sql$hive$execution$HiveTableScanExec$$hadoopReader$lzycompute(HiveTableScanExec.scala:110)
at org.apache.spark.sql.hive.execution.HiveTableScanExec.org$apache$spark$sql$hive$execution$HiveTableScanExec$$hadoopReader(HiveTableScanExec.scala:105)
at org.apache.spark.sql.hive.execution.HiveTableScanExec$$anonfun$11.apply(HiveTableScanExec.scala:192)
at org.apache.spark.sql.hive.execution.HiveTableScanExec$$anonfun$11.apply(HiveTableScanExec.scala:192)
```
We should reset it when we start run sql in sync mode.
### Why are the changes needed?
Fix bug
### Does this PR introduce any user-facing change?
NO
### How was this patch tested?
UT
Closes#26141 from AngersZhuuuu/add_jar_in_sync_mode.
Lead-authored-by: angerszhu <angers.zhu@gmail.com>
Co-authored-by: AngersZhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
In the PR, I propose to simplify the code of `InSet.sql` and create `Literal` instances directly from Catalyst's internal values by using the default `Literal` constructor.
### Why are the changes needed?
This simplifies code and avoids unnecessary conversions to external types.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
By existing test `SPARK-31563: sql of InSet for UTF8String collection` in `ColumnExpressionSuite`.
Closes#28399 from MaxGekk/fix-InSet-sql-followup.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR adds `-Phive` profile to the pre-build phase to build the hive module to dev classpath.
Then reflect the HiveUtils object to dump all configurations in the class.
### Why are the changes needed?
supply SQL configurations from hive module to doc
### Does this PR introduce any user-facing change?
NO
### How was this patch tested?
passing Jenkins
add verified locally
![image](https://user-images.githubusercontent.com/8326978/80492333-6fae1200-8996-11ea-99fd-595ee18c67e5.png)
Closes#28394 from yaooqinn/SPARK-31596.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
This aims to upgrade Rtools first to prepare R 4.0.0 in AppVeyor for Apache Spark 3.1.0.
### Why are the changes needed?
R 4.0.0 is released on April 24th, 2020. It uses Rtools 4.0.0 officially.
- https://cran.r-project.org/doc/manuals/r-release/NEWS.html
- https://stat.ethz.ch/pipermail/r-announce/2020/000653.html
### Does this PR introduce any user-facing change?
No. (This PR aims to test Rtools 4.0.0 in AppVeyor environment.)
### How was this patch tested?
See the AppVeyor result.
Closes#28358 from dongjoon-hyun/SPARK-31567.
Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
https://github.com/apache/spark/pull/27716 introduced parent index for InMemoryStore. When the method "deleteParentIndex(Object key)" in InMemoryStore.java is called and the key is not contained in "NaturalKeys v", A java.lang.NullPointerException will be thrown. This patch fixed the issue by updating the if condition.
### Why are the changes needed?
Fixed a minor bug.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Added a unit test for deleteParentIndex.
Closes#28378 from baohe-zhang/SPARK-31584.
Authored-by: Baohe Zhang <baohe.zhang@verizonmedia.com>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
### What changes were proposed in this pull request?
With https://github.com/apache/spark/pull/28310, the operation of date +/- interval(m, d, 0) has been improved a lot.
According to the benchmark results, about 75% time cost is reduced because of no casting date to timestamp back and forth.
In this PR, we add a benchmark for these operations, and timestamp +/- interval operations as accessories.
### Why are the changes needed?
Performance test coverage, since these operations are missing in the DateTimeBenchmark.
### Does this PR introduce any user-facing change?
No, just test
### How was this patch tested?
regenerated benchmark results
Closes#28369 from yaooqinn/SPARK-31527-F.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This reverts commit 5631a96367.
Closes#28328
### Why are the changes needed?
The PR https://github.com/apache/spark/pull/25754 introduced a bug in `isInCollection`. For example, if the SQL config `spark.sql.optimizer.inSetConversionThreshold`is set to 10 (by default):
```scala
val set = (0 to 20).map(_.toString).toSet
val data = Seq("1").toDF("x")
data.select($"x".isInCollection(set).as("isInCollection")).show()
```
The function must return **'true'** because "1" is in the set of "0" ... "20" but it returns "false":
```
+--------------+
|isInCollection|
+--------------+
| false|
+--------------+
```
### Does this PR introduce any user-facing change?
Yes
### How was this patch tested?
```
$ ./build/sbt "test:testOnly *ColumnExpressionSuite"
```
Closes#28388 from MaxGekk/fix-isInCollection-revert.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
The implementation of TimeSub for the operation of timestamp subtracting interval is almost repetitive with TimeAdd. We can replace it with TimeAdd(l, -r) since there are equivalent.
Suggestion from https://github.com/apache/spark/pull/28310#discussion_r414259239
Besides, the Coercion rules for TimeAdd/TimeSub(date, interval) are useless anymore, so remove them in this PR since they are touched in this PR.
### Why are the changes needed?
remove redundant and useless code for easy maintenance
### Does this PR introduce any user-facing change?
Yes, the SQL string of `datetime - interval` become `datetime + (- interval)`
### How was this patch tested?
modified existing unit tests.
Closes#28381 from yaooqinn/SPARK-31586.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
For regex functions in base R (`gsub`, `grep`, `grepl`, `strsplit`, `gregexpr`), supplying the `fixed=TRUE` option will be more performant.
### Why are the changes needed?
This is a minor fix for performance
### Does this PR introduce any user-facing change?
No (although some internal code was applying fixed-as-regex in some cases that could technically have been over-broad and caught unintended patterns)
### How was this patch tested?
Not
Closes#28367 from MichaelChirico/r-regex-fixed.
Authored-by: Michael Chirico <michael.chirico@grabtaxi.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Add a new logical node AggregateWithHaving, and the parser should create this plan for HAVING. The analyzer resolves it to Filter(..., Aggregate(...)).
### Why are the changes needed?
The SQL parser in Spark creates Filter(..., Aggregate(...)) for the HAVING query, and Spark has a special analyzer rule ResolveAggregateFunctions to resolve the aggregate functions and grouping columns in the Filter operator.
It works for simple cases in a very tricky way as it relies on rule execution order:
1. Rule ResolveReferences hits the Aggregate operator and resolves attributes inside aggregate functions, but the function itself is still unresolved as it's an UnresolvedFunction. This stops resolving the Filter operator as the child Aggrege operator is still unresolved.
2. Rule ResolveFunctions resolves UnresolvedFunction. This makes the Aggrege operator resolved.
3. Rule ResolveAggregateFunctions resolves the Filter operator if its child is a resolved Aggregate. This rule can correctly resolve the grouping columns.
In the example query, I put a CAST, which needs to be resolved by rule ResolveTimeZone, which runs after ResolveAggregateFunctions. This breaks step 3 as the Aggregate operator is unresolved at that time. Then the analyzer starts next round and the Filter operator is resolved by ResolveReferences, which wrongly resolves the grouping columns.
See the demo below:
```
SELECT SUM(a) AS b, '2020-01-01' AS fake FROM VALUES (1, 10), (2, 20) AS T(a, b) GROUP BY b HAVING b > 10
```
The query's result is
```
+---+----------+
| b| fake|
+---+----------+
| 2|2020-01-01|
+---+----------+
```
But if we add CAST, it will return an empty result.
```
SELECT SUM(a) AS b, CAST('2020-01-01' AS DATE) AS fake FROM VALUES (1, 10), (2, 20) AS T(a, b) GROUP BY b HAVING b > 10
```
### Does this PR introduce any user-facing change?
Yes, bug fix for cast in having aggregate expressions.
### How was this patch tested?
New UT added.
Closes#28294 from xuanyuanking/SPARK-31519.
Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This is a followup of [#28022](https://github.com/apache/spark/pull/28022), to add the metric info of split task number for skewed optimization.
With this PR, we can see the number of splits for the skewed partitions as following:
![image](https://user-images.githubusercontent.com/11972570/80294583-ff886c00-879c-11ea-813c-2db302f99f04.png)
### Why are the changes needed?
make UI more friendly
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
existing ut
Closes#28109 from JkSelf/addSplitNumer.
Authored-by: jiake <ke.a.jia@intel.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR aims to use `r-lib/actions/setup-r` because it's more stable and maintained by 3rd party.
### Why are the changes needed?
This will recover the current outage. In addition, this will be more robust in the future.
As of now, this is tested via https://github.com/dongjoon-hyun/spark/pull/17 .
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Pass the GitHub Actions, especially `Linter R` and `Generate Documents`.
Closes#28382 from dongjoon-hyun/SPARK-31589.
Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Repeated `sapply` avoided in internal `checkSchemaInArrow`
### Why are the changes needed?
Current implementation is doubly inefficient:
1. Repeatedly doing the same (95%) `sapply` loop
2. Doing scalar `==` on a vector (`==` should be done over the whole vector for efficiency)
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
By my trusty friend the CI bots
Closes#28372 from MichaelChirico/vectorize-types.
Authored-by: Michael Chirico <michael.chirico@grabtaxi.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
All instances of `paste(..., sep = "")` in the code are replaced with `paste0` which is more performant
### Why are the changes needed?
Performance & consistency (`paste0` is already used extensively in the R package)
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
None
Closes#28374 from MichaelChirico/r-paste0.
Authored-by: Michael Chirico <michael.chirico@grabtaxi.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This is a followup of https://github.com/apache/spark/pull/28318, to make the code more readable, by adding some comments to explain the trick and simplify the code to use a boolean flag instead of 2 string sets.
This PR also fixes various problems:
1. the name check should consider case sensitivity
2. forward name conflicts like `with t as (with t2 as ...), t2 as ...` is not a real conflict and we shouldn't fail.
### Why are the changes needed?
correct the behavior
### Does this PR introduce any user-facing change?
yes, fix the fore-mentioned behaviors.
### How was this patch tested?
new tests
Closes#28371 from cloud-fan/followup.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
Credit to LiangchangZ, this PR reuses the UT as well as integrate test in #24457. Thanks Liangchang for your solid work.
### What changes were proposed in this pull request?
Make metadata propagatable between Aliases.
### Why are the changes needed?
In Structured Streaming, we added an Alias for TimeWindow by default.
590b9a0132/sql/core/src/main/scala/org/apache/spark/sql/functions.scala (L3272-L3273)
For some cases like stream join with watermark and window, users need to add an alias for convenience(we also added one in StreamingJoinSuite). The current metadata handling logic for `as` will lose the watermark metadata
590b9a0132/sql/core/src/main/scala/org/apache/spark/sql/Column.scala (L1049-L1054)
and finally cause the AnalysisException:
```
Stream-stream outer join between two streaming DataFrame/Datasets is not supported without a watermark in the join keys, or a watermark on the nullable side and an appropriate range condition
```
### Does this PR introduce any user-facing change?
Bugfix for an alias on time window with watermark.
### How was this patch tested?
New UTs added. One for the functionality and one for explaining the common scenario.
Closes#28326 from xuanyuanking/SPARK-27340.
Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
Improve documentation for `gapply` in `SparkR`
### Why are the changes needed?
Spent a long time this weekend trying to figure out just what exactly `key` is in `gapply`'s `func`. I had assumed it would be a _named_ list, but apparently not -- the examples are working because `schema` is applying the name and the names of the output `data.frame` don't matter.
As near as I can tell the description I've added is correct, namely, that `key` is an unnamed list.
### Does this PR introduce any user-facing change?
No? Not in code. Only documentation.
### How was this patch tested?
Not. Documentation only
Closes#28350 from MichaelChirico/r-gapply-key-doc.
Authored-by: Michael Chirico <michael.chirico@grabtaxi.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Remove all the extra whitespaces in the formatted explain.
### Why are the changes needed?
The number of extra whitespaces of the formatted explain becomes different between master and branch-3.0. This causes a problem that whenever we backport formatted explain related tests from master to branch-3.0, it will fail branch-3.0. Besides, extra whitespaces are always disallowed in Spark. Thus, we should remove them as possible as we can.
### Does this PR introduce any user-facing change?
No, formatted explain is newly added in Spark 3.0.
### How was this patch tested?
Updated sql query tests.
Closes#28315 from Ngone51/fix_extra_spaces.
Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR proposes to use a different approach instead of breaking it per Micheal's rubric added at https://spark.apache.org/versioning-policy.html. It deprecates the behaviour for now. It will be gradually removed in the future releases.
After this change,
```python
import warnings
warnings.simplefilter("always")
from pyspark.sql.functions import *
df = spark.range(2)
map_col = create_map(lit(0), lit(100), lit(1), lit(200))
df.withColumn("mapped", map_col.getItem(col('id'))).show()
```
```
/.../python/pyspark/sql/column.py:311: DeprecationWarning: A column as 'key' in getItem is
deprecated as of Spark 3.0, and will not be supported in the future release. Use `column[key]`
or `column.key` syntax instead.
DeprecationWarning)
...
```
```python
import warnings
warnings.simplefilter("always")
from pyspark.sql.functions import *
df = spark.range(2)
struct_col = struct(lit(0), lit(100), lit(1), lit(200))
df.withColumn("struct", struct_col.getField(lit("col1"))).show()
```
```
/.../spark/python/pyspark/sql/column.py:336: DeprecationWarning: A column as 'name'
in getField is deprecated as of Spark 3.0, and will not be supported in the future release. Use
`column[name]` or `column.name` syntax instead.
DeprecationWarning)
```
### Why are the changes needed?
To prevent the radical behaviour change after the amended versioning policy.
### Does this PR introduce any user-facing change?
Yes, it will show the deprecated warning message.
### How was this patch tested?
Manually tested.
Closes#28327 from HyukjinKwon/SPARK-29664.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
To follow ANSI,the expressions - `date + interval`, `interval + date` and `date - interval` should only accept intervals which the `microseconds` part is 0.
### Why are the changes needed?
Better ANSI compliance
### Does this PR introduce any user-facing change?
No, this PR should target 3.0.0 in which this feature is newly added.
### How was this patch tested?
add more unit tests
Closes#28310 from yaooqinn/SPARK-31527.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR modifies LegacyDateFormatter#parse to return proleptic Gregorian days rather than hybrid Julian days.
### Why are the changes needed?
The legacy time parser currently returns epoch days in the hybrid Julian calendar. However, the callers to the legacy parser (e.g., UnivocityParser, JacksonParser) expect epoch days in the proleptic Gregorian calendar. As a result, pre-Gregorian dates like '1000-01-01' get interpreted as '1000-01-06'.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Manual testing and modified existing unit tests.
Closes#28345 from bersprockets/SPARK-31557.
Authored-by: Bruce Robbins <bersprockets@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
`CliSuite.runCliWithin` was not matching for expected results correctly. It was matching for expected lines anywhere in stdout or stderr.
On the example of `Single command with --database` test:
In
```
runCliWithin(2.minute)(
"CREATE DATABASE hive_db_test;"
-> "",
"USE hive_test;"
-> "",
"CREATE TABLE hive_test(key INT, val STRING);"
-> "",
"SHOW TABLES;"
-> "hive_test"
)
```
It was looking for lines containing "", "", "" and then "hive_test".
However, the string "hive_test" was contained in "hive_test_db", and hence:
```
2020-04-08 17:53:12,752 INFO CliSuite - 2020-04-08 17:53:12.752 - stderr> Spark master: local, Application Id: local-1586368384172
2020-04-08 17:53:12,765 INFO CliSuite - stderr> found expected output line 0: ""
2020-04-08 17:53:12,765 INFO CliSuite - 2020-04-08 17:53:12.765 - stdout> spark-sql> CREATE DATABASE hive_db_test;
2020-04-08 17:53:12,765 INFO CliSuite - stdout> found expected output line 1: ""
2020-04-08 17:53:17,688 INFO CliSuite - 2020-04-08 17:53:17.688 - stderr> chgrp: changing ownership of 'file:///tmp/spark-8811f069-4cba-4c71-a5d6-62dd925fb5ff': chown: changing group of '/tmp/spark-8811f069-4cba-4c71-a5d6-62dd925fb5ff': Operation not permitted
2020-04-08 17:53:12,765 INFO CliSuite - stderr> found expected output line 2: ""
2020-04-08 17:53:18,069 INFO CliSuite - 2020-04-08 17:53:18.069 - stderr> Time taken: 5.265 seconds
2020-04-08 17:53:18,087 INFO CliSuite - 2020-04-08 17:53:18.087 - stdout> spark-sql> USE hive_test;
2020-04-08 17:53:12,765 INFO CliSuite - stdout> found expected output line 3: "hive_test"
2020-04-08 17:53:21,742 INFO CliSuite - Found all expected output.
```
Because of that, it could kill the CLI process without really even creating the table. This was not expected. The test could be flaky depending on whether process.destroy() in the finally block managed to kill it before it actually creates the table.
I make the output checking more robust to not match on unexpected output, by making it check the echo of query output on the CLI. Also, wait for the CLI process to finish gracefully (triggered by closing its stdin), instead of killing it forcibly.
### Why are the changes needed?
org.apache.spark.sql.hive.thriftserver.CliSuite was flaky, and didn't test outputs as expected.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Existing tests in CLISuite. Tested several times with no flakiness. Was getting flaky results almost on every run before.
```
[info] CliSuite:
[info] - load warehouse dir from hive-site.xml (12 seconds, 568 milliseconds)
[info] - load warehouse dir from --hiveconf (10 seconds, 648 milliseconds)
[info] - load warehouse dir from --conf spark(.hadoop).hive.* (20 seconds, 653 milliseconds)
[info] - load warehouse dir from spark.sql.warehouse.dir (9 seconds, 763 milliseconds)
[info] - Simple commands (16 seconds, 238 milliseconds)
[info] - Single command with -e (9 seconds, 967 milliseconds)
[info] - Single command with --database (21 seconds, 205 milliseconds)
[info] - Commands using SerDe provided in --jars (15 seconds, 51 milliseconds)
[info] - SPARK-29022: Commands using SerDe provided in --hive.aux.jars.path (14 seconds, 625 milliseconds)
[info] - SPARK-11188 Analysis error reporting (7 seconds, 960 milliseconds)
[info] - SPARK-11624 Spark SQL CLI should set sessionState only once (7 seconds, 424 milliseconds)
[info] - list jars (9 seconds, 520 milliseconds)
[info] - list jar <jarfile> (9 seconds, 277 milliseconds)
[info] - list files (9 seconds, 828 milliseconds)
[info] - list file <filepath> (9 seconds, 646 milliseconds)
[info] - apply hiveconf from cli command (9 seconds, 469 milliseconds)
[info] - Support hive.aux.jars.path (10 seconds, 676 milliseconds)
[info] - SPARK-28840 test --jars command (10 seconds, 921 milliseconds)
[info] - SPARK-28840 test --jars and hive.aux.jars.path command (11 seconds, 49 milliseconds)
[info] - SPARK-29022 Commands using SerDe provided in ADD JAR sql (14 seconds, 210 milliseconds)
[info] - SPARK-26321 Should not split semicolon within quoted string literals (12 seconds, 729 milliseconds)
[info] - Pad Decimal numbers with trailing zeros to the scale of the column (10 seconds, 381 milliseconds)
[info] - SPARK-30049 Should not complain for quotes in commented lines (10 seconds, 935 milliseconds)
[info] - SPARK-30049 Should not complain for quotes in commented with multi-lines (20 seconds, 731 milliseconds)
```
Closes#28156 from juliuszsompolski/SPARK-31388.
Authored-by: Juliusz Sompolski <julek@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Fix Pyspark CrossValidator/TrainValidationSplit with pipeline estimator cannot save and load model.
Most pyspark estimators/transformers inherit `JavaParams`, but some estimators are special (in order to support pure python implemented nested estimators/transformers):
* Pipeline
* OneVsRest
* CrossValidator
* TrainValidationSplit
But note that, currently, in pyspark, estimators listed above, their model reader/writer do NOT support pure python implemented nested estimators/transformers. Because they use java reader/writer wrapper as python side reader/writer.
Pyspark CrossValidator/TrainValidationSplit model reader/writer require all estimators define the `_transfer_param_map_to_java` and `_transfer_param_map_from_java` (used in model read/write).
OneVsRest class already defines the two methods, but Pipeline do not, so it lead to this bug.
In this PR I add `_transfer_param_map_to_java` and `_transfer_param_map_from_java` into Pipeline class.
### Why are the changes needed?
Bug fix.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Unit test.
Manually test in pyspark shell:
1) CrossValidator with Simple Pipeline estimator
```
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.feature import HashingTF, Tokenizer
from pyspark.ml.tuning import CrossValidator, CrossValidatorModel, ParamGridBuilder
training = spark.createDataFrame([
(0, "a b c d e spark", 1.0),
(1, "b d", 0.0),
(2, "spark f g h", 1.0),
(3, "hadoop mapreduce", 0.0),
(4, "b spark who", 1.0),
(5, "g d a y", 0.0),
(6, "spark fly", 1.0),
(7, "was mapreduce", 0.0),
], ["id", "text", "label"])
# Configure an ML pipeline, which consists of tree stages: tokenizer, hashingTF, and lr.
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features")
lr = LogisticRegression(maxIter=10)
pipeline = Pipeline(stages=[tokenizer, hashingTF, lr])
paramGrid = ParamGridBuilder() \
.addGrid(hashingTF.numFeatures, [10, 100, 1000]) \
.addGrid(lr.regParam, [0.1, 0.01]) \
.build()
crossval = CrossValidator(estimator=pipeline,
estimatorParamMaps=paramGrid,
evaluator=BinaryClassificationEvaluator(),
numFolds=2) # use 3+ folds in practice
# Run cross-validation, and choose the best set of parameters.
cvModel = crossval.fit(training)
cvModel.save('/tmp/cv_model001')
CrossValidatorModel.load('/tmp/cv_model001')
```
2) CrossValidator with Pipeline estimator which include a OneVsRest estimator stage, and OneVsRest estimator nest a LogisticRegression estimator.
```
from pyspark.ml.linalg import Vectors
from pyspark.ml import Estimator, Model
from pyspark.ml.classification import LogisticRegression, LogisticRegressionModel, OneVsRest
from pyspark.ml.evaluation import BinaryClassificationEvaluator, \
MulticlassClassificationEvaluator, RegressionEvaluator
from pyspark.ml.linalg import Vectors
from pyspark.ml.param import Param, Params
from pyspark.ml.tuning import CrossValidator, CrossValidatorModel, ParamGridBuilder, \
TrainValidationSplit, TrainValidationSplitModel
from pyspark.sql.functions import rand
from pyspark.testing.mlutils import SparkSessionTestCase
dataset = spark.createDataFrame(
[(Vectors.dense([0.0]), 0.0),
(Vectors.dense([0.4]), 1.0),
(Vectors.dense([0.5]), 0.0),
(Vectors.dense([0.6]), 1.0),
(Vectors.dense([1.0]), 1.0)] * 10,
["features", "label"])
ova = OneVsRest(classifier=LogisticRegression())
lr1 = LogisticRegression().setMaxIter(100)
lr2 = LogisticRegression().setMaxIter(150)
grid = ParamGridBuilder().addGrid(ova.classifier, [lr1, lr2]).build()
evaluator = MulticlassClassificationEvaluator()
pipeline = Pipeline(stages=[ova])
cv = CrossValidator(estimator=pipeline, estimatorParamMaps=grid, evaluator=evaluator)
cvModel = cv.fit(dataset)
cvModel.save('/tmp/model002')
cvModel2 = CrossValidatorModel.load('/tmp/model002')
```
TrainValidationSplit testing code are similar so I do not paste them.
Closes#28279 from WeichenXu123/fix_pipeline_tuning.
Authored-by: Weichen Xu <weichen.xu@databricks.com>
Signed-off-by: Xiangrui Meng <meng@databricks.com>
### What changes were proposed in this pull request?
Convert `java.time.LocalDate` to `java.sql.Date` in pushed down filters to ORC datasource when Java 8 time API enabled.
Closes#28272
### Why are the changes needed?
The changes fix the exception raised while pushing date filters when `spark.sql.datetime.java8API.enabled` is set to `true`:
```
Wrong value class java.time.LocalDate for DATE.EQUALS leaf
java.lang.IllegalArgumentException: Wrong value class java.time.LocalDate for DATE.EQUALS leaf
at org.apache.hadoop.hive.ql.io.sarg.SearchArgumentImpl$PredicateLeafImpl.checkLiteralType(SearchArgumentImpl.java:192)
at org.apache.hadoop.hive.ql.io.sarg.SearchArgumentImpl$PredicateLeafImpl.<init>(SearchArgumentImpl.java:75)
at org.apache.hadoop.hive.ql.io.sarg.SearchArgumentImpl$BuilderImpl.equals(SearchArgumentImpl.java:352)
at org.apache.spark.sql.execution.datasources.orc.OrcFilters$.buildLeafSearchArgument(OrcFilters.scala:229)
```
### Does this PR introduce any user-facing change?
Yes
### How was this patch tested?
Added tests to `OrcFilterSuite`.
Closes#28261 from MaxGekk/orc-date-filter-pushdown.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This PR fixes a CTE substitution issue so as to the following SQL return the correct empty result:
```
WITH t(c) AS (SELECT 1)
SELECT * FROM t
WHERE c IN (
WITH t(c) AS (SELECT 2)
SELECT * FROM t
)
```
Before this PR the result was `1`.
### Why are the changes needed?
To fix a correctness issue.
### Does this PR introduce any user-facing change?
Yes, fixes a correctness issue.
### How was this patch tested?
Added new test case.
Closes#28318 from peter-toth/SPARK-31535-fix-nested-cte-substitution.
Authored-by: Peter Toth <peter.toth@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>