### What changes were proposed in this pull request?
This PR adds 2 changes regarding exception handling in `SQLQueryTestSuite` and `ThriftServerQueryTestSuite`
- fixes an expected output sorting issue in `ThriftServerQueryTestSuite` as if there is an exception then there is no need for sort
- introduces common exception handling in those 2 suites with a new `handleExceptions` method
### Why are the changes needed?
Currently `ThriftServerQueryTestSuite` passes on master, but it fails on one of my PRs (https://github.com/apache/spark/pull/23531) with this error (https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/111651/testReport/org.apache.spark.sql.hive.thriftserver/ThriftServerQueryTestSuite/sql_3/):
```
org.scalatest.exceptions.TestFailedException: Expected "
[Recursion level limit 100 reached but query has not exhausted, try increasing spark.sql.cte.recursion.level.limit
org.apache.spark.SparkException]
", but got "
[org.apache.spark.SparkException
Recursion level limit 100 reached but query has not exhausted, try increasing spark.sql.cte.recursion.level.limit]
" Result did not match for query #4 WITH RECURSIVE r(level) AS ( VALUES (0) UNION ALL SELECT level + 1 FROM r ) SELECT * FROM r
```
The unexpected reversed order of expected output (error message comes first, then the exception class) is due to this line: https://github.com/apache/spark/pull/26028/files#diff-b3ea3021602a88056e52bf83d8782de8L146. It should not sort the expected output if there was an error during execution.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Existing UTs.
Closes#26028 from peter-toth/SPARK-29359-better-exception-handling.
Authored-by: Peter Toth <peter.toth@gmail.com>
Signed-off-by: Yuming Wang <wgyumg@gmail.com>
### What changes were proposed in this pull request?
This PR is a very minor follow-up to become robust because `spark.sql.additionalRemoteRepositories` is a configuration which has a comma-separated value.
### Why are the changes needed?
This makes sure that `getHiveContribJar` will not fail on the configuration changes.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Manual. Change the default value with multiple repositories and run the following.
```
build/sbt -Phive "project hive" "test-only org.apache.spark.sql.hive.HiveSparkSubmitSuite"
```
Closes#26096 from dongjoon-hyun/SPARK-27831.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Yuming Wang <wgyumg@gmail.com>
### What changes were proposed in this pull request?
Minor version bump of Netty to patch reported CVE.
Patches: https://www.cvedetails.com/cve/CVE-2019-16869/
### Why are the changes needed?
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Compiled locally using `mvn clean install -DskipTests`
Closes#26099 from Fokko/SPARK-29445.
Authored-by: Fokko Driesprong <fokko@apache.org>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request?
This PR is Adding tooltip for The Executors Tab's column names include RDD Blocks, Disk Used,Cores, Activity Tasks, Failed Tasks , Complete Tasks, Total Tasks in the history server Page.
https://issues.apache.org/jira/browse/SPARK-29323
![image](https://user-images.githubusercontent.com/28332082/66017759-b6c24a80-e50e-11e9-807b-5b076f701d2f.png)
I have modify the following code in executorspage-template.html
Before:
<th>RDD Blocks</th>
<th>Disk Used</th>
<th>Cores</th>
<th>Active Tasks</th>
<th>Failed Tasks</th>
<th>Complete Tasks</th>
<th>Total Tasks</th>
![image](https://user-images.githubusercontent.com/28332082/66018111-4ddbd200-e510-11e9-9cfc-19f3eae81e76.png)
After:
<th><span data-toggle="tooltip" data-placement="top" title="RDD Blocks">RDD Blocks</span></th>
<th><span data-toggle="tooltip" data-placement="top" title="Disk Used">Disk Used</span></th>
<th><span data-toggle="tooltip" data-placement="top" title="Cores">Cores</span></th>
<th><span data-toggle="tooltip" data-placement="top" title="Active Tasks">Active Tasks</span></th>
<th><span data-toggle="tooltip" data-placement="top" title="Failed Tasks">Failed Tasks</span></th>
<th><span data-toggle="tooltip" data-placement="top" title="Complete Tasks">Complete Tasks</span></th>
<th><span data-toggle="tooltip" data-placement="top" title="Total Tasks">Total Tasks</span></th>
![image](https://user-images.githubusercontent.com/28332082/66018157-79f75300-e510-11e9-96ba-6230aa0940c7.png)
### Why are the changes needed?
the spark Executors of history Tab page, the Summary part shows the line in the list of title, but format is irregular.
Some column names have tooltip, such as Storage Memory, Task Time(GC Time), Input, Shuffle Read,
Shuffle Write and Blacklisted, but there are still some list names that have not tooltip. They are RDD Blocks, Disk Used,Cores, Activity Tasks, Failed Tasks , Complete Tasks and Total Tasks. oddly, Executors section below,All the column names Contains the column names above have tooltip .
It's important for open source projects to have consistent style and user-friendly UI, and I'm working on keeping it consistent And more user-friendly.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Manual tests for Chrome, Firefox and Safari
Authored-by: liucht-inspur <liuchtinspur.com>
Signed-off-by: liucht-inspur <liuchtinspur.com>
Closes#25994 from liucht-inspur/master.
Authored-by: liucht <liucht@inspur.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request?
Revert this commit 18b7ad2fc5.
### Why are the changes needed?
See https://github.com/apache/spark/pull/16304#discussion_r92753590
### Does this PR introduce any user-facing change?
Yes
### How was this patch tested?
There is no test for that.
Closes#26101 from MaxGekk/revert-mean-seconds-per-month.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request?
get the first row lazily, and reuse it for each vector column.
### Why are the changes needed?
avoid unnecssary `first` jobs
### Does this PR introduce any user-facing change?
no
### How was this patch tested?
existing testsuites & local tests in repl
Closes#26052 from zhengruifeng/rformula_lazy_row.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
This PR updates commons-beanutils to 1.9.4.
### Why are the changes needed?
CVE fixed in 1.9.4: http://commons.apache.org/proper/commons-beanutils/javadocs/v1.9.4/RELEASE-NOTES.txt
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Existing UTs.
Closes#26069 from peter-toth/SPARK-29410-update-commons-beanutils-to-1.9.4.
Authored-by: Peter Toth <peter.toth@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request?
- Move tree related classes to a separate file ```tree.py```
- add method ```predictLeaf``` in ```DecisionTreeModel```& ```TreeEnsembleModel```
### Why are the changes needed?
- keep parity between scala and python
- easy code maintenance
### Does this PR introduce any user-facing change?
Yes
add method ```predictLeaf``` in ```DecisionTreeModel```& ```TreeEnsembleModel```
add ```setMinWeightFractionPerNode``` in ```DecisionTreeClassifier``` and ```DecisionTreeRegressor```
### How was this patch tested?
add some doc tests
Closes#25929 from huaxingao/spark_29116.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
Added tests for grouped map pandas_udf using a window.
### Why are the changes needed?
Current tests for grouped map do not use a window and this had previously caused an error due the window range being a struct column, which was not yet supported.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
New tests added.
Closes#26063 from BryanCutler/pyspark-pandas_udf-group-with-window-tests-SPARK-29402.
Authored-by: Bryan Cutler <cutlerb@gmail.com>
Signed-off-by: Bryan Cutler <cutlerb@gmail.com>
### What changes were proposed in this pull request?
Add documentation to SQL programming guide to use PyArrow >= 0.15.0 with current versions of Spark.
### Why are the changes needed?
Arrow 0.15.0 introduced a change in format which requires an environment variable to maintain compatibility.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Ran pandas_udfs tests using PyArrow 0.15.0 with environment variable set.
Closes#26045 from BryanCutler/arrow-document-legacy-IPC-fix-SPARK-29367.
Authored-by: Bryan Cutler <cutlerb@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
This adds an entry about PrometheusServlet to the documentation, following SPARK-29032
### Why are the changes needed?
The monitoring documentation lists all the available metrics sinks, this should be added to the list for completeness.
Closes#26081 from LucaCanali/FollowupSpark29032.
Authored-by: Luca Canali <luca.canali@cern.ch>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
[SPARK-29064](https://github.com/apache/spark/pull/25770) introduced `PrometheusResource` to expose `ExecutorSummary`. This PR aims to improve it further more `Prometheus`-friendly to use [Prometheus labels](https://prometheus.io/docs/concepts/data_model/#metric-names-and-labels).
### Why are the changes needed?
**BEFORE**
```
metrics_app_20191008151432_0000_driver_executor_rddBlocks_Count 0
metrics_app_20191008151432_0000_driver_executor_memoryUsed_Count 0
metrics_app_20191008151432_0000_driver_executor_diskUsed_Count 0
```
**AFTER**
```
$ curl -s http://localhost:4040/metrics/executors/prometheus/ | head -n3
metrics_executor_rddBlocks_Count{application_id="app-20191008151625-0000", application_name="Spark shell", executor_id="driver"} 0
metrics_executor_memoryUsed_Count{application_id="app-20191008151625-0000", application_name="Spark shell", executor_id="driver"} 0
metrics_executor_diskUsed_Count{application_id="app-20191008151625-0000", application_name="Spark shell", executor_id="driver"} 0
```
### Does this PR introduce any user-facing change?
No, but `Prometheus` understands the new format and shows more intelligently.
<img width="735" alt="ui" src="https://user-images.githubusercontent.com/9700541/66438279-1756f900-e9e1-11e9-91c7-c04c6ce9172f.png">
### How was this patch tested?
Manually.
**SETUP**
```
$ sbin/start-master.sh
$ sbin/start-slave.sh spark://`hostname`:7077
$ bin/spark-shell --master spark://`hostname`:7077 --conf spark.ui.prometheus.enabled=true
```
**RESULT**
```
$ curl -s http://localhost:4040/metrics/executors/prometheus/ | head -n3
metrics_executor_rddBlocks_Count{application_id="app-20191008151625-0000", application_name="Spark shell", executor_id="driver"} 0
metrics_executor_memoryUsed_Count{application_id="app-20191008151625-0000", application_name="Spark shell", executor_id="driver"} 0
metrics_executor_diskUsed_Count{application_id="app-20191008151625-0000", application_name="Spark shell", executor_id="driver"} 0
```
Closes#26060 from dongjoon-hyun/SPARK-29400.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Update ANSI mode related config names in comments as "spark.sql.ansi.enabled"
### Why are the changes needed?
The removed configuration `spark.sql.parser.ansi.enabled` and `spark.sql.failOnIntegralTypeOverflow` still exist in code comments.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Grep the whole code to ensure the remove config names no longer exist.
```
git grep "parser.ansi.enabled"
git grep failOnIntegralTypeOverflow
git grep decimalOperationsNullOnOverflow
git grep ANSI_SQL_PARSER
git grep FAIL_ON_INTEGRAL_TYPE_OVERFLOW
git grep DECIMAL_OPERATIONS_NULL_ON_OVERFLOW
```
Closes#26067 from gengliangwang/spark-28989-followup.
Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR proposes to use Arrow R 0.14.1 for now in AppVeyor to make tests passed.
### Why are the changes needed?
To make build passed with Arrow. It doesn't work with setting `ARROW_PRE_0_15_IPC_FORMAT` to `1` to allow Arrow R 0.15 compatibility.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
AppVeyor
Closes#26041 from HyukjinKwon/investigate.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Use `.sameElements` to compare (non-nested) arrays, as `Arrays.deep` is removed in 2.13 and wasn't the best way to do this in the first place.
### Why are the changes needed?
To compile with 2.13.
### Does this PR introduce any user-facing change?
None.
### How was this patch tested?
Existing tests.
Closes#26073 from srowen/SPARK-29416.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Rewrite declaration of internal `ThreadUtils.parmap` method to avoid `TraversableLike`, which is removed in Scala 2.13.
### Why are the changes needed?
To compile in Scala 2.13.
### Does this PR introduce any user-facing change?
None.
### How was this patch tested?
Existing tests.
Closes#26072 from srowen/SPARK-29413.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
This is just a followup on https://github.com/apache/spark/pull/26062 -- see it for more detail.
I think we will eventually find more cases of this. It's hard to get them all at once as there are many different types of compile errors in earlier modules. I'm trying to address them in as a big a chunk as possible.
Closes#26074 from srowen/SPARK-29401.2.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Replace `Unit` with equivalent `()` where code refers to the `Unit` companion object.
### Why are the changes needed?
It doesn't compile otherwise in Scala 2.13.
- https://github.com/scala/scala/blob/v2.13.0/src/library/scala/Unit.scala#L30
### Does this PR introduce any user-facing change?
Should be no behavior change at all.
### How was this patch tested?
Existing tests.
Closes#26070 from srowen/SPARK-29411.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR adds an accumulator that computes a global aggregate over a number of rows. A user can define an arbitrary number of aggregate functions which can be computed at the same time.
The accumulator uses the standard technique for implementing (interpreted) aggregation in Spark. It uses projections and manual updates for each of the aggregation steps (initialize buffer, update buffer with new input row, merge two buffers and compute the final result on the buffer). Note that two of the steps (update and merge) use the aggregation buffer both as input and output.
Accumulators do not have an explicit point at which they get serialized. A somewhat surprising side effect is that the buffers of a `TypedImperativeAggregate` go over the wire as-is instead of serializing them. The merging logic for `TypedImperativeAggregate` assumes that the input buffer contains serialized buffers, this is violated by the accumulator's implicit serialization. In order to get around this I have added `mergeBuffersObjects` method that merges two unserialized buffers to `TypedImperativeAggregate`.
### Why are the changes needed?
This is the mechanism we are going to use to implement observable metrics.
### Does this PR introduce any user-facing change?
No, not yet.
### How was this patch tested?
Added `AggregatingAccumulator` test suite.
Closes#26012 from hvanhovell/SPARK-29346.
Authored-by: herman <herman@databricks.com>
Signed-off-by: herman <herman@databricks.com>
### What changes were proposed in this pull request?
The commit 4e6d31f570 changed default behavior of `size()` for the `NULL` input. In this PR, I propose to update the SQL migration guide.
### Why are the changes needed?
To inform users about new behavior of the `size()` function for the `NULL` input.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
N/A
Closes#26066 from MaxGekk/size-null-migration-guide.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
DataSourceV2 Exec classes (ShowTablesExec, ShowNamespacesExec, etc.) all extend LeafExecNode. This results in running a job when executeCollect() is called. This breaks the previous behavior [SPARK-19650](https://issues.apache.org/jira/browse/SPARK-19650).
A new command physical operator will be introduced form which all V2 Exec classes derive to avoid running a job.
### Why are the changes needed?
It is a bug since the current behavior runs a spark job, which breaks the existing behavior: [SPARK-19650](https://issues.apache.org/jira/browse/SPARK-19650).
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Existing unit tests.
Closes#26048 from imback82/dsv2_command.
Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Invocations like `sc.parallelize(Array((1,2)))` cause a compile error in 2.13, like:
```
[ERROR] [Error] /Users/seanowen/Documents/spark_2.13/core/src/test/scala/org/apache/spark/ShuffleSuite.scala:47: overloaded method value apply with alternatives:
(x: Unit,xs: Unit*)Array[Unit] <and>
(x: Double,xs: Double*)Array[Double] <and>
(x: Float,xs: Float*)Array[Float] <and>
(x: Long,xs: Long*)Array[Long] <and>
(x: Int,xs: Int*)Array[Int] <and>
(x: Char,xs: Char*)Array[Char] <and>
(x: Short,xs: Short*)Array[Short] <and>
(x: Byte,xs: Byte*)Array[Byte] <and>
(x: Boolean,xs: Boolean*)Array[Boolean]
cannot be applied to ((Int, Int), (Int, Int), (Int, Int), (Int, Int))
```
Using a `Seq` instead appears to resolve it, and is effectively equivalent.
### Why are the changes needed?
To better cross-build for 2.13.
### Does this PR introduce any user-facing change?
None.
### How was this patch tested?
Existing tests.
Closes#26062 from srowen/SPARK-29401.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Syntax like `'foo` is deprecated in Scala 2.13. Replace usages with `Symbol("foo")`
### Why are the changes needed?
Avoids ~50 deprecation warnings when attempting to build with 2.13.
### Does this PR introduce any user-facing change?
None, should be no functional change at all.
### How was this patch tested?
Existing tests.
Closes#26061 from srowen/SPARK-29392.
Authored-by: Sean Owen <sean.owen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Beautify comment
### Why are the changes needed?
The config name now is pretty weird.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
No test.
Closes#26054 from wangshisan/SPARK-29189.
Authored-by: gwang3 <gwang3@ebay.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
RDD dependencies and partitions can be simultaneously
accessed and mutated by user threads and spark's scheduler threads, so
access must be thread-safe. In particular, as partitions and
dependencies are lazily-initialized, before this change they could get
initialized multiple times, which would lead to the scheduler having an
inconsistent view of the pendings stages and get stuck.
Tested with existing unit tests.
Closes#25951 from squito/SPARK-28917.
Authored-by: Imran Rashid <irashid@cloudera.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
### What changes were proposed in this pull request?
Reimplement `org.apache.spark.sql.catalyst.util.QuantileSummaries#merge` and add a test-case showing the previous bug.
### Why are the changes needed?
The original Greenwald-Khanna paper, from which the algorithm behind `approxQuantile` was taken, does not cover how to merge the result of multiple parallel QuantileSummaries. The current implementation violates some invariants and therefore the effective error can be larger than the specified.
### Does this PR introduce any user-facing change?
Yes, for same cases, the results from `approxQuantile` (`percentile_approx` in SQL) will now be within the expected error margin. For example:
```scala
var values = (1 to 100).toArray
val all_quantiles = values.indices.map(i => (i+1).toDouble / values.length).toArray
for (n <- 0 until 5) {
var df = spark.sparkContext.makeRDD(values).toDF("value").repartition(5)
val all_answers = df.stat.approxQuantile("value", all_quantiles, 0.1)
val all_answered_ranks = all_answers.map(ans => values.indexOf(ans)).toArray
val error = all_answered_ranks.zipWithIndex.map({ case (answer, expected) => Math.abs(expected - answer) }).toArray
val max_error = error.max
print(max_error + "\n")
}
```
In the current build it returns:
```
16
12
10
11
17
```
I couldn't run the code with this patch applied to double check the implementation. Can someone please confirm it now outputs at most `10`, please?
### How was this patch tested?
A new unit test was added to uncover the previous bug.
Closes#26029 from sitegui/SPARK-29336.
Authored-by: Guilherme <sitegui@sitegui.com.br>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request?
Set the default value of the `spark.sql.legacy.sizeOfNull` config to `false`. That changes behavior of the `size()` function for `NULL`. The function will return `NULL` for `NULL` instead of `-1`.
### Why are the changes needed?
There is the agreement in the PR https://github.com/apache/spark/pull/21598#issuecomment-399695523 to change behavior in Spark 3.0.
### Does this PR introduce any user-facing change?
Yes.
Before:
```sql
spark-sql> select size(NULL);
-1
```
After:
```sql
spark-sql> select size(NULL);
NULL
```
### How was this patch tested?
By the `check outputs of expression examples` test in `SQLQuerySuite` which runs expression examples.
Closes#26051 from MaxGekk/sizeof-null-returns-null.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Add back the resolved logical plan for UPDATE TABLE. It was in https://github.com/apache/spark/pull/25626 before but was removed later.
### Why are the changes needed?
In https://github.com/apache/spark/pull/25626 , we decided to not add the update API in DS v2, but we still want to implement UPDATE for builtin source like JDBC. We should at least add the resolved logical plan.
### Does this PR introduce any user-facing change?
no, UPDATE is still not supported yet.
### How was this patch tested?
new tests.
Closes#26025 from cloud-fan/update.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
### What changes were proposed in this pull request?
Add getters/setters in Pyspark ALSModel.
### Why are the changes needed?
To keep parity between python and scala.
### Does this PR introduce any user-facing change?
Yes.
add the following getters/setters to ALSModel
```
get/setUserCol
get/setItemCol
get/setColdStartStrategy
get/setPredictionCol
```
### How was this patch tested?
add doctest
Closes#25947 from huaxingao/spark-29269.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
The `LICENSE` file exists a minor issue has an incorrect path.
This PR will fix it.
### Why are the changes needed?
This is a minor bug.
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
Exists UT.
Closes#26050 from beliefer/resolve-minor-license-issue.
Authored-by: gengjiaan <gengjiaan@360.cn>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR aims the followings.
- Refactor `TPCDSQueryBenchmark` to use main method to improve the usability.
- Reduce the number of iteration from 5 to 2 because it takes too long. (2 is okay because we have `Stdev` field now. If there is an irregular run, we can notice easily with that).
- Generate one result file for TPCDS scale factor 1. (Note that this test suite can be used for the other scale factors, too.)
- AWS EC2 `r3.xlarge` with `ami-06f2f779464715dc5 (ubuntu-bionic-18.04-amd64-server-20190722.1)` is used.
This PR adds a JDK8 result based on the TPCDS ScaleFactor 1G data generated by the following.
```
# `spark-tpcds-datagen` needs this. (JDK8)
$ git clone https://github.com/apache/spark.git -b branch-2.4 --depth 1 spark-2.4
$ export SPARK_HOME=$PWD
$ ./build/mvn clean package -DskipTests
# Generate data. (JDK8)
$ git clone gitgithub.com:maropu/spark-tpcds-datagen.git
$ cd spark-tpcds-datagen/
$ build/mvn clean package
$ mkdir -p /data/tpcds
$ ./bin/dsdgen --output-location /data/tpcds/s1 // This need `Spark 2.4`
```
### Why are the changes needed?
Although the generated TPCDS data is random, we can keep the record.
### Does this PR introduce any user-facing change?
No. (This is dev-only test benchmark).
### How was this patch tested?
Manually run the benchmark. Please note that you need to have TPCDS data.
```
SPARK_GENERATE_BENCHMARK_FILES=1 build/sbt "sql/test:runMain org.apache.spark.sql.execution.benchmark.TPCDSQueryBenchmark --data-location /data/tpcds/s1"
```
Closes#26049 from dongjoon-hyun/SPARK-25668.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Fix a config name typo from the resource scheduling user docs. In case users might get confused with the wrong config name, we'd better fix this typo.
### How was this patch tested?
Document change, no need to run test.
Closes#26047 from jiangxb1987/doc.
Authored-by: Xingbo Jiang <xingbo.jiang@databricks.com>
Signed-off-by: Xingbo Jiang <xingbo.jiang@databricks.com>
Previously, the RDD level would change depending on the status reported
by executors for the block they were storing, and individual blocks would
reflect that. That is wrong because different blocks may be stored differently
in different executors.
So now the RDD tracks the user-provided storage level, while the individual
partitions reflect the current storage level of that particular block,
including the current number of replicas.
Closes#25779 from vanzin/SPARK-27468.
Authored-by: Marcelo Vanzin <vanzin@cloudera.com>
Signed-off-by: Imran Rashid <irashid@cloudera.com>
### What changes were proposed in this pull request?
In our PROD env, we have a pure Spark cluster, I think this is also pretty common, where computation is separated from storage layer. In such deploy mode, data locality is never reachable.
And there are some configurations in Spark scheduler to reduce waiting time for data locality(e.g. "spark.locality.wait"). While, problem is that, in listing file phase, the location informations of all the files, with all the blocks inside each file, are all fetched from the distributed file system. Actually, in a PROD environment, a table can be so huge that even fetching all these location informations need take tens of seconds.
To improve such scenario, Spark need provide an option, where data locality can be totally ignored, all we need in the listing file phase are the files locations, without any block location informations.
### Why are the changes needed?
And we made a benchmark in our PROD env, after ignore the block locations, we got a pretty huge improvement.
Table Size | Total File Number | Total Block Number | List File Duration(With Block Location) | List File Duration(Without Block Location)
-- | -- | -- | -- | --
22.6T | 30000 | 120000 | 16.841s | 1.730s
28.8 T | 42001 | 148964 | 10.099s | 2.858s
3.4 T | 20000 | 20000 | 5.833s | 4.881s
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Via ut.
Closes#25869 from wangshisan/SPARK-29189.
Authored-by: gwang3 <gwang3@ebay.com>
Signed-off-by: Imran Rashid <irashid@cloudera.com>
### What changes were proposed in this pull request?
In the PR, I propose to pass the `Pattern.CASE_INSENSITIVE` flag while compiling interval patterns in `CalendarInterval`. This makes casting string values to intervals case insensitive and tolerant to case of the `interval`, `year(s)`, `month(s)`, `week(s)`, `day(s)`, `hour(s)`, `minute(s)`, `second(s)`, `millisecond(s)` and `microsecond(s)`.
### Why are the changes needed?
There are at least 2 reasons:
- To maintain feature parity with PostgreSQL which is not sensitive to case:
```sql
# select cast('10 Days' as INTERVAL);
interval
----------
10 days
(1 row)
```
- Spark is tolerant to case of interval literals. Case insensitivity in casting should be convenient for Spark users.
```sql
spark-sql> SELECT INTERVAL 1 YEAR 1 WEEK;
interval 1 years 1 weeks
```
### Does this PR introduce any user-facing change?
Yes, current implementation produces `NULL` for `interval`, `year`, ... `microsecond` that are not in lower case.
Before:
```sql
spark-sql> SELECT CAST('INTERVAL 10 DAYS' as INTERVAL);
NULL
```
After:
```sql
spark-sql> SELECT CAST('INTERVAL 10 DAYS' as INTERVAL);
interval 1 weeks 3 days
```
### How was this patch tested?
- by new tests in `CalendarIntervalSuite.java`
- new test in `CastSuite`
Closes#26010 from MaxGekk/interval-case-insensitive.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
add column setters/getters support in Pyspark feature models
### Why are the changes needed?
keep parity between Pyspark and Scala
### Does this PR introduce any user-facing change?
Yes.
After the change, Pyspark feature models have column setters/getters support.
### How was this patch tested?
Add some doctests
Closes#25908 from huaxingao/spark-29143.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request?
### Why are the changes needed?
Keep parity between Scala and Python
### Does this PR introduce any user-facing change?
add the following getters/setter to FPGrowthModel
```
getMinSupport
getNumPartitions
getMinConfidence
getItemsCol
getPredictionCol
setItemsCol
setMinConfidence
setPredictionCol
```
add following getters/setters to PrefixSpan
```
set/getMinSupport
set/getMaxPatternLength
set/getMaxLocalProjDBSize
set/getSequenceCol
```
### How was this patch tested?
add doctest
Closes#26035 from huaxingao/spark-29360.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request?
The current docker image used by Kubernetes is `openjdk:8-alpine`. It was not supported and was removed with the commit 3eb0351b20 (diff-f95ffa3d1377774732c33f7b8368e099).
This PR proposes to move to a supported docker image.
### Why are the changes needed?
I think there are at least two reasons:
1. According to the commit, Alpine/musl is not officially supported by the OpenJDK project.
2. As no more OpenJDK 8 Alpine images, new JDK updates including security fixes
, are not applied to it. See below:
```
docker run -it --rm openjdk:8-alpine java -version
openjdk version "1.8.0_212"
OpenJDK Runtime Environment (IcedTea 3.12.0) (Alpine 8.212.04-r0)
OpenJDK 64-Bit Server VM (build 25.212-b04, mixed mode)
```
```
docker run -it --rm openjdk:8-jdk-slim java -version
openjdk version "1.8.0_222"
OpenJDK Runtime Environment (build 1.8.0_222-b10)
OpenJDK 64-Bit Server VM (build 25.222-b10, mixed mode)
```
### Does this PR introduce any user-facing change?
Yes. This changes the base docker image of Spark.
### How was this patch tested?
Existing tests.
Closes#26037 from viirya/SPARK-28938.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
I introduced new constants `SECONDS_PER_MONTH` and `MILLIS_PER_MONTH`, and reused it in calculations of seconds/milliseconds per month. `SECONDS_PER_MONTH` is 2629746 because the average year of the Gregorian calendar is 365.2425 days long or 60 * 60 * 24 * 365.2425 = 31556952.0 = 12 * 2629746 seconds per year.
### Why are the changes needed?
Spark uses the proleptic Gregorian calendar (see https://issues.apache.org/jira/browse/SPARK-26651) in which the average year is 365.2425 days (see https://en.wikipedia.org/wiki/Gregorian_calendar) but existing implementation assumes 31 days per months or 12 * 31 = 372 days. That's far away from the the truth.
### Does this PR introduce any user-facing change?
Yes, the changes affect at least 3 methods in `GroupStateImpl`, `EventTimeWatermark` and `MonthsBetween`. For example, the `month_between()` function will return different result in some cases.
Before:
```sql
spark-sql> select months_between('2019-09-15', '1970-01-01');
596.4516129
```
After:
```sql
spark-sql> select months_between('2019-09-15', '1970-01-01');
596.45996838
```
### How was this patch tested?
By existing test suite `DateTimeUtilsSuite`, `DateFunctionsSuite` and `DateExpressionsSuite`.
Closes#25998 from MaxGekk/days-in-year.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request?
Added new rules to `TypeCoercion.DateTimeOperations` for the `Subtract` expression which is replaced by existing `TimestampDiff` expression if one of its parameter has the `DATE` type and another one is the `TIMESTAMP` type. The date argument is casted to timestamp.
### Why are the changes needed?
- To maintain feature parity with PostgreSQL which supports subtraction of a date from a timestamp and a timestamp from a date:
```sql
maxim=# select timestamp'now' - date'epoch';
?column?
----------------------------
18175 days 21:07:33.412875
(1 row)
maxim=# select date'2020-01-01' - timestamp'now';
?column?
-------------------------
86 days 02:52:00.945296
(1 row)
```
- To conform to the SQL standard which defines datetime subtraction as an interval.
### Does this PR introduce any user-facing change?
Yes, currently the queries bellow fails with the error:
```sql
spark-sql> select timestamp'now' - date'2019-10-01';
Error in query: cannot resolve '(TIMESTAMP('2019-10-06 21:05:07.234') - DATE '2019-10-01')' due to data type mismatch: differing types in '(TIMESTAMP('2019-10-06 21:05:07.234') - DATE '2019-10-01')' (timestamp and date).; line 1 pos 7;
'Project [unresolvedalias((1570385107234000 - 18170), None)]
+- OneRowRelation
```
after the changes:
```sql
spark-sql> select timestamp'now' - date'2019-10-01';
interval 5 days 21 hours 4 minutes 55 seconds 878 milliseconds
```
### How was this patch tested?
- Add new cases to the `rule for date/timestamp operations` test in `TypeCoercionSuite`
- by 2 new test in `datetime.sql`
Closes#26036 from MaxGekk/date-timestamp-subtract.
Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
In kubernetes, there are some naming regular expression requirements and restrictions on environment variable names, such as:
- In kubernetes version release-1.7 and earlier, the naming rules of pod environment variable names should meet the requirements of regular expressions: [[A-Za-z_] [A-Za-z0-9_]*](https://github.com/kubernetes/kubernetes/blob/release-1.7/staging/src/k8s.io/apimachinery/pkg/util/validation/validation.go#L169)
- In kubernetes version release-1.8 and later, the naming rules of pod environment variable names should meet the requirements of regular expressions: [[-. _ A-ZA-Z][-. _ A-ZA-Z0-9].*](https://github.com/kubernetes/kubernetes/blob/release-1.8/staging/src/k8s.io/apimachinery/pkg/util/validation/validation.go#L305)
However, in spark on k8s mode, spark should add restrictions on environmental variable names when creating executorEnv.
In addition, we need to use regular expressions adapted to the high version of k8s to increase the restrictions on the names of environmental variables.
Otherwise, the pod will not be created properly and the spark application will be suspended.
To solve the problem above, a regular validation to executorEnv is added and committed.
### Why are the changes needed?
If no validation rules are added, the environment variable names that don't meet the requirements will cause the pod to not be created properly and the application will be suspended.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Add unit tests and manually run.
Closes#25920 from merrily01/SPARK-29233.
Authored-by: maruilei <maruilei@jd.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
### What changes were proposed in this pull request?
- Removal of `private[ml]` modifier from `Regressor`.
- Marking `Regressor` as `DeveloperApi`.
### Why are the changes needed?
Consistency with the rest of ML API as described in [the corresponding JIRA ticket](https://issues.apache.org/jira/browse/SPARK-29363).
### Does this PR introduce any user-facing change?
Yes, as described above.
### How was this patch tested?
Existing tests.
Closes#26033 from zero323/SPARK-29363.
Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>