### What changes were proposed in this pull request?
This PR aims to update comments in `gen-sql-config-docs.py`.
### Why are the changes needed?
To bring it up to date with the Spark 3.2.0 release.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
N/A.
Closes #33902 from williamhyun/fixtool.
Authored-by: William Hyun <william@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
Two cases of using an undefined window specification, shown below, should produce a proper error message:
1. Using an undefined window specification with a window function
```
SELECT nth_value(employee_name, 2) OVER w second_highest_salary
FROM basic_pays;
```
The original error message is:
```
Window function nth_value(employee_name#x, 2, false) requires an OVER clause.
```
This is confusing, since the query uses a window specification `w` that is simply not defined.
Now the error message is:
```
Window specification w is not defined in the WINDOW clause.
```
2. Using an undefined window specification with an aggregate function
```
SELECT SUM(salary) OVER w sum_salary
FROM basic_pays;
```
The original error message is:
```
Error in query: unresolved operator 'Aggregate [unresolvedwindowexpression(sum(salary#2), WindowSpecReference(w)) AS sum_salary#34]
+- SubqueryAlias spark_catalog.default.basic_pays
+- HiveTableRelation [`default`.`employees`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [name#0, dept#1, salary#2, age#3], Partition Cols: []]
```
In this case, when converting the global aggregate, the analyzer should skip `UnresolvedWindowExpression`.
Now the error message is:
```
Window specification w is not defined in the WINDOW clause.
```
### Why are the changes needed?
Provide proper error messages.
### Does this PR introduce _any_ user-facing change?
Yes, the error messages are improved as described above.
### How was this patch tested?
Added UT
Closes #33892 from AngersZhuuuu/SPARK-36637.
Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Update both `DataFrame.approxQuantile` and `DataFrameStatFunctions.approxQuantile` to support overloaded definitions when multiple columns are supplied.
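For reference, a minimal sketch of the two call forms the updated hints cover (data and column names are illustrative):
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([(30, 1000.0), (40, 2000.0)], ["age", "salary"])

# Single column: returns List[float]
median = df.approxQuantile("salary", [0.5], 0.01)
# Multiple columns: returns List[List[float]], one inner list per column
medians = df.approxQuantile(["salary", "age"], [0.5], 0.01)
```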
### Why are the changes needed?
The current type hints don't support the multi-column signature, a form that was added in Spark 2.2 (see [the approxQuantile docs](https://spark.apache.org/docs/latest/api/python/reference/api/pyspark.sql.DataFrame.approxQuantile.html)). This change was also introduced to pyspark-stubs (https://github.com/zero323/pyspark-stubs/pull/552). zero323 asked me to open a PR for the upstream change.
### Does this PR introduce _any_ user-facing change?
This change only affects type hints - it brings the `approxQuantile` type hints up to date with the actual code.
### How was this patch tested?
Ran `./dev/lint-python`.
Closes #33880 from carylee/master.
Authored-by: Cary Lee <cary@amperity.com>
Signed-off-by: zero323 <mszymkiewicz@gmail.com>
### What changes were proposed in this pull request?
This PR adds a test for SPARK-36400 (#33743).
### Why are the changes needed?
SPARK-36512 (#33741) was fixed so we can add this test now.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
New test.
Closes #33885 from sarutak/add-reduction-test.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
### What changes were proposed in this pull request?
This PR proposes to implement `TimestampNTZType` support in PySpark's `SparkSession.createDataFrame`, `DataFrame.toPandas`, Python UDFs, and pandas UDFs with and without Arrow.
### Why are the changes needed?
To complete `TimestampNTZType` support.
### Does this PR introduce _any_ user-facing change?
Yes.
- Users now can use `TimestampNTZType` type in `SparkSession.createDataFrame`, `DataFrame.toPandas`, Python UDFs, and pandas UDFs with and without Arrow.
- If `spark.sql.timestampType` is configured to `TIMESTAMP_NTZ`, PySpark will infer the `datetime` without timezone as `TimestampNTZType`. If it has a timezone, it will be inferred as `TimestampType` in `SparkSession.createDataFrame`.
- If `TimestampType` and `TimestampNTZType` conflict during merging inferred schema, `TimestampType` has a higher precedence.
- If the type is `TimestampNTZType`, it is treated internally as having an unknown timezone, computed with UTC (the same as on the JVM side), and not localized externally.
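A minimal sketch of the inference behavior described above (the naive `datetime` value is illustrative):
```python
import datetime

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.conf.set("spark.sql.timestampType", "TIMESTAMP_NTZ")

# A naive datetime (no tzinfo) is inferred as TimestampNTZType under this config
df = spark.createDataFrame([(datetime.datetime(2021, 8, 30, 12, 0),)], ["ts"])
df.printSchema()  # ts: timestamp_ntz
```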
### How was this patch tested?
Manually tested and unittests were added.
Closes #33876 from HyukjinKwon/SPARK-36626.
Lead-authored-by: Hyukjin Kwon <gurwls223@apache.org>
Co-authored-by: Dominik Gehl <dog@open.ch>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR proposes to ask users if they want to download and install SparkR when they install SparkR from CRAN.
`SPARKR_ASK_INSTALLATION` environment variable was added in case other notebook projects are affected.
### Why are the changes needed?
This is required for CRAN. Currently SparkR is removed: https://cran.r-project.org/web/packages/SparkR/index.html.
See also https://lists.apache.org/thread.html/r02b9046273a518e347dfe85f864d23d63d3502c6c1edd33df17a3b86%40%3Cdev.spark.apache.org%3E
### Does this PR introduce _any_ user-facing change?
Yes, `sparkR.session(...)` will ask whether users want to download and install the Spark package if they are in a plain R shell or `Rscript`.
### How was this patch tested?
**R shell**
Valid input (`n`):
```
> sparkR.session(master="local")
Spark not found in SPARK_HOME:
Will you download and install (or reuse if it exists) Spark package under the cache [/.../Caches/spark]? (y/n): n
```
```
Error in sparkCheckInstall(sparkHome, master, deployMode) :
Please make sure Spark package is installed in this machine.
- If there is one, set the path in sparkHome parameter or environment variable SPARK_HOME.
- If not, you may run install.spark function to do the job.
```
Invalid input:
```
> sparkR.session(master="local")
Spark not found in SPARK_HOME:
Will you download and install (or reuse if it exists) Spark package under the cache [/.../Caches/spark]? (y/n): abc
```
```
Will you download and install (or reuse if it exists) Spark package under the cache [/.../Caches/spark]? (y/n):
```
Valid input (`y`):
```
> sparkR.session(master="local")
Will you download and install (or reuse if it exists) Spark package under the cache [/.../Caches/spark]? (y/n): y
Spark not found in the cache directory. Installation will start.
MirrorUrl not provided.
Looking for preferred site from apache website...
Preferred mirror site found: https://ftp.riken.jp/net/apache/spark
Downloading spark-3.3.0 for Hadoop 2.7 from:
- https://ftp.riken.jp/net/apache/spark/spark-3.3.0/spark-3.3.0-bin-hadoop2.7.tgz
trying URL 'https://ftp.riken.jp/net/apache/spark/spark-3.3.0/spark-3.3.0-bin-hadoop2.7.tgz'
...
```
**Rscript**
```
cat tmp.R
```
```
library(SparkR, lib.loc = c(file.path(".", "R", "lib")))
sparkR.session(master="local")
```
```
Rscript tmp.R
```
Valid input (`n`):
```
Spark not found in SPARK_HOME:
Will you download and install (or reuse if it exists) Spark package under the cache [/.../Caches/spark]? (y/n): n
```
```
Error in sparkCheckInstall(sparkHome, master, deployMode) :
Please make sure Spark package is installed in this machine.
- If there is one, set the path in sparkHome parameter or environment variable SPARK_HOME.
- If not, you may run install.spark function to do the job.
Calls: sparkR.session -> sparkCheckInstall
```
Invalid input:
```
Spark not found in SPARK_HOME:
Will you download and install (or reuse if it exists) Spark package under the cache [/.../Caches/spark]? (y/n): abc
```
```
Will you download and install (or reuse if it exists) Spark package under the cache [/.../Caches/spark]? (y/n):
```
Valid input (`y`):
```
...
Spark not found in SPARK_HOME:
Will you download and install (or reuse if it exists) Spark package under the cache [/.../Caches/spark]? (y/n): y
Spark not found in the cache directory. Installation will start.
MirrorUrl not provided.
Looking for preferred site from apache website...
Preferred mirror site found: https://ftp.riken.jp/net/apache/spark
Downloading spark-3.3.0 for Hadoop 2.7 from:
...
```
`bin/sparkR` and `bin/spark-submit *.R` are not affected (tested).
Closes #33887 from HyukjinKwon/SPARK-36631.
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR adds Apache license headers to the pandas API on Spark documents.
### Why are the changes needed?
Pandas API on Spark document sources do not have license headers, while the other docs do.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
`make html`
Closes #33871 from yoda-mon/add-license-header.
Authored-by: Leona Yoda <yodal@oss.nttdata.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR proposes to add `BooleanType` support to the `UnwrapCastInBinaryComparison` optimizer, which currently supports `NumericType` only.
The main idea is to treat `BooleanType` as a 1-bit integer, so that we can utilize all optimizations already defined in `UnwrapCastInBinaryComparison`.
This work is an extension of SPARK-24994 and SPARK-32858.
### Why are the changes needed?
The current implementation cannot properly optimize the filter in the following case:
```
SELECT * FROM t WHERE boolean_field = 2
```
The above query creates a filter of `cast(boolean_field, int) = 2`. The cast prevents the filter from being pushed down. In contrast, this PR folds the filter to `false` and returns early, as there cannot be any matching rows (empty result).
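A sketch of the effect (the table and data are made up; the point is the folded predicate in the plan):
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.createDataFrame([(True,), (False,)], ["boolean_field"]).createOrReplaceTempView("t")

# Without the optimization the plan keeps cast(boolean_field as int) = 2, which
# blocks pushdown; with it, the predicate folds to `false` and nothing is scanned.
spark.sql("SELECT * FROM t WHERE boolean_field = 2").explain()
```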
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Passed existing tests
```
build/sbt "catalyst/test"
build/sbt "sql/test"
```
Added unit tests
```
build/sbt "catalyst/testOnly *UnwrapCastInBinaryComparisonSuite -- -z SPARK-36607"
build/sbt "sql/testOnly *UnwrapCastInComparisonEndToEndSuite -- -z SPARK-36607"
```
Closes #33865 from kazuyukitanimura/SPARK-36607.
Authored-by: Kazuyuki Tanimura <ktanimura@apple.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This change creates a new type of trigger, Trigger.AvailableNow, for streaming queries. It is like Trigger.Once, which processes all available data and then stops the query, but with better scalability, since data can be processed in multiple batches instead of one.
To achieve this, this change proposes a new interface `SupportsTriggerAvailableNow`, which is an extension of `SupportsAdmissionControl`. It has one method, `prepareForTriggerAvailableNow`, which will be called at the beginning of streaming queries with Trigger.AvailableNow, to let the source record the offset for the current latest data at the time (a.k.a. the target offset for the query). The source should then behave as if there is no new data coming in after the beginning of the query, i.e., the source will not return an offset higher than the target offset when `latestOffset` is called.
This change also updates `FileStreamSource` to be an implementation of `SupportsTriggerAvailableNow`.
For other sources that do not implement `SupportsTriggerAvailableNow`, this change also provides a new class `FakeLatestOffsetSupportsTriggerAvailableNow`, which wraps the sources and makes them support Trigger.AvailableNow by overriding their `latestOffset` method to always return the latest offset at the beginning of the query.
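A PySpark usage sketch, assuming the Python-side `trigger(availableNow=True)` option that accompanies this feature (all paths are hypothetical):
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

query = (spark.readStream.format("text").load("/tmp/in")  # hypothetical input path
         .writeStream.format("parquet")
         .option("checkpointLocation", "/tmp/ckpt")       # hypothetical checkpoint path
         .trigger(availableNow=True)  # process all currently available data, possibly in several batches, then stop
         .start("/tmp/out"))                              # hypothetical output path
query.awaitTermination()
```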
### Why are the changes needed?
Currently, streaming queries with Trigger.Once will always load all of the available data in a single batch. Because of this, the amount of data a query can process is limited, or the Spark driver may run out of memory.
### Does this PR introduce _any_ user-facing change?
Users will be able to use Trigger.AvailableNow (to process all available data then stop the streaming query) with this change.
### How was this patch tested?
Added unit tests.
Closes #33763 from bozhang2820/new-trigger.
Authored-by: Bo Zhang <bo.zhang@databricks.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
### What changes were proposed in this pull request?
This PR aims to upgrade `aircompressor` dependency from 1.19 to 1.21.
### Why are the changes needed?
This brings the latest bug fix, for an issue that exists in `aircompressor` 1.17 through 1.20:
- 1e364f7133
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass the CIs.
Closes #33883 from dongjoon-hyun/SPARK-36629.
Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
Implement `DataFrame.combine_first`.
The PR is based on https://github.com/databricks/koalas/pull/1950. Thanks AishwaryaKalloli for the prototype.
### Why are the changes needed?
Updating null elements with the value at the same location in another DataFrame is a common use case.
It is supported in pandas, so we should support it as well.
### Does this PR introduce _any_ user-facing change?
Yes. `DataFrame.combine_first` can be used.
```py
>>> ps.set_option("compute.ops_on_diff_frames", True)
>>> df1 = ps.DataFrame({'A': [None, 0], 'B': [None, 4]})
>>> df2 = ps.DataFrame({'A': [1, 1], 'B': [3, 3]})
>>> df1.combine_first(df2).sort_index()
A B
0 1.0 3.0
1 0.0 4.0
# Null values still persist if the location of that null value does not exist in other
>>> df1 = ps.DataFrame({'A': [None, 0], 'B': [4, None]})
>>> df2 = ps.DataFrame({'B': [3, 3], 'C': [1, 1]}, index=[1, 2])
>>> df1.combine_first(df2).sort_index()
A B C
0 NaN 4.0 NaN
1 0.0 3.0 1.0
2 NaN 3.0 1.0
>>> ps.reset_option("compute.ops_on_diff_frames")
```
### How was this patch tested?
Unit tests.
Closes #33714 from xinrong-databricks/df_combine_first.
Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
### What changes were proposed in this pull request?
This PR proposes to add the support of `TimestampNTZType` in Arrow APIs.
Arrow can now write `TimestampNTZType` as an Arrow Timestamp with a `null` timezone.
### Why are the changes needed?
To complete the support of `TimestampNTZType` in Apache Spark.
### Does this PR introduce _any_ user-facing change?
Yes, the Arrow APIs (`ArrowColumnVector`) can now write `TimestampNTZType`.
### How was this patch tested?
Unittests were added.
Closes #33875 from HyukjinKwon/SPARK-36608-arrow.
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR proposes to fix bugs around prefix-scan for both HDFS backed state store and RocksDB state store.
> HDFS backed state store
We did "shallow-copy" on copying prefix map, which leads the values of prefix map (mutable Set) to be "same instances" across multiple versions. This PR fixes it via creating a new mutable Set and copying elements.
> RocksDB state store
Prefix-scan iterators are only closed on RocksDB.rollback(), which is only called in RocksDBStateStore.abort().
While the `RocksDBStateStore.abort()` method will be called for streaming session window (since it has two physical plans, for read and for write), other stateful operators, which have only a single read-write physical plan, call either commit or abort and don't close the iterators on commit. These unclosed iterators can be "reused" and produce incorrect outputs.
This PR ensures that prefix-scan iterators are reset when loading RocksDB, which was previously done only on rollback.
### Why are the changes needed?
Please refer to the section above for an explanation of the bugs and their fixes.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Modified UT which failed without this PR and passes with this PR.
Closes #33870 from HeartSaVioR/SPARK-36619.
Authored-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?
Remove unused listener in HiveThriftServer2AppStatusStore.
### Why are the changes needed?
`HiveThriftServer2AppStatusStore` provides a view of a KVStore with methods that make it easy to query SQL-specific state.
`HiveThriftServer2Listener` serves no purpose in this class and should not appear in it; we can remove it.
### Does this PR introduce _any_ user-facing change?
### How was this patch tested?
Closes #33866 from timothy65535/master.
Authored-by: timothy65535 <timothy65535@163.com>
Signed-off-by: Kent Yao <yao@apache.org>
### What changes were proposed in this pull request?
Spark SQL includes a data source that can read data from other databases using JDBC.
Spark also supports the case-insensitive option `pushDownPredicate`.
According to http://spark.apache.org/docs/latest/sql-data-sources-jdbc.html, if `pushDownPredicate` is set to false, no filter is pushed down to the JDBC data source, and thus all filters are handled by Spark.
However, I found that filters are still pushed down to the JDBC data source.
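A sketch of the option in use (connection details are hypothetical); with the fix, the JDBC scan in the physical plan should show an empty `PushedFilters` list:
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = (spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://localhost:5432/testdb")  # hypothetical URL
      .option("dbtable", "people")                               # hypothetical table
      .option("pushDownPredicate", "false")  # filters should be handled by Spark
      .load())
df.filter("age > 30").explain()
```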
### Why are the changes needed?
Fix a bug where `pushDownPredicate=false` failed to prevent filters from being pushed down to the JDBC data source.
### Does this PR introduce _any_ user-facing change?
No.
The output of queries will not change.
### How was this patch tested?
Jenkins test.
Closes #33822 from beliefer/SPARK-36574.
Authored-by: gengjiaan <gengjiaan@360.cn>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Mark the newly added exceptions `private[spark]` according to these [comments](https://github.com/apache/spark/pull/33573#discussion_r696324962).
### Why are the changes needed?
[comments](https://github.com/apache/spark/pull/33573#discussion_r696324962)
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing unit test cases.
Closes #33856 from Peng-Lei/SPARK-36336-FOLLOW.
Authored-by: PengLei <peng.8lei@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Use WeakReference instead of SoftReference in LevelDB.
### Why are the changes needed?
(See discussion at https://github.com/apache/spark/pull/28769#issuecomment-906722390 )
"The soft reference to iterator introduced in this pr unfortunately ended up causing iterators to not be closed when they go out of scope (which would have happened earlier in the finalize)
This is because java is more conservative in cleaning up SoftReference's.
The net result was we ended up having 50k files for SHS while typically they get compacted away to 200 odd files.
Changing from SoftReference to WeakReference should make it much more aggressive in cleanup and prevent the issue - which we observed in a 3.1 SHS"
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing tests
Closes #33859 from srowen/SPARK-36603.
Authored-by: Sean Owen <srowen@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
The main change of this PR is to replace `filter` + `contains` with the `intersect` API, and `filterNot` + `contains` with `diff`.
**Before**
```scala
val set = Set(1, 2)
val others = Set(2, 3)
set.filter(others.contains(_))
set.filterNot(others.contains)
```
**After**
```scala
val set = Set(1, 2)
val others = Set(2, 3)
set.intersect(others)
set.diff(others)
```
### Why are the changes needed?
Code simplification: replace manual implementations with the existing API.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Pass the Jenkins or GitHub Action
Closes #33829 from LuciferYang/SPARK-36580.
Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
Proposing that the `KafkaOffsetRangeCalculator`'s range calculation logic be modified to exclude small (i.e. un-split) partitions from the overall proportional distribution math, in order to more reasonably divide the large partitions when they are accompanied by many small partitions, and to provide optimal behavior for cases where a `minPartitions` value is deliberately computed based on the volume of data being read.
### Why are the changes needed?
While the [documentation](https://spark.apache.org/docs/latest/structured-streaming-kafka-integration.html) does contain a clear disclaimer,
> Please note that this configuration is like a hint: the number of Spark tasks will be **approximately** `minPartitions`. It can be less or more depending on rounding errors or Kafka partitions that didn't receive any new data.
there are cases where the calculated Kafka partition range splits can differ greatly from expectations. For evenly distributed data and most `minPartitions` values this would not be a major or commonly encountered concern. However, when the distribution of data across partitions is very heavily skewed, somewhat surprising range split calculations can result.
For example, given the following input data:
- 1 partition containing 10,000 messages
- 1,000 partitions each containing 1 message
Spark processing code loading from this collection of 1,001 partitions may decide that it would like each task to read no more than 1,000 messages. Consequently, it could specify a `minPartitions` value of 1,010 — expecting the single large partition to be split into 10 equal chunks, along with the 1,000 small partitions each having their own task. That is far from what actually occurs. The `KafkaOffsetRangeCalculator` algorithm ends up splitting the large partition into 918 chunks of 10 or 11 messages, two orders of magnitude from the desired maximum message count per task and nearly double the number of Spark tasks hinted in the configuration.
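A sketch of the intended arithmetic for this example (illustrative, not the actual Spark code):
```python
large = 10_000        # messages in the single large partition
small_parts = 1_000   # partitions with 1 message each; too small to split further
min_partitions = 1_010

# Excluding the un-split small partitions from the proportional distribution
# leaves 10 range slots for the large partition:
slots_for_large = min_partitions - small_parts  # 10
messages_per_task = large // slots_for_large    # 1,000, matching the caller's intent
```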
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing unit tests and added new unit tests
Closes #33827 from noslowerdna/SPARK-36576.
Authored-by: Andrew Olson <aolson1@cerner.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
### What changes were proposed in this pull request?
Some of the GitHub Actions workflow files are missing the Apache license header. This PR adds it.
### Why are the changes needed?
To comply with the Apache license.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
N/A
Closes #33862 from HyukjinKwon/minor-lisence.
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
The `try_add` function allows the following inputs:
- number, number
- date, number
- date, interval
- timestamp, interval
- interval, interval
And, the `try_divide` function allows the following inputs:
- number, number
- interval, number
However, the current code has examples and tests only for the (number, number) inputs. We should enhance the docs to let users know that the functions can be used for datetime and interval operations too.
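A few runnable illustrations of those cases (the values are illustrative):
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

spark.sql("SELECT try_add(date'2021-01-01', interval 1 month)").show()                # date + interval
spark.sql("SELECT try_add(timestamp'2021-01-01 00:00:00', interval 2 hours)").show()  # timestamp + interval
spark.sql("SELECT try_divide(interval 6 months, 2)").show()                           # interval / number
```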
### Why are the changes needed?
Improve documentation and tests.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
New UT
Also build docs for preview:
![image](https://user-images.githubusercontent.com/1097932/131212897-8aea14c8-a882-4e12-94e2-f56bde7c0367.png)
Closes #33861 from gengliangwang/enhanceTryDoc.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR upgrades Apache `commons-pool2` from `2.6.2` to `2.11.1`; `2.11.1` is built for Java 8, while `2.6.2` is still a Java 7 build.
### Why are the changes needed?
This brings bug fixes such as `DefaultPooledObject.getIdleTime() drops nanoseconds on Java 9 and greater`.
Other changes: [RELEASE-NOTES](https://gitbox.apache.org/repos/asf?p=commons-pool.git;a=blob;f=RELEASE-NOTES.txt)
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Pass the Jenkins or GitHub Action
Closes #33831 from LuciferYang/commons-pool2.
Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR upgrades Jackson from `2.12.3` to `2.12.5`.
### Why are the changes needed?
Recently, Jackson `2.12.5` was released and it seems to be expected as the last full patch release for 2.12.x.
This release includes a fix for a regression in jackson-databind introduced in `2.12.3` which Spark 3.2 currently depends on.
https://github.com/FasterXML/jackson/wiki/Jackson-Release-2.12.5
### Does this PR introduce _any_ user-facing change?
Dependency maintenance.
### How was this patch tested?
CIs.
Closes #33860 from sarutak/upgrade-jackson-2.12.5.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
### What changes were proposed in this pull request?
This PR fixes an issue where executors are never re-scheduled if the worker they run on stops.
As a result, the application gets stuck.
You can easily reproduce this issue with the following procedure.
```
# Run master
$ sbin/start-master.sh
# Run worker 1
$ SPARK_LOG_DIR=/tmp/worker1 SPARK_PID_DIR=/tmp/worker1/ sbin/start-worker.sh -c 1 -h localhost -d /tmp/worker1 --webui-port 8081 spark://<hostname>:7077
# Run worker 2
$ SPARK_LOG_DIR=/tmp/worker2 SPARK_PID_DIR=/tmp/worker2/ sbin/start-worker.sh -c 1 -h localhost -d /tmp/worker2 --webui-port 8082 spark://<hostname>:7077
# Run Spark Shell
$ bin/spark-shell --master spark://<hostname>:7077 --executor-cores 1 --total-executor-cores 1
# Check which worker the executor runs on and then kill the worker.
$ kill <worker pid>
```
With the procedure above, we would expect the executor to be re-scheduled on the other worker, but it is not.
The reason seems to be that `Master.schedule` cannot be called after the worker is marked as `WorkerState.DEAD`.
So, the solution this PR proposes is to call `Master.schedule` whenever `Master.removeWorker` is called.
This PR also fixes an issue that `ExecutorRunner` can send `ExecutorStateChanged` message without changing its state.
This issue causes an assertion error:
```
2021-08-13 14:05:37,991 [dispatcher-event-loop-9] ERROR: Ignoring error java.lang.AssertionError: assertion failed: executor 0 state transfer from RUNNING to RUNNING is illegal
```
### Why are the changes needed?
It's a critical bug.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Manually tested with the procedure shown above and confirmed the executor is re-scheduled.
Closes #33818 from sarutak/fix-scheduling-stuck.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
### What changes were proposed in this pull request?
This PR makes a minor change in the file SaveAsHiveFile.scala.
It drops `getParent` from the following part of the code:
```scala
if (extURI.getScheme == "viewfs") {
  getExtTmpPathRelTo(path.getParent, hadoopConf, stagingDir)
```
### Why are the changes needed?
Hive creates `.staging` directories inside the `/db/table` location, but Spark SQL creates `.staging` directories inside the `/db/` location when Hadoop federation (viewfs) is used. It works as expected (creating `.staging` inside the `/db/table/` location) for other filesystems like HDFS.
In HIVE:
```
beeline
> use dicedb;
> insert into table part_test partition (j=1) values (1);
...
INFO : Loading data to table dicedb.part_test partition (j=1) from **viewfs://cloudera/user/daisuke/dicedb/part_test/j=1/.hive-staging_hive_2021-07-19_13-04-44_989_6775328876605030677-1/-ext-10000**
```
but spark's behaviour,
```
spark-sql> use dicedb;
spark-sql> insert into table part_test partition (j=2) values (2);
21/07/19 13:07:37 INFO FileUtils: Creating directory if it doesn't exist: **viewfs://cloudera/user/daisuke/dicedb/.hive-staging_hive_2021-07-19_13-07-37_317_5083528872437596950-1**
...
```
The reason we require this change: if we allow spark-sql to create the `.staging` directory inside the `/db/` location, we end up with security issues, because we would need to grant permission on the `viewfs:///db/` location to all users who submit Spark jobs.
After this change is applied, spark-sql creates `.staging` inside `/db/table/`, similar to Hive, as below:
```
spark-sql> use dicedb;
21/07/28 00:22:47 INFO SparkSQLCLIDriver: Time taken: 0.929 seconds
spark-sql> insert into table part_test partition (j=8) values (8);
21/07/28 00:23:25 INFO HiveMetaStoreClient: Closed a connection to metastore, current connections: 1
21/07/28 00:23:26 INFO FileUtils: Creating directory if it doesn't exist: **viewfs://cloudera/user/daisuke/dicedb/part_test/.hive-staging_hive_2021-07-28_00-23-26_109_4548714524589026450-1**
```
The reason we don't see this issue in Hive but only in spark-sql: Hive passes a "/db/table/tmp" directory structure as the path, so `path.getParent` returns "/db/table/". Spark passes just "/db/table", so `path.getParent` is not required for Hadoop federation (viewfs).
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Tested manually by creating hive-sql.jar
Closes #33577 from senthh/viewfs-792392.
Authored-by: senthilkumarb <senthilkumarb@cloudera.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
Change API doc for `UnivariateFeatureSelector`
### Why are the changes needed?
make the doc look better
### Does this PR introduce _any_ user-facing change?
yes, API doc change
### How was this patch tested?
Manually checked
Closes #33855 from huaxingao/ml_doc.
Authored-by: Huaxin Gao <huaxin_gao@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR proposes to document `window` & `session_window` function in SQL API doc page.
Screenshot of functions:
> window
![Screenshot 2021-08-26 6.34.58 PM](https://user-images.githubusercontent.com/1317309/130939754-0ea1b55e-39d4-4205-b79d-a9508c98921c.png)
> session_window
![Screenshot 2021-08-26 6.35.19 PM](https://user-images.githubusercontent.com/1317309/130939773-b6cb4b98-88f8-4d57-a188-ee40ed7b2b08.png)
### Why are the changes needed?
The description is missing for both the `window` and `session_window` functions in the SQL API page.
### Does this PR introduce _any_ user-facing change?
Yes, the descriptions of the `window` / `session_window` functions will be available in the SQL API page.
### How was this patch tested?
Only doc changes.
Closes #33846 from HeartSaVioR/SPARK-36595.
Authored-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
### What changes were proposed in this pull request?
Improve the user guide document.
### Why are the changes needed?
Make the user guide clear.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Doc change only.
Closes #33854 from xuanyuanking/SPARK-35611-follow.
Authored-by: Yuanjian Li <yuanjian.li@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR migrates CreateNamespaceStatement to the v2 command framework. Two new logical plans `UnresolvedObjectName` and `ResolvedObjectName` are introduced to support these CreateXXXStatements.
### Why are the changes needed?
Avoid duplicated code
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
existing tests
Closes #33835 from cloud-fan/ddl.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Debugged internally and found a bug: we should disable the vectorized reader based on the schema recursively. Currently we check that `schema.length` is no more than `wholeStageMaxNumFields` before enabling vectorization, but `schema.length` does not take the sub-fields of nested columns into account (i.e., it views a nested column the same as a primitive column). This check is wrong when enabling vectorization for nested columns. We should follow the [same check](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/WholeStageCodegenExec.scala#L896) from `WholeStageCodegenExec` and count sub-fields recursively. This does not cause a correctness issue, but it causes a performance issue: we may enable vectorization for nested columns by mistake when a nested column has many sub-fields.
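The idea of the recursive count, sketched in Python against PySpark's type objects (the actual check lives in Scala in `WholeStageCodegenExec`):
```python
from pyspark.sql.types import ArrayType, DataType, MapType, StructType

def num_fields(dt: DataType) -> int:
    """Count leaf fields recursively, so nested columns are not undercounted."""
    if isinstance(dt, StructType):
        return sum(num_fields(f.dataType) for f in dt.fields)
    if isinstance(dt, ArrayType):
        return num_fields(dt.elementType)
    if isinstance(dt, MapType):
        return num_fields(dt.keyType) + num_fields(dt.valueType)
    return 1
```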
### Why are the changes needed?
Avoid OOM/performance regression when reading ORC table with nested column types.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Added unit test in `OrcQuerySuite.scala`. Verified test failed without this change.
Closes #33842 from c21/field-fix.
Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?
This PR proposes to re-enable the tests that were disabled because of behavior differences in pandas 1.3.
- The `inplace` argument for `CategoricalDtype` functions is deprecated as of pandas 1.3, and it seems to have a bug, so we manually created the expected results and test against them.
- Fixed `GroupBy.transform`, since it doesn't work properly for `CategoricalDtype`.
### Why are the changes needed?
We should enable the tests as much as possible, even if pandas has a bug.
And we should follow the behavior of the latest pandas.
### Does this PR introduce _any_ user-facing change?
Yes, `GroupBy.transform` now follows the behavior of the latest pandas.
### How was this patch tested?
Unittests.
Closes #33817 from itholic/SPARK-36537.
Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR proposes improving test coverage for pandas-on-Spark DataFrame code base, which is written in `frame.py`.
This PR did the following to improve coverage:
- Add unittest for untested code
- Remove unused code
- Add arguments to some functions for testing
### Why are the changes needed?
To make the project healthier by improving coverage.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Unittest.
Closes #33833 from itholic/SPARK-36505.
Authored-by: itholic <haejoon.lee@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Compare the 3.2.0 API doc with the latest release version 3.1.2. Fix the following issues:
- Add missing `Since` annotation for new APIs
- Remove leaking classes/objects from the API doc
### Why are the changes needed?
Improve API docs
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing UT
Closes #33824 from gengliangwang/auditDoc.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?
When `spark.sql.parser.quotedRegexColumnNames=true` and a pattern is used in a place where it is not allowed, the message is a little bit confusing:
```
scala> spark.sql("set spark.sql.parser.quotedRegexColumnNames=true")
scala> spark.sql("SELECT `col_.?`/col_b FROM (SELECT 3 AS col_a, 1 as col_b)")
org.apache.spark.sql.AnalysisException: Invalid usage of '*' in expression 'divide'
```
This PR attempts to improve the error message
```
scala> spark.sql("SELECT `col_.?`/col_b FROM (SELECT 3 AS col_a, 1 as col_b)")
org.apache.spark.sql.AnalysisException: Invalid usage of regular expression in expression 'divide'
```
### Why are the changes needed?
To clarify the error message with this option active
### Does this PR introduce _any_ user-facing change?
Yes, change the error message
### How was this patch tested?
Unit testing and manual testing
Closes #33802 from planga82/feature/spark36488_improve_error_message.
Authored-by: Pablo Langa <soypab@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR modifies `FileScan.estimateStatistics()` to take the read schema into account.
### Why are the changes needed?
`V2ScanRelationPushDown` can column-prune `DataSourceV2ScanRelation`s and change the read schema of `Scan` operations. The better statistics returned by `FileScan.estimateStatistics()` can mean better query plans. For example, with this change the broadcast issue in SPARK-36568 can be avoided.
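A hypothetical way to observe the effect (the path and column are made up): the cost-mode plan's `sizeInBytes` for the scan should now reflect only the pruned read schema.
```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = spark.read.parquet("/tmp/wide_table").select("a")  # prune to a single column
df.explain(mode="cost")  # check sizeInBytes of the scan in the optimized plan
```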
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Added new UT.
Closes #33825 from peter-toth/SPARK-36568-scan-statistics-estimation.
Authored-by: Peter Toth <peter.toth@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
In the PR, I propose to use the session time zone (see the SQL config `spark.sql.session.timeZone`) instead of the JVM default time zone when converting special timestamp_ntz strings such as "today", "tomorrow" and so on.
### Why are the changes needed?
The current implementation is based on the system time zone, which contradicts other functions/classes that use the session time zone. For example, Spark doesn't respect the user's settings:
```sql
$ export TZ="Europe/Amsterdam"
$ ./bin/spark-sql -S
spark-sql> select timestamp_ntz'now';
2021-08-25 18:12:36.233
spark-sql> set spark.sql.session.timeZone=America/Los_Angeles;
spark.sql.session.timeZone America/Los_Angeles
spark-sql> select timestamp_ntz'now';
2021-08-25 18:14:40.547
```
### Does this PR introduce _any_ user-facing change?
Yes. For the example above, after the changes:
```sql
spark-sql> select timestamp_ntz'now';
2021-08-25 18:47:46.832
spark-sql> set spark.sql.session.timeZone=America/Los_Angeles;
spark.sql.session.timeZone America/Los_Angeles
spark-sql> select timestamp_ntz'now';
2021-08-25 09:48:05.211
```
### How was this patch tested?
By running the affected test suites:
```
$ build/sbt "test:testOnly *DateTimeUtilsSuite"
```
Closes #33838 from MaxGekk/fix-ts_ntz-special-values.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
In the PR, I propose to update an existing item in the SQL migration guide, and mention that Spark 3.2 supports foldable special datetime values as well.
<img width="1292" alt="Screenshot 2021-08-25 at 23 29 51" src="https://user-images.githubusercontent.com/1580697/130860184-27f0ba56-6c2d-4a5a-91a8-195f2f8aa5da.png">
### Why are the changes needed?
To inform users about actual Spark SQL behavior introduced by https://github.com/apache/spark/pull/33816
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
By generating docs, and checking results manually.
Closes #33840 from MaxGekk/special-datetime-cast-migr-guide.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Generalize the pod allocator and add support for statefulsets.
### Why are the changes needed?
Allocating individual pods in Spark can be less than ideal for some clusters, and using higher-level operators like StatefulSets and ReplicaSets can be useful.
### Does this PR introduce _any_ user-facing change?
Yes, new config options.
### How was this patch tested?
Completed: new unit tests, a basic integration test, and PV integration tests.
Closes #33508 from holdenk/SPARK-36058-support-replicasets-or-job-api-like-things.
Lead-authored-by: Holden Karau <hkarau@netflix.com>
Co-authored-by: Holden Karau <holden@pigscanfly.ca>
Signed-off-by: Holden Karau <hkarau@netflix.com>
### What changes were proposed in this pull request?
Spark 3.2.0 includes two new functions, `regexp` and `regexp_like`, which are identical to `rlike`. However, in the generated documentation, the `since` versions of both functions are `1.0.0`, since they are based on the expression `RLike`:
- https://dist.apache.org/repos/dist/dev/spark/v3.2.0-rc1-docs/_site/api/sql/index.html#regexp
- https://dist.apache.org/repos/dist/dev/spark/v3.2.0-rc1-docs/_site/api/sql/index.html#regexp_like
This PR is to:
* Support setting `since` version in FunctionRegistry
* Correct the `since` version of `regexp` and `regexp_like`
### Why are the changes needed?
Correct the SQL doc
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Run
```
sh sql/create-docs.sh
```
and check the SQL doc manually
Closes #33834 from gengliangwang/allowSQLFunVersion.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
### What changes were proposed in this pull request?
This PR fixes an issue where there is no way to redact sensitive information in the Spark Thrift Server log.
For example, JDBC password can be exposed in the log.
```
21/08/25 18:52:37 INFO SparkExecuteStatementOperation: Submitting query 'CREATE TABLE mytbl2(a int) OPTIONS(url="jdbc:mysql//example.com:3306", driver="com.mysql.jdbc.Driver", dbtable="test_tbl", user="test_usr", password="abcde")' with ca14ae38-1aaf-4bf4-a099-06b8e5337613
```
### Why are the changes needed?
Bug fix.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Ran ThriftServer, connect to it and execute `CREATE TABLE mytbl2(a int) OPTIONS(url="jdbc:mysql//example.com:3306", driver="com.mysql.jdbc.Driver", dbtable="test_tbl", user="test_usr", password="abcde");` with `spark.sql.redaction.string.regex=((?i)(?<=password=))(".*")|('.*')`
Then, confirmed the log.
```
21/08/25 18:54:11 INFO SparkExecuteStatementOperation: Submitting query 'CREATE TABLE mytbl2(a int) OPTIONS(url="jdbc:mysql//example.com:3306", driver="com.mysql.jdbc.Driver", dbtable="test_tbl", user="test_usr", password=*********(redacted))' with ffc627e2-b1a8-4d83-ab6d-d819b3ccd909
```
Closes #33832 from sarutak/fix-SPARK-36398.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
### What changes were proposed in this pull request?
In the PR, I propose to add new correctness rule `SpecialDatetimeValues` to the final analysis phase. It replaces casts of strings to date/timestamp_ltz/timestamp_ntz by literals of such types if the strings contain special datetime values like `today`, `yesterday` and `tomorrow`, and the input strings are foldable.
### Why are the changes needed?
1. To avoid a breaking change.
2. To improve user experience with Spark SQL. After the PR https://github.com/apache/spark/pull/32714, users have to use typed literals instead of implicit casts. For instance,
this worked in Spark 3.1:
```sql
select ts_col > 'now';
```
but the query now fails, and users have to use a typed timestamp literal:
```sql
select ts_col > timestamp'now';
```
### Does this PR introduce _any_ user-facing change?
No. The previous release, 3.1, already supported the feature until it was removed by https://github.com/apache/spark/pull/32714.
### How was this patch tested?
1. Manually tested via the sql command line:
```sql
spark-sql> select cast('today' as date);
2021-08-24
spark-sql> select timestamp('today');
2021-08-24 00:00:00
spark-sql> select timestamp'tomorrow' > 'today';
true
```
2. By running new test suite:
```
$ build/sbt "sql/testOnly org.apache.spark.sql.catalyst.optimizer.SpecialDatetimeValuesSuite"
```
Closes #33816 from MaxGekk/foldable-datetime-special-values.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR modifies `sql-ref-syntax-ddl-create-function.md` to mention `ARCHIVE` as an acceptable resource type for `CREATE FUNCTION` statement.
`ARCHIVE` is acceptable as of SPARK-35236 (#32359).
### Why are the changes needed?
To maintain the document.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
`SKIP_API=1 bundle exec jekyll build`
![create-function-archive](https://user-images.githubusercontent.com/4736016/130630637-dcddfd8c-543b-4d21-997c-d2deaf917a4f.png)
Closes #33823 from sarutak/create-function-archive.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR proposes to move distributed-sequence index implementation to SQL plan to leverage optimizations such as column pruning.
```python
import pyspark.pandas as ps
ps.set_option('compute.default_index_type', 'distributed-sequence')
ps.range(10).id.value_counts().to_frame().spark.explain()
```
**Before:**
```bash
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Sort [count#51L DESC NULLS LAST], true, 0
+- Exchange rangepartitioning(count#51L DESC NULLS LAST, 200), ENSURE_REQUIREMENTS, [id=#70]
+- HashAggregate(keys=[id#37L], functions=[count(1)], output=[__index_level_0__#48L, count#51L])
+- Exchange hashpartitioning(id#37L, 200), ENSURE_REQUIREMENTS, [id=#67]
+- HashAggregate(keys=[id#37L], functions=[partial_count(1)], output=[id#37L, count#63L])
+- Project [id#37L]
+- Filter atleastnnonnulls(1, id#37L)
+- Scan ExistingRDD[__index_level_0__#36L,id#37L]
# ^^^ Base DataFrame created by the output RDD from zipWithIndex (and checkpointed)
```
**After:**
```bash
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Sort [count#275L DESC NULLS LAST], true, 0
+- Exchange rangepartitioning(count#275L DESC NULLS LAST, 200), ENSURE_REQUIREMENTS, [id=#174]
+- HashAggregate(keys=[id#258L], functions=[count(1)])
+- HashAggregate(keys=[id#258L], functions=[partial_count(1)])
+- Filter atleastnnonnulls(1, id#258L)
+- Range (0, 10, step=1, splits=16)
# ^^^ Removed the Spark job execution for `zipWithIndex`
```
### Why are the changes needed?
To leverage the optimizations of the SQL engine and avoid the unnecessary shuffle used to create the default index.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Unittests were added. Also, this PR runs all unit tests in pandas API on Spark after switching the default index implementation to `distributed-sequence`.
Closes #33807 from HyukjinKwon/SPARK-36559.
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
When we refactor the query execution errors to use error classes in `QueryExecutionErrors`, we need to define some exceptions that mix `SparkThrowable` into a base exception type,
following the example of [SparkArithmeticException](f90eb6a5db/core/src/main/scala/org/apache/spark/SparkException.scala (L75)).
Add SparkXXXException as follows:
- `SparkClassNotFoundException`
- `SparkConcurrentModificationException`
- `SparkDateTimeException`
- `SparkFileAlreadyExistsException`
- `SparkFileNotFoundException`
- `SparkNoSuchMethodException`
- `SparkIndexOutOfBoundsException`
- `SparkIOException`
- `SparkSecurityException`
- `SparkSQLException`
- `SparkSQLFeatureNotSupportedException`
Refactor some exceptions in `QueryExecutionErrors` to use error classes and the new exceptions, in order to exercise the new exceptions.
Some exceptions were already added by this [PR](https://github.com/apache/spark/pull/33538):
- `SparkUnsupportedOperationException`
- `SparkIllegalStateException`
- `SparkNumberFormatException`
- `SparkIllegalArgumentException`
- `SparkArrayIndexOutOfBoundsException`
- `SparkNoSuchElementException`
### Why are the changes needed?
[SPARK-36336](https://issues.apache.org/jira/browse/SPARK-36336)
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing unit tests.
Closes #33573 from Peng-Lei/SPARK-36336.
Authored-by: PengLei <peng.8lei@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Revert 397b843890 and 5a48eb8d00
### Why are the changes needed?
As discussed in https://github.com/apache/spark/pull/33800#issuecomment-904140869, there is a correctness issue in the current implementation. Let's revert the code changes from branch-3.2 and fix it on the master branch later.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
CI tests.
Closes #33819 from gengliangwang/revert-SPARK-34415.
Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This PR fixes `NullPointerException` in `LiveRDDDistribution.toApi`.
### Why are the changes needed?
Looking at the stack trace, the NPE is caused by a null `exec.hostPort`. I can't get the complete log to take a closer look, but I can only guess that it might be because the event `SparkListenerBlockManagerAdded` was dropped or arrived out of order.
```
21/08/23 12:26:29 ERROR AsyncEventQueue: Listener AppStatusListener threw an exception
java.lang.NullPointerException
at com.google.common.base.Preconditions.checkNotNull(Preconditions.java:192)
at com.google.common.collect.MapMakerInternalMap.putIfAbsent(MapMakerInternalMap.java:3507)
at com.google.common.collect.Interners$WeakInterner.intern(Interners.java:85)
at org.apache.spark.status.LiveEntityHelpers$.weakIntern(LiveEntity.scala:696)
at org.apache.spark.status.LiveRDDDistribution.toApi(LiveEntity.scala:563)
at org.apache.spark.status.LiveRDD.$anonfun$doUpdate$4(LiveEntity.scala:629)
at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
at scala.collection.mutable.HashMap$$anon$2.$anonfun$foreach$3(HashMap.scala:158)
at scala.collection.mutable.HashTable.foreachEntry(HashTable.scala:237)
at scala.collection.mutable.HashTable.foreachEntry$(HashTable.scala:230)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:44)
at scala.collection.mutable.HashMap$$anon$2.foreach(HashMap.scala:158)
at scala.collection.TraversableLike.map(TraversableLike.scala:238)
at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
at scala.collection.AbstractTraversable.map(Traversable.scala:108)
at org.apache.spark.status.LiveRDD.doUpdate(LiveEntity.scala:629)
at org.apache.spark.status.LiveEntity.write(LiveEntity.scala:51)
at org.apache.spark.status.AppStatusListener.update(AppStatusListener.scala:1206)
at org.apache.spark.status.AppStatusListener.maybeUpdate(AppStatusListener.scala:1212)
at org.apache.spark.status.AppStatusListener.$anonfun$onExecutorMetricsUpdate$6(AppStatusListener.scala:956)
...
```
### Does this PR introduce _any_ user-facing change?
Yes, users will see the expected RDD info in UI instead of the NPE error.
### How was this patch tested?
Pass existing tests.
Closes #33812 from Ngone51/fix-hostport-npe.
Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
Currently, the procedure to run the Oracle Integration Suite is based on building the Oracle RDBMS image from the Dockerfiles provided by Oracle.
Recently, Oracle has started providing database images; see https://container-registry.oracle.com
Moreover, an Oracle employee maintains Oracle XE images that are streamlined for testing at https://hub.docker.com/r/gvenzl/oracle-xe and https://github.com/gvenzl/oci-oracle-xe. This solves the issue that the official images are quite large, which makes testing resource-intensive and slow.
This PR proposes to document the available options and to introduce a default value for `ORACLE_DOCKER_IMAGE`.
### Why are the changes needed?
This change will make it easier and faster to run the Oracle Integration Suite, removing the need to manually build an Oracle DB image.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Manually tested:
```
export ENABLE_DOCKER_INTEGRATION_TESTS=1
./build/sbt -Pdocker-integration-tests "testOnly org.apache.spark.sql.jdbc.OracleIntegrationSuite"
./build/sbt -Pdocker-integration-tests "testOnly org.apache.spark.sql.jdbc.v2.OracleIntegrationSuite"
```
Closes #33821 from LucaCanali/oracleDockerIntegration.
Authored-by: Luca Canali <luca.canali@cern.ch>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>