### What changes were proposed in this pull request?
This PR removes the Hive 1.2 option from the pip installation (and therefore the `HIVE_VERSION` environment variable as well).
### Why are the changes needed?
Hive 1.2 is an unofficial fork. We shouldn't encourage users to use it.
### Does this PR introduce _any_ user-facing change?
Nope. `HIVE_VERSION` and the Hive 1.2 option are removed, but this is a new experimental feature that exists in master only.
### How was this patch tested?
Manually tested:
```bash
SPARK_VERSION=3.0.1 HADOOP_VERSION=3.2 pip install pyspark-3.1.0.dev0.tar.gz -v
SPARK_VERSION=3.0.1 HADOOP_VERSION=2.7 pip install pyspark-3.1.0.dev0.tar.gz -v
SPARK_VERSION=3.0.1 HADOOP_VERSION=invalid pip install pyspark-3.1.0.dev0.tar.gz -v
```
Closes #29858 from HyukjinKwon/SPARK-32981.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR proposes migration of [`pyspark-stubs`](https://github.com/zero323/pyspark-stubs) into Spark codebase.
### Why are the changes needed?
Type annotations improve the development experience with PySpark: they enable static type checking and better editor/IDE completion, which were previously available only via the external `pyspark-stubs` package.
### Does this PR introduce _any_ user-facing change?
Yes. This PR adds type annotations directly to Spark source.
This can affect interaction with development tools for users who haven't used `pyspark-stubs`.
### How was this patch tested?
- [x] MyPy tests of the PySpark source
```
mypy --no-incremental --config python/mypy.ini python/pyspark
```
- [x] MyPy tests of Spark examples
```
MYPYPATH=python/ mypy --no-incremental --config python/mypy.ini examples/src/main/python/ml examples/src/main/python/sql examples/src/main/python/sql/streaming
```
- [x] Existing Flake8 linter
- [x] Existing unit tests
Tested against:
- `mypy==0.790+dev.e959952d9001e9713d329a2f9b196705b028f894`
- `mypy==0.782`
Closes #29591 from zero323/SPARK-32681.
Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This is a follow-up to fix the Scala 2.13 compilation in the Kubernetes module.
### Why are the changes needed?
To fix Scala 2.13 compilation.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass the GitHub Action Scala 2.13 compilation job.
Closes #29859 from dongjoon-hyun/SPARK-32971-2.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
The default save mode is always `ErrorIfExists`, regardless of the DataSource version. This fixes the JavaDoc to reflect that.
### Why are the changes needed?
To fix documentation
### Does this PR introduce _any_ user-facing change?
Doc change.
### How was this patch tested?
Manual.
Closes #29853 from RussellSpitzer/SPARK-32977.
Authored-by: Russell Spitzer <russell.spitzer@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR aims to support dynamic PVC creation and deletion for K8s executors. The PVCs are created with executor pods and deleted when the executor pods are deleted.
**Configuration**
This PR mostly reuses the existing PVC volume configs; `storageClass` is newly added.
```
spark.executor.instances=2
spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.claimName=OnDemand
spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.storageClass=gp2
spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.options.sizeLimit=500Gi
spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.mount.path=/data
spark.kubernetes.executor.volumes.persistentVolumeClaim.spark-local-dir-1.mount.readOnly=false
```
**Executors**
```
$ kubectl get pod -l spark-role=executor
NAME                               READY   STATUS    RESTARTS   AGE
spark-pi-f4d80574b9bb0941-exec-1   1/1     Running   0          2m6s
spark-pi-f4d80574b9bb0941-exec-2   1/1     Running   0          2m6s
```
**PVCs**
```
$ kubectl get pvc
NAME                                     STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
spark-pi-f4d80574b9bb0941-exec-1-pvc-0   Bound    pvc-7d20173f-278b-4c7b-b7e5-7f0ed414ee64   500Gi      RWO            gp2            48s
spark-pi-f4d80574b9bb0941-exec-2-pvc-0   Bound    pvc-1138f00d-87f1-47f4-9b58-ce5d13ea0c3a   500Gi      RWO            gp2            48s
```
**Executor Disk**
```
$ k exec -it spark-pi-f4d80574b9bb0941-exec-1 -- df -h /data
Filesystem Size Used Avail Use% Mounted on
/dev/nvme3n1 493G 74M 492G 1% /data
```
```
$ k exec -it spark-pi-f4d80574b9bb0941-exec-1 -- ls /data
blockmgr-81dcebaf-11a7-4d7b-91d6-3c580187d914
lost+found
spark-6be42db8-2c58-4389-b52c-8aeeafe76bd5
```
### Why are the changes needed?
While SPARK-32655 supports mounting a pre-created PVC, this PR creates PVCs dynamically, reducing a lot of manual effort.
### Does this PR introduce _any_ user-facing change?
Yes. This is a new feature.
### How was this patch tested?
Pass the newly added test cases.
Closes #29846 from dongjoon-hyun/SPARK-32971.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Fixes the log strings the decommissioning integration tests look for, and adds comments reminding people to run the K8s integration tests when changing those code paths.
### Why are the changes needed?
The strings it looks for have been changed.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
WIP: Verify that the K8s jenkins job succeeds
Closes #29854 from holdenk/SPARK-32979-spark-k8s-decom-test-is-broken.
Authored-by: Holden Karau <hkarau@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Apache Spark 3.0 switched its Hive execution version from 1.2 to 2.3, but we still provide the unofficial forked Hive 1.2 version in our distribution, like the following. This PR aims to remove it from Apache Spark 3.1.0 officially while keeping the `hive-1.2` profile.
```
spark-3.0.1-bin-hadoop2.7-hive1.2.tgz
spark-3.0.1-bin-hadoop2.7-hive1.2.tgz.asc
spark-3.0.1-bin-hadoop2.7-hive1.2.tgz.sha512
```
### Why are the changes needed?
The unofficial Hive 1.2.1 fork has many bugs and has not been maintained for a long time. We had better not recommend it in the official Apache Spark distribution.
### Does this PR introduce _any_ user-facing change?
There is no user-facing change in the default distribution (Hadoop 3.2/Hive 2.3).
### How was this patch tested?
Manually, because this is a change to the release script.
Closes #29856 from dongjoon-hyun/SPARK-32981.
Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
Remove unnecessary code.
### Why are the changes needed?
General housekeeping. Might be a slight performance improvement, especially on big-endian systems.
There is no need for separate code paths for big- and little-endian platforms in `putDoubles` and `putFloats` anymore (since PR #24861). On all platforms, values are encoded in native byte order and can just be copied directly.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes #29815 from mundaym/clean-putfloats.
Authored-by: Michael Munday <mike.munday@ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
MurmurHash3 and xxHash64 interpret sequences of bytes as integers encoded in little-endian byte order. This requires a byte reversal on big-endian platforms.
I've left the hashInt and hashLong functions as-is for now. My interpretation of these functions is that they perform the hash on the integer value as if it were serialized in little-endian byte order. Therefore no byte reversal is necessary.
### What changes were proposed in this pull request?
Modify hash functions to produce correct results on big-endian platforms.
### Why are the changes needed?
Hash functions produce incorrect results on big-endian platforms which, amongst other potential issues, causes test failures.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing tests run on the IBM Z (s390x) platform which uses a big-endian byte order.
Closes #29762 from mundaym/fix-hashes.
Authored-by: Michael Munday <mike.munday@ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Revert blockify GMM.
### Why are the changes needed?
WeichenXu123 and I think we should use memory size instead of number of rows to decide blockification; if a buffer's size is large and determined by the number of rows, we should discard it.
In GMM, we found that the pre-allocated memory may be too large and should be discarded:
```
transient private lazy val auxiliaryPDFMat = DenseMatrix.zeros(blockSize, numFeatures)
```
We had some offline discussion and thought it is better to revert blockify GMM.
### Does this PR introduce _any_ user-facing change?
The `blockSize` parameter added in the master branch will be removed.
### How was this patch tested?
Existing test suites.
Closes #29782 from zhengruifeng/unblockify_gmm.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
This PR addresses two issues related to the `Relation: view text` test in `DataSourceV2SQLSuite`.
1. The test has the following block:
```scala
withView("view1") { v1: String =>
sql(...)
}
```
Since `withView`'s signature is `withView(v: String*)(f: => Unit): Unit`, the `f` that will be evaluated is `v1: String => sql(..)`, which just defines the anonymous function and does _not_ execute it (see the sketch below).
2. Once the test is fixed to run, it actually fails. The reason is that the v2 session catalog implementation used in tests does not correctly handle `V1Table` for views in `loadTable`, and this results in views being resolved to `ResolvedTable` instead of `ResolvedView`, causing the test failure: f1dc479d39/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Analyzer.scala (L1007-L1011)
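A minimal standalone sketch of the first issue (a hypothetical `withView` mirroring the real signature): passing a function literal to a by-name `=> Unit` parameter only constructs the lambda; its body never runs.
```scala
def withView(names: String*)(f: => Unit): Unit = {
  try f finally names.foreach(n => println(s"DROP VIEW $n"))
}

// Intended usage: the block is executed.
withView("view1") { println("body runs") }

// Buggy usage: `{ v1: String => ... }` is a Function1 value; forcing the
// by-name parameter just evaluates (and discards) the lambda, so nothing
// inside it, e.g. sql(...), is ever called.
withView("view1") { v1: String => println(s"never printed: $v1") }
```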
### Why are the changes needed?
Fixing a bug in test.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing test.
Closes #29811 from imback82/fix_minor_test.
Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
In this PR, I propose to replace the current examples for `percentile_approx()`, which use **only one** input value, with an example using **multiple values** in the input column.
### Why are the changes needed?
The current examples are pretty trivial and don't demonstrate the function's behaviour on a sequence of values.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- by running `ExpressionInfoSuite`
- `./dev/scalastyle`
Closes #29841 from MaxGekk/example-percentile_approx.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR proposes to add a way to select Hadoop and Hive versions in pip installation.
Users can select Hive or Hadoop versions as below:
```bash
HADOOP_VERSION=3.2 pip install pyspark
HIVE_VERSION=1.2 pip install pyspark
HIVE_VERSION=1.2 HADOOP_VERSION=2.7 pip install pyspark
```
When the environment variables are set, internally it downloads the corresponding Spark version and then sets the Spark home to it. Also this PR exposes a mirror to set as an environment variable, `PYSPARK_RELEASE_MIRROR`.
**Please NOTE that:**
- We cannot currently leverage pip's native installation option, for example:
```bash
pip install pyspark --install-option="hadoop3.2"
```
This is because of a limitation and bug in pip itself. Once they fix this issue, we can switch from the environment variables to the proper installation options, see SPARK-32837.
It IS possible to work around, but it is very ugly/hacky and needs a big change. See [this PR](https://github.com/microsoft/nni/pull/139/files) as an example.
- In the pip installation, we pack the relevant jars together. This PR _does not touch the existing packaging way_ in order to prevent any behaviour changes.
Once this experimental way is proven to be safe, we can avoid packing the relevant jars together (and keep only the relevant Python scripts), and download the Spark distribution as this PR proposes.
- This way is sort of consistent with SparkR:
SparkR provides a method `SparkR::install.spark` to support CRAN installation. This is fine because SparkR is provided purely as an R library. For example, the `sparkr` script is not packed together.
PySpark cannot take this approach because the PySpark packaging ships the relevant executable scripts together, e.g. the `pyspark` shell.
If PySpark had a method such as `pyspark.install_spark`, users could not call it in `pyspark` because `pyspark` already assumes the relevant Spark is installed, the JVM is launched, etc.
- There looks to be no way to publish releases that contain different Hadoop or Hive versions to PyPI, due to [the version semantics](https://www.python.org/dev/peps/pep-0440/), so this is not an option.
Given my investigation, the usual way is either `--install-option` above (with hacks) or environment variables.
### Why are the changes needed?
To provide users the options to select Hadoop and Hive versions.
### Does this PR introduce _any_ user-facing change?
Yes, users will be able to select the Hive and Hadoop versions as below when they install from `pip`:
```bash
HADOOP_VERSION=3.2 pip install pyspark
HIVE_VERSION=1.2 pip install pyspark
HIVE_VERSION=1.2 HADOOP_VERSION=2.7 pip install pyspark
```
### How was this patch tested?
Unit tests were added. I also manually tested on Mac and Windows (after building Spark and producing `python/dist/pyspark-3.1.0.dev0.tar.gz`):
```bash
./build/mvn -DskipTests -Phive-thriftserver clean package
```
Mac:
```bash
SPARK_VERSION=3.0.1 HADOOP_VERSION=3.2 pip install pyspark-3.1.0.dev0.tar.gz
```
Windows:
```bash
set HADOOP_VERSION=3.2
set SPARK_VERSION=3.0.1
pip install pyspark-3.1.0.dev0.tar.gz
```
Closes #29703 from HyukjinKwon/SPARK-32017.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR adjusts signatures of methods decorated with `keyword_only` to indicate using [Python 3 keyword-only syntax](https://www.python.org/dev/peps/pep-3102/).
__Note__:
For the moment the goal is not to replace `keyword_only`. For justification see https://github.com/apache/spark/pull/29591#discussion_r489402579
### Why are the changes needed?
Right now it is not clear that `keyword_only` methods are indeed keyword only. This proposal addresses that.
In practice we could probably capture `locals` and drop `keyword_only` completely, i.e.:
```python
@keyword_only
def __init__(self, *, featuresCol="features"):
    ...
    kwargs = self._input_kwargs
    self.setParams(**kwargs)
```
could be replaced with
```python
def __init__(self, *, featuresCol="features"):
    kwargs = locals()
    del kwargs["self"]
    ...
    self.setParams(**kwargs)
```
### Does this PR introduce _any_ user-facing change?
Docstrings and inspect tools will now indicate that `keyword_only` methods expect only keyword arguments.
For example, `LinearSVC` will change from
```
>>> from pyspark.ml.classification import LinearSVC
>>> ?LinearSVC.__init__
Signature:
LinearSVC.__init__(
self,
featuresCol='features',
labelCol='label',
predictionCol='prediction',
maxIter=100,
regParam=0.0,
tol=1e-06,
rawPredictionCol='rawPrediction',
fitIntercept=True,
standardization=True,
threshold=0.0,
weightCol=None,
aggregationDepth=2,
)
Docstring: __init__(self, featuresCol="features", labelCol="label", predictionCol="prediction", maxIter=100, regParam=0.0, tol=1e-6, rawPredictionCol="rawPrediction", fitIntercept=True, standardization=True, threshold=0.0, weightCol=None, aggregationDepth=2):
File: /path/to/python/pyspark/ml/classification.py
Type: function
```
to
```
>>> from pyspark.ml.classification import LinearSVC
>>> ?LinearSVC.__init__
Signature:
LinearSVC.__init__(
self,
*,
featuresCol='features',
labelCol='label',
predictionCol='prediction',
maxIter=100,
regParam=0.0,
tol=1e-06,
rawPredictionCol='rawPrediction',
fitIntercept=True,
standardization=True,
threshold=0.0,
weightCol=None,
aggregationDepth=2,
blockSize=1,
)
Docstring: __init__(self, \*, featuresCol="features", labelCol="label", predictionCol="prediction", maxIter=100, regParam=0.0, tol=1e-6, rawPredictionCol="rawPrediction", fitIntercept=True, standardization=True, threshold=0.0, weightCol=None, aggregationDepth=2, blockSize=1):
File: ~/Workspace/spark/python/pyspark/ml/classification.py
Type: function
```
### How was this patch tested?
Existing tests.
Closes #29799 from zero323/SPARK-32933.
Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
A more precise description of the result of the `percentile_approx()` function and its synonym `approx_percentile()`. The proposed sentence clarifies that the function returns **one of the elements** (or an array of elements) from the input column.
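A hedged illustration (mirroring the examples introduced in #29841 above, and assuming an active `spark` session): the result is an actual element of the column, not an interpolated value.
```scala
// percentile_approx returns one of the input elements: here the
// approximate median of 0, 1, 2, 10 is 1, not an interpolated value
// like 1.5. The result of this query is [1, 1, 0].
spark.sql(
  """SELECT percentile_approx(col, array(0.5, 0.4, 0.1), 100)
    |FROM VALUES (0), (1), (2), (10) AS tab(col)""".stripMargin
).show(truncate = false)
```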
### Why are the changes needed?
To improve Spark docs and avoid misunderstanding of the function behavior.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
`./dev/scalastyle`
Closes #29835 from MaxGekk/doc-percentile_approx.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
### What changes were proposed in this pull request?
This is a minor followup of https://github.com/apache/spark/pull/29601 , to preserve the attribute name in `SubqueryBroadcastExec.output`.
### Why are the changes needed?
During explain, it's better to see the original column name instead of always "key".
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
existing tests.
Closes #29839 from cloud-fan/followup2.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
There is only one failed case in the `streaming` module in Scala 2.13: `start with non-serializable DStream checkpoint` in `StreamingContextSuite`.
A `StackOverflowError` is thrown here when the `SerializationDebugger#visit` method is called.
I found that `inputStreams` and `outputStreams` in `DStreamGraph` cannot be matched in the `SerializationDebugger#visit` method because `ArrayBuffer` is not `Array` in Scala 2.13.
The main change of this PR is to use `mutable.ArraySeq` instead of `ArrayBuffer` to store `inputStreams` and `outputStreams` in `DStreamGraph`; then they can be matched in the `SerializationDebugger#visit` method.
### Why are the changes needed?
We need to support a Scala 2.13 build.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- Scala 2.12: Pass the Jenkins or GitHub Action
- Scala 2.13: Pass GitHub 2.13 Build Action
Do the following:
```
dev/change-scala-version.sh 2.13
mvn clean install -DskipTests -pl streaming -Pscala-2.13 -am
mvn test -pl streaming -Pscala-2.13
mvn test -pl core -Pscala-2.13
```
streaming module:
```
Tests: succeeded 339, failed 0, canceled 0, ignored 2, pending 0
All tests passed.
```
core module:
```
Tests: succeeded 2648, failed 0, canceled 4, ignored 7, pending 0
All tests passed.
```
Closes #29836 from LuciferYang/fix-streaming-213.
Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This is a followup of https://github.com/apache/spark/pull/29475.
This PR updates the code to broadcast the Array instead of the Set, which was the behavior before #29475.
### Why are the changes needed?
The serialized size of a Set can be much bigger than that of an Array. It's safer to keep the behavior the same as before and build the Set on the executor side.
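A minimal sketch of the pattern (illustrative data and names, not the PR's exact code): broadcast the compact `Array` and build the `Set` lazily on the executor side.
```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
val sc = spark.sparkContext

val values: Array[Int] = (1 to 100000).toArray

// A Set's serialized form (hash-table layout) can be much larger than
// the plain Array, so ship the Array over the wire...
val broadcasted = sc.broadcast(values)

val matched = sc.parallelize(1 to 1000000).mapPartitions { iter =>
  // ...and rebuild the Set where it is needed, on the executor.
  val set = broadcasted.value.toSet
  iter.filter(set.contains)
}.count()
```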
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
existing tests
Closes #29838 from cloud-fan/followup.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR fixes a link in `_layouts/global.html`.
The `More` item in the navigation bar in the WebUI links to `api.html`, but that seems to be wrong.
This PR also removes `api.md` because neither it nor the `api.html` generated from it is referenced from anywhere.
### Why are the changes needed?
Fix the wrong link.
### Does this PR introduce _any_ user-facing change?
Yes. "More" item no longer links to `api.html`.
### How was this patch tested?
`SKIP_API=1 jekyll build` and confirmed that the item no longer links to `api.html`.
I also confirmed that `api.md` and `api.html` are no longer referenced from anywhere with the following command:
```
$ grep -Erl "api\.(html|md)" docs
```
Closes #29821 from sarutak/fix-api-doc-link.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR adds R wrapper for `timestamp_seconds` function.
### Why are the changes needed?
Feature parity.
### Does this PR introduce _any_ user-facing change?
Yes, it adds a new R function.
### How was this patch tested?
New unit tests.
Closes #29822 from zero323/SPARK-32949.
Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR adds foldable propagation from `Aggregate` as per: https://github.com/apache/spark/pull/29771#discussion_r490412031
### Why are the changes needed?
This is an improvement as `Aggregate`'s `aggregateExpressions` can contain foldables that can be propagated up.
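An illustrative query (assuming an active `spark` session, e.g. in spark-shell; not a test from the PR) where a foldable alias in the aggregate list can be propagated into the operators above it:
```scala
import org.apache.spark.sql.functions._
import spark.implicits._

// `lit("x").as("tag")` is a foldable aggregate expression; the reference
// to `tag` in the filter above the Aggregate can be replaced with the
// literal "x", letting later rules fold the predicate away.
val df = spark.range(10)
  .groupBy(($"id" % 2).as("g"))
  .agg(lit("x").as("tag"), count(lit(1)).as("cnt"))
  .where($"tag" === "x")
df.explain(true)
```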
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
New UT.
Closes #29816 from peter-toth/SPARK-32951-foldable-propagation-from-aggregate.
Authored-by: Peter Toth <peter.toth@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Now that the minimum version of PyArrow is `1.0.0`, we should update the version in the installation guide.
### Why are the changes needed?
The minimum version of PyArrow was upgraded to `1.0.0`.
### Does this PR introduce _any_ user-facing change?
Users see the correct minimum version in the installation guide.
### How was this patch tested?
N/A
Closes #29829 from ueshin/issues/SPARK-32312/doc.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Fix ".../jobs/undefined" link from "Event Timeline" in jobs page. Job page link in "Event Timeline" view is constructed by fetching job page link defined in job list below. when job count exceeds page size of job table, only links of jobs in job table can be fetched from page. Other jobs' link would be 'undefined', and links of them in "Event Timeline" are broken, they are redirected to some wired URL like ".../jobs/undefined". This PR is fixing this wrong link issue. With this PR, job link in "Event Timeline" view would always redirect to correct job page.
### Why are the changes needed?
Wrong link (".../jobs/undefined") in "Event Timeline" of jobs page. for example, the first job in below page is not in table below, as job count(116) exceeds page size(100). When clicking it's item in "Event Timeline", page is redirected to ".../jobs/undefined", which is wrong. Links in "Event Timeline" should always be correct.
![undefinedlink](https://user-images.githubusercontent.com/10524738/93184779-83fa6d80-f6f1-11ea-8a80-1a304ca9cbb2.JPG)
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Manually tested.
Closes #29757 from zhli1142015/fix-link-event-timeline-view.
Authored-by: Zhen Li <zhli@microsoft.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Currently, when explaining a SQL plan with a HiveTableRelation, it shows so much info about the HiveTableRelation's pruned partitions that the plan is hard to read. This PR makes this information simpler.
Before:
![image](https://user-images.githubusercontent.com/46485123/93012078-aeeca080-f5cf-11ea-9286-f5c15eadbee3.png)
For the UT:
```
test("Make HiveTableScanExec message simple") {
withSQLConf("hive.exec.dynamic.partition.mode" -> "nonstrict") {
withTable("df") {
spark.range(30)
.select(col("id"), col("id").as("k"))
.write
.partitionBy("k")
.format("hive")
.mode("overwrite")
.saveAsTable("df")
val df = sql("SELECT df.id, df.k FROM df WHERE df.k < 2")
df.explain(true)
}
}
}
```
After this PR, it will show:
```
== Parsed Logical Plan ==
'Project ['df.id, 'df.k]
+- 'Filter ('df.k < 2)
+- 'UnresolvedRelation [df], []
== Analyzed Logical Plan ==
id: bigint, k: bigint
Project [id#11L, k#12L]
+- Filter (k#12L < cast(2 as bigint))
+- SubqueryAlias spark_catalog.default.df
+- HiveTableRelation [`default`.`df`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [id#11L], Partition Cols: [k#12L]]
== Optimized Logical Plan ==
Filter (isnotnull(k#12L) AND (k#12L < 2))
+- HiveTableRelation [`default`.`df`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [id#11L], Partition Cols: [k#12L], Pruned Partitions: [(k=0), (k=1)]]
== Physical Plan ==
Scan hive default.df [id#11L, k#12L], HiveTableRelation [`default`.`df`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Data Cols: [id#11L], Partition Cols: [k#12L], Pruned Partitions: [(k=0), (k=1)]], [isnotnull(k#12L), (k#12L < 2)]
```
In this PR, I rework `HiveTableRelation`'s `simpleString` method to avoid showing too much unnecessary info in the explain plan. Compared to what we had before, I drop the detailed metadata of each partition and only retain the partition spec to show which partitions were pruned. For detailed information we don't look at the plan anyway, but use the DESC EXTENDED statement.
### Why are the changes needed?
Make plans with HiveTableRelation more readable.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
No
Closes #29739 from AngersZhuuuu/HiveTableScan-meta-location-info.
Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR adds the `withColumn` function to SparkR.
### Why are the changes needed?
Feature parity.
### Does this PR introduce _any_ user-facing change?
Yes, a new function, equivalent to the Scala and PySpark versions, is exposed to the end user.
### How was this patch tested?
New unit tests added.
Closes #29814 from zero323/SPARK-32946.
Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Add optional `allowMissingColumns` argument to SparkR `unionByName`.
### Why are the changes needed?
Feature parity.
### Does this PR introduce _any_ user-facing change?
`unionByName` supports `allowMissingColumns`.
### How was this patch tested?
Existing unit tests. New unit tests targeting this feature.
Closes #29813 from zero323/SPARK-32799.
Authored-by: zero323 <mszymkiewicz@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR:
- rephrases some wording in the installation guide to avoid terms that can be potentially ambiguous, such as "different flavors"
- documents extra dependency installation `pip install pyspark[sql]`
- uses links that correspond to the released version, e.g. https://spark.apache.org/docs/3.0.0/building-spark.html instead of https://spark.apache.org/docs/latest/building-spark.html
- adds some more details
I built it on Read the Docs to make it easier to review: https://hyukjin-spark.readthedocs.io/en/stable/getting_started/install.html
### Why are the changes needed?
To improve installation guide.
### Does this PR introduce _any_ user-facing change?
Yes, it updates the user-facing installation guide.
### How was this patch tested?
Manually built the doc and tested.
Closes #29779 from HyukjinKwon/SPARK-32180.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Only calculate `executorRunTime` when `taskStartTimeNs > 0`; otherwise, set `executorRunTime` to 0.
### Why are the changes needed?
Bug fix.
It's possible for a task to be killed (e.g., by another successful attempt) before it reaches `taskStartTimeNs = System.nanoTime()`. In this case, `taskStartTimeNs` is still 0 since it hasn't really been initialized, and we would get a wrong `executorRunTime` by calculating `System.nanoTime() - taskStartTimeNs`.
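A simplified sketch of the guard (hypothetical shape; the real change lives in the executor's task bookkeeping):
```scala
// taskStartTimeNs remains 0 if the task was killed before reaching
// `taskStartTimeNs = System.nanoTime()`.
var taskStartTimeNs: Long = 0L

def executorRunTimeMs(taskFinishNs: Long): Long =
  if (taskStartTimeNs > 0) {
    (taskFinishNs - taskStartTimeNs) / 1000000L
  } else {
    0L // never started: report 0 instead of `taskFinishNs - 0`, a garbage value
  }
```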
### Does this PR introduce _any_ user-facing change?
Yes, users will see the correct executorRunTime.
### How was this patch tested?
Pass existing tests.
Closes #29789 from Ngone51/fix-SPARK-32898.
Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This is a follow-up PR to https://github.com/apache/spark/pull/29771 and just adds a new test case.
### Why are the changes needed?
To have better test coverage.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
New UT.
Closes #29802 from peter-toth/SPARK-32635-fix-foldable-propagation-followup.
Authored-by: Peter Toth <peter.toth@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
After https://github.com/apache/spark/pull/29660 and https://github.com/apache/spark/pull/29689 there are 13 remaining failed cases in the sql/core module with Scala 2.13.
The reason for the remaining failed cases is that the optimization result of `CostBasedJoinReorder` may differ for the same input between Scala 2.12 and Scala 2.13 if there is more than one candidate plan with the same cost.
This PR gives a way to make the optimization result as deterministic as possible so that all remaining failed cases of the `sql/core` module pass in Scala 2.13. The main changes of this PR are as follows:
- Change to use `LinkedHashMap` instead of `Map` to store `foundPlans` in the `JoinReorderDP.search` method, to ensure the same iteration order for the same insertion order, because the iteration order of `Map` behaves differently under Scala 2.12 and 2.13 (see the sketch after this list)
- Fixed `StarJoinCostBasedReorderSuite` affected by the above change
- Regenerate golden files affected by the above change.
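A minimal sketch of the first bullet's point: hash `Map` iteration order is an implementation detail (and differs between Scala 2.12 and 2.13), while `LinkedHashMap` iterates in insertion order on both.
```scala
import scala.collection.mutable

val plans = mutable.LinkedHashMap.empty[String, Int]
plans += ("planC" -> 3)
plans += ("planA" -> 1)
plans += ("planB" -> 2)

// Always insertion order on 2.12 and 2.13 alike: planC, planA, planB.
// A plain mutable.HashMap would give a hash-dependent order instead,
// which changed between Scala versions and made the chosen candidate
// plan differ when several had the same cost.
plans.keys.foreach(println)
```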
### Why are the changes needed?
We need to support a Scala 2.13 build.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- Scala 2.12: Pass the Jenkins or GitHub Action
- Scala 2.13: All tests passed.
Do the following:
```
dev/change-scala-version.sh 2.13
mvn clean install -DskipTests -pl sql/core -Pscala-2.13 -am
mvn test -pl sql/core -Pscala-2.13
```
**Before**
```
Tests: succeeded 8485, failed 13, canceled 1, ignored 52, pending 0
*** 13 TESTS FAILED ***
```
**After**
```
Tests: succeeded 8498, failed 0, canceled 1, ignored 52, pending 0
All tests passed.
```
Closes #29711 from LuciferYang/SPARK-32808-3.
Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
This PR fixes all 14 failed cases in the `external/avro` module in Scala 2.13. The main changes of this PR are as follows:
- Manually call `toSeq` in the `AvroDeserializer#newWriter` and `SchemaConverters#toSqlTypeHelper` methods, because the object type in the case match is `ArrayBuffer`, not `Seq`, in Scala 2.13
- Change `Seq` to `s.c.Seq` where we call `Row.get(i).asInstanceOf[Seq]`, because the data may be a `mutable.ArraySeq` while `Seq` is `immutable.Seq` in Scala 2.13 (see the sketch below)
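A minimal sketch of both bullets' root cause: the default `Seq` alias means `scala.collection.immutable.Seq` in Scala 2.13, so mutable collections no longer match or cast to it.
```scala
import scala.collection.mutable

val buf = mutable.ArrayBuffer(1, 2, 3)

buf match {
  // Matches on 2.12 (Seq = scala.collection.Seq) but NOT on 2.13,
  // where Seq = scala.collection.immutable.Seq.
  case s: Seq[_] => println(s"plain Seq: $s")
  // The cross-version supertype matches on both; calling `buf.toSeq`
  // (the other fix in this PR) converts to an immutable Seq on 2.13.
  case s: scala.collection.Seq[_] => println(s"s.c.Seq: $s")
}
```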
### Why are the changes needed?
We need to support a Scala 2.13 build.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- Scala 2.12: Pass the Jenkins or GitHub Action
- Scala 2.13: Pass 2.13 Build GitHub Action and do the following:
```
dev/change-scala-version.sh 2.13
mvn clean install -DskipTests -pl external/avro -Pscala-2.13 -am
mvn clean test -pl external/avro -Pscala-2.13
```
**Before**
```
Tests: succeeded 197, failed 14, canceled 0, ignored 2, pending 0
*** 14 TESTS FAILED ***
```
**After**
```
Tests: succeeded 211, failed 0, canceled 0, ignored 2, pending 0
All tests passed.
```
Closes #29801 from LuciferYang/fix-external-avro-213.
Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Found via the discussion at https://github.com/apache/spark/pull/29746#issuecomment-694726504;
the root cause is that hive-1.2 does not recognize NULL:
```
sbt.ForkMain$ForkError: java.sql.SQLException: Unrecognized column type: NULL
at org.apache.hive.jdbc.JdbcColumn.typeStringToHiveType(JdbcColumn.java:160)
at org.apache.hive.jdbc.HiveResultSetMetaData.getHiveType(HiveResultSetMetaData.java:48)
at org.apache.hive.jdbc.HiveResultSetMetaData.getPrecision(HiveResultSetMetaData.java:86)
at org.apache.spark.sql.hive.thriftserver.SparkThriftServerProtocolVersionsSuite.$anonfun$new$35(SparkThriftServerProtocolVersionsSuite.scala:358)
at org.apache.spark.sql.hive.thriftserver.SparkThriftServerProtocolVersionsSuite.$anonfun$new$35$adapted(SparkThriftServerProtocolVersionsSuite.scala:351)
at org.apache.spark.sql.hive.thriftserver.SparkThriftServerProtocolVersionsSuite.testExecuteStatementWithProtocolVersion(SparkThriftServerProtocolVersionsSuite.scala:66)
at org.apache.spark.sql.hive.thriftserver.SparkThriftServerProtocolVersionsSuite.$anonfun$new$34(SparkThriftServerProtocolVersionsSuite.scala:351)
at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.java:23)
at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
at org.scalatest.Transformer.apply(Transformer.scala:22)
at org.scalatest.Transformer.apply(Transformer.scala:20)
at org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:189)
at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:176)
at org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:187)
at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:199)
at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
at org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:199)
at org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:181)
at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:61)
at org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:234)
at org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:227)
at org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:61)
at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:232)
at org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413)
at scala.collection.immutable.List.foreach(List.scala:392)
at org.scalatest.SuperEngine.traverseSubNodes$1(Engine.scala:401)
at org.scalatest.SuperEngine.runTestsInBranch(Engine.scala:396)
at org.scalatest.SuperEngine.runTestsImpl(Engine.scala:475)
at org.scalatest.funsuite.AnyFunSuiteLike.runTests(AnyFunSuiteLike.scala:232)
at org.scalatest.funsuite.AnyFunSuiteLike.runTests$(AnyFunSuiteLike.scala:231)
at org.scalatest.funsuite.AnyFunSuite.runTests(AnyFunSuite.scala:1562)
at org.scalatest.Suite.run(Suite.scala:1112)
at org.scalatest.Suite.run$(Suite.scala:1094)
at org.scalatest.funsuite.AnyFunSuite.org$scalatest$funsuite$AnyFunSuiteLike$$super$run(AnyFunSuite.scala:1562)
at org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$run$1(AnyFunSuiteLike.scala:236)
at org.scalatest.SuperEngine.runImpl(Engine.scala:535)
at org.scalatest.funsuite.AnyFunSuiteLike.run(AnyFunSuiteLike.scala:236)
at org.scalatest.funsuite.AnyFunSuiteLike.run$(AnyFunSuiteLike.scala:235)
at org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterAll$$super$run(SparkFunSuite.scala:61)
at org.scalatest.BeforeAndAfterAll.liftedTree1$1(BeforeAndAfterAll.scala:213)
at org.scalatest.BeforeAndAfterAll.run(BeforeAndAfterAll.scala:210)
at org.scalatest.BeforeAndAfterAll.run$(BeforeAndAfterAll.scala:208)
at org.apache.spark.SparkFunSuite.run(SparkFunSuite.scala:61)
at org.scalatest.tools.Framework.org$scalatest$tools$Framework$$runSuite(Framework.scala:318)
at org.scalatest.tools.Framework$ScalaTestTask.execute(Framework.scala:513)
at sbt.ForkMain$Run$2.call(ForkMain.java:296)
at sbt.ForkMain$Run$2.call(ForkMain.java:286)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
```
In this PR, we simply ignore these checks for Hive 1.2.
### Why are the changes needed?
To fix Jenkins.
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
The test itself.
Closes #29803 from yaooqinn/SPARK-32874-F.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR changes `UnsafeExternalSorter.SpillableIterator` to free its memory (except for the page holding the last record) if it is forced to spill after all of its records have been read. It also makes sure that `lastPage` is freed if `loadNext` is never called again. The latter was necessary to get my test case to succeed (otherwise it would complain about a leak).
### Why are the changes needed?
No memory is freed after calling `UnsafeExternalSorter.SpillableIterator.spill()` when all records have been read, even though it is still holding onto some memory. This may cause a `SparkOutOfMemoryError` to be thrown, even though we could have just freed the memory instead.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
A test was added to `UnsafeExternalSorterSuite`.
Closes #29787 from tomvanbussel/SPARK-32911.
Authored-by: Tom van Bussel <tom.vanbussel@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR aims to replace deprecated `isFile` and `isDirectory` methods.
```diff
- fs.isDirectory(hadoopPath)
+ fs.getFileStatus(hadoopPath).isDirectory
```
```diff
- fs.isFile(new Path(inProgressLog))
+ fs.getFileStatus(new Path(inProgressLog)).isFile
```
### Why are the changes needed?
It shows deprecation warnings.
- https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-sbt-hadoop-3.2-hive-2.3/1244/consoleFull
```
[warn] /home/jenkins/workspace/spark-master-test-sbt-hadoop-3.2-hive-2.3/core/src/main/scala/org/apache/spark/deploy/history/FsHistoryProvider.scala:815: method isFile in class FileSystem is deprecated: see corresponding Javadoc for more information.
[warn] if (!fs.isFile(new Path(inProgressLog))) {
```
```
[warn] /home/jenkins/workspace/spark-master-test-sbt-hadoop-3.2-hive-2.3/core/src/main/scala/org/apache/spark/SparkContext.scala:1884: method isDirectory in class FileSystem is deprecated: see corresponding Javadoc for more information.
[warn] if (fs.isDirectory(hadoopPath)) {
```
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass the Jenkins.
Closes #29796 from williamhyun/filesystem.
Authored-by: William Hyun <williamhyun3@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
With a long-running application in kerberized mode, the AMEndpoint handles the `UpdateDelegationTokens` message incorrectly; it is a OneWayMessage that should be handled in the `receive` function (see the sketch after the log below).
```
20-09-15 18:53:01 INFO yarn.YarnAllocator: Received 22 containers from YARN, launching executors on 0 of them.
20-09-16 12:52:28 ERROR netty.Inbox: Ignoring error
org.apache.spark.SparkException: NettyRpcEndpointRef(spark-client://YarnAM) does not implement 'receive'
at org.apache.spark.rpc.RpcEndpoint$$anonfun$receive$1.applyOrElse(RpcEndpoint.scala:70)
at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:203)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
20-09-17 06:52:28 ERROR netty.Inbox: Ignoring error
org.apache.spark.SparkException: NettyRpcEndpointRef(spark-client://YarnAM) does not implement 'receive'
at org.apache.spark.rpc.RpcEndpoint$$anonfun$receive$1.applyOrElse(RpcEndpoint.scala:70)
at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:203)
at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
```
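A toy model of the dispatch rule that bit this PR (not Spark's real `RpcEndpoint` classes): one-way messages sent with `send` must be covered by `receive`, otherwise the endpoint raises the "does not implement 'receive'" error seen above.
```scala
case class UpdateDelegationTokens(tokens: Array[Byte])

trait Endpoint {
  def receive: PartialFunction[Any, Unit] = PartialFunction.empty
  def send(msg: Any): Unit =
    receive.applyOrElse(msg, (m: Any) =>
      sys.error(s"does not implement 'receive' for $m"))
}

// The fix, in essence: handle the one-way message in `receive` instead
// of only in the ask/reply path.
object FixedAM extends Endpoint {
  override def receive: PartialFunction[Any, Unit] = {
    case UpdateDelegationTokens(t) => println(s"updating ${t.length} token bytes")
  }
}

FixedAM.send(UpdateDelegationTokens(Array[Byte](1, 2, 3))) // handled, no error
```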
### Why are the changes needed?
Bug fix: without a proper token refresher, long-running apps may eventually fail in a kerberized cluster.
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
Passing Jenkins, and verified manually.
I am running the sub-module `kyuubi-spark-sql-engine` of https://github.com/yaooqinn/kyuubi
The simplest way to reproduce the bug and verify this fix is to follow these steps
#### 1. Build the `kyuubi-spark-sql-engine` module
```
mvn clean package -pl :kyuubi-spark-sql-engine
```
#### 2. Configure Spark with Kerberos settings for your secured cluster
#### 3. Start it in the background
```
nohup bin/spark-submit --class org.apache.kyuubi.engine.spark.SparkSQLEngine ../kyuubi-spark-sql-engine-1.0.0-SNAPSHOT.jar > kyuubi.log &
```
#### 4. Check the AM log and see
- "Updating delegation tokens ..." for SUCCESS
- "Inbox: Ignoring error ...... does not implement 'receive'" for FAILURE
Closes #29777 from yaooqinn/SPARK-32905.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
1. Change the target error calculation according to the paper [Space-Efficient Online Computation of Quantile Summaries](http://infolab.stanford.edu/~datar/courses/cs361a/papers/quantiles.pdf). It says that the error is `e = max(gi, deltai)/2` (see page 59). There is also a clear explanation at [ε-approximate quantiles](http://www.mathcs.emory.edu/~cheung/Courses/584/Syllabus/08-Quantile/Greenwald.html#proofprop1).
2. Added a test to check different accuracies.
3. Added an input CSV file `percentile_approx-input.csv.bz2` to the resource folder `sql/catalyst/src/main/resources` for the test.
### Why are the changes needed?
To fix incorrect percentile calculation, see an example in SPARK-32908.
### Does this PR introduce _any_ user-facing change?
Yes
### How was this patch tested?
- By running existing tests in `QuantileSummariesSuite` and in `ApproximatePercentileQuerySuite`.
- Added new test `SPARK-32908: maximum target error in percentile_approx` to `ApproximatePercentileQuerySuite`.
Closes #29784 from MaxGekk/fix-percentile_approx-2.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
1. Simplify the aggregation by getting `count` via `summary.count`.
2. Ignore NaN values like the old implementation:
```
val relativeError = 0.05
val approxQuantile = numNearestNeighbors.toDouble / count + relativeError
val modelDatasetWithDist = modelDataset.withColumn(distCol, hashDistCol)
if (approxQuantile >= 1) {
modelDatasetWithDist
} else {
val hashThreshold = modelDatasetWithDist.stat
.approxQuantile(distCol, Array(approxQuantile), relativeError)
// Filter the dataset where the hash value is less than the threshold.
modelDatasetWithDist.filter(hashDistCol <= hashThreshold(0))
}
```
### Why are the changes needed?
simplify the aggregation
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing test suites.
Closes #29778 from zhengruifeng/lsh_nit.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
### What changes were proposed in this pull request?
Recently, we added code to log plan changes in the preparation phase in `QueryExecution` for execution (https://github.com/apache/spark/pull/29544). This PR intends to apply the same fix for logging plan changes in AQE.
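A usage sketch, assuming the plan-change-log level conf introduced by the earlier work (`spark.sql.planChangeLog.level` in recent masters; it was `spark.sql.optimizer.planChangeLog.level` in Spark 3.0) also governs the AQE re-optimization path after this change:
```scala
// Log plan rewrites at WARN so they show up under the default log level.
spark.conf.set("spark.sql.planChangeLog.level", "WARN")
spark.conf.set("spark.sql.adaptive.enabled", "true")
// A shuffle-producing query: AQE re-optimizes at runtime, and the plan
// changes it makes should now be logged like the preparation-phase ones.
spark.range(1000).groupBy("id").count().collect()
```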
### Why are the changes needed?
Easy debugging for AQE plans
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Added unit tests.
Closes #29774 from maropu/PlanChangeLogForAQE.
Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
This PR rewrites `FoldablePropagation` rule to replace attribute references in a node with foldables coming only from the node's children.
Before this PR, in the case of this example (with the setting `spark.sql.optimizer.excludedRules=org.apache.spark.sql.catalyst.optimizer.ConvertToLocalRelation`):
```scala
val a = Seq("1").toDF("col1").withColumn("col2", lit("1"))
val b = Seq("2").toDF("col1").withColumn("col2", lit("2"))
val aub = a.union(b)
val c = aub.filter($"col1" === "2").cache()
val d = Seq("2").toDF( "col4")
val r = d.join(aub, $"col2" === $"col4").select("col4")
val l = c.select("col2")
val df = l.join(r, $"col2" === $"col4", "LeftOuter")
df.show()
```
foldable propagation happens incorrectly:
```
Join LeftOuter, (col2#6 = col4#34) Join LeftOuter, (col2#6 = col4#34)
!:- Project [col2#6] :- Project [1 AS col2#6]
: +- InMemoryRelation [col1#4, col2#6], StorageLevel(disk, memory, deserialized, 1 replicas) : +- InMemoryRelation [col1#4, col2#6], StorageLevel(disk, memory, deserialized, 1 replicas)
: +- Union : +- Union
: :- *(1) Project [value#1 AS col1#4, 1 AS col2#6] : :- *(1) Project [value#1 AS col1#4, 1 AS col2#6]
: : +- *(1) Filter (isnotnull(value#1) AND (value#1 = 2)) : : +- *(1) Filter (isnotnull(value#1) AND (value#1 = 2))
: : +- *(1) LocalTableScan [value#1] : : +- *(1) LocalTableScan [value#1]
: +- *(2) Project [value#10 AS col1#13, 2 AS col2#15] : +- *(2) Project [value#10 AS col1#13, 2 AS col2#15]
: +- *(2) Filter (isnotnull(value#10) AND (value#10 = 2)) : +- *(2) Filter (isnotnull(value#10) AND (value#10 = 2))
: +- *(2) LocalTableScan [value#10] : +- *(2) LocalTableScan [value#10]
+- Project [col4#34] +- Project [col4#34]
+- Join Inner, (col2#6 = col4#34) +- Join Inner, (col2#6 = col4#34)
:- Project [value#31 AS col4#34] :- Project [value#31 AS col4#34]
: +- LocalRelation [value#31] : +- LocalRelation [value#31]
+- Project [col2#6] +- Project [col2#6]
+- Union false, false +- Union false, false
:- Project [1 AS col2#6] :- Project [1 AS col2#6]
: +- LocalRelation [value#1] : +- LocalRelation [value#1]
+- Project [2 AS col2#15] +- Project [2 AS col2#15]
+- LocalRelation [value#10] +- LocalRelation [value#10]
```
and so the result is wrong:
```
+----+----+
|col2|col4|
+----+----+
| 1|null|
+----+----+
```
After this PR foldable propagation will not happen incorrectly and the result is correct:
```
+----+----+
|col2|col4|
+----+----+
| 2| 2|
+----+----+
```
### Why are the changes needed?
To fix a correctness issue.
### Does this PR introduce _any_ user-facing change?
Yes, fixes a correctness issue.
### How was this patch tested?
Existing and new UTs.
Closes #29771 from peter-toth/SPARK-32635-fix-foldable-propagation.
Authored-by: Peter Toth <peter.toth@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
Make ORC table column names support special characters like `$`.
### Why are the changes needed?
Special characters like `$` are allowed in ORC table column names by Hive, but executing the command "CREATE TABLE tbl(`$` INT, b INT) USING orc" in Spark raises an error. This is not compatible with Hive:
```
org.apache.spark.sql.AnalysisException: Column name "$" contains invalid character(s). Please use alias to rename it.;
  at org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$.checkFieldName(OrcFileFormat.scala:51)
  at org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$.$anonfun$checkFieldNames$1(OrcFileFormat.scala:59)
  at org.apache.spark.sql.execution.datasources.orc.OrcFileFormat$.$anonfun$checkFieldNames$1$adapted(OrcFileFormat.scala:59)
  at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
  at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
  at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:38)
```
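With the fix, the command from the description is expected to work; a small usage sketch (assuming an active `spark` session):
```scala
// Column names containing special characters like `$` now pass the
// ORC field-name check, matching Hive's behavior.
spark.sql("CREATE TABLE tbl(`$` INT, b INT) USING orc")
spark.sql("INSERT INTO tbl VALUES (1, 2)")
spark.sql("SELECT `$`, b FROM tbl").show()
```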
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Add unit test
Closes #29761 from jzc928/orcColSpecialChar.
Authored-by: jzc <jzc@jzcMacBookPro.local>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR fixes the failed and aborted cases in the sql/hive-thriftserver module in Scala 2.13. The main changes of this PR are as follows:
- Use `s.c.Seq` instead of `Seq` in `HiveResult` because the input type may be `mutable.ArraySeq`, but `Seq` represents `immutable.Seq` in Scala 2.13.
- Reset the class loader after `HiveMetastoreLazyInitializationSuite` completes, because the context class loader is a `NonClosableMutableURLClassLoader` while `HiveMetastoreLazyInitializationSuite` runs, and it propagates to `HiveThriftServer2ListenerSuite`, triggering the following problem in Scala 2.13:
```
HiveThriftServer2ListenerSuite:
*** RUN ABORTED ***
java.lang.LinkageError: loader constraint violation: loader (instance of net/bytebuddy/dynamic/loading/MultipleParentClassLoader) previously initiated loading for a different type with name "org/apache/hive/service/ServiceStateChangeListener"
at org.mockito.codegen.HiveThriftServer2$MockitoMock$1850222569.<clinit>(Unknown Source)
at sun.reflect.GeneratedSerializationConstructorAccessor530.newInstance(Unknown Source)
at java.lang.reflect.Constructor.newInstance(Constructor.java:423)
at org.objenesis.instantiator.sun.SunReflectionFactoryInstantiator.newInstance(SunReflectionFactoryInstantiator.java:48)
at org.objenesis.ObjenesisBase.newInstance(ObjenesisBase.java:73)
at org.mockito.internal.creation.instance.ObjenesisInstantiator.newInstance(ObjenesisInstantiator.java:19)
at org.mockito.internal.creation.bytebuddy.SubclassByteBuddyMockMaker.createMock(SubclassByteBuddyMockMaker.java:47)
at org.mockito.internal.creation.bytebuddy.ByteBuddyMockMaker.createMock(ByteBuddyMockMaker.java:25)
at org.mockito.internal.util.MockUtil.createMock(MockUtil.java:35)
at org.mockito.internal.MockitoCore.mock(MockitoCore.java:63)
...
```
After this PR, `HiveThriftServer2Suites` and `HiveThriftServer2ListenerSuite` are fixed and all 461 tests pass.
### Why are the changes needed?
We need to support a Scala 2.13 build.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- Scala 2.12: Pass the Jenkins or GitHub Action
- Scala 2.13: All tests passed.
Do the following:
```
dev/change-scala-version.sh 2.13
mvn clean install -DskipTests -pl sql/hive-thriftserver -am -Phive-thriftserver -Pscala-2.13
mvn test -pl sql/hive-thriftserver -Phive -Phive-thriftserver -Pscala-2.13
```
**Before**
```
HiveThriftServer2ListenerSuite:
*** RUN ABORTED ***
```
**After**
```
Tests: succeeded 461, failed 0, canceled 0, ignored 17, pending 0
All tests passed.
```
Closes #29783 from LuciferYang/sql-thriftserver-tests.
Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR aims to add Scala 2.13 build test coverage to GitHub Actions for Apache Spark 3.1.0.
### Why are the changes needed?
The branch is ready for Scala 2.13 and this will prevent any regression.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Pass the GitHub Action.
Closes #29793 from dongjoon-hyun/SPARK-32926.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Correct the typo in the Show Table document.
### Why are the changes needed?
The syntax currently shown in the Show Table document results in a parse error, so it is misleading to users.
### Does this PR introduce _any_ user-facing change?
Yes, the Show Table document is corrected now.
### How was this patch tested?
NA
Closes #29758 from Udbhav30/showtable.
Authored-by: Udbhav30 <u.agrawal30@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This is a follow-up on #29565, and addresses a few issues in the last PR:
- style issue pointed out by [this comment](https://github.com/apache/spark/pull/29565#discussion_r487646749)
- skip the optimization when `fromExp` is foldable (per [this comment](https://github.com/apache/spark/pull/29565#discussion_r487646973)), as there could be a more efficient rule to apply in this case.
- pass timezone info to the generated cast on the literal value
- a bunch of cleanups and test improvements
Originally I planned to handle this when implementing [SPARK-32858](https://issues.apache.org/jira/browse/SPARK-32858), but now I think it's better to isolate these changes from that.
### Why are the changes needed?
To fix a few leftover issues in the above PR.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Added a test for the foldable case. Otherwise relying on existing tests.
Closes #29775 from sunchao/SPARK-24994-followup.
Authored-by: Chao Sun <sunchao@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
To fix the flaky `ExecutorAllocationManagerSuite`: avoid the first `schedule()` invocation after `ExecutorAllocationManager` starts.
### Why are the changes needed?
`ExecutorAllocationManagerSuite` is still flaky, see:
https://github.com/apache/spark/pull/29722/checks?check_run_id=1117979237
By checking the logs below, we can see that there's a race condition between thread `pool-1-thread-1-ScalaTest-running` and thread `spark-dynamic-executor-allocation`. The only way thread `spark-dynamic-executor-allocation` can become active is the first invocation of `schedule()` (since `TEST_SCHEDULE_INTERVAL` (30s) is really long, it's impossible for the second invocation to happen). Thus, I think we should avoid the first invocation too.
```
20/09/15 12:41:20.831 pool-1-thread-1-ScalaTest-running-ExecutorAllocationManagerSuite INFO ExecutorAllocationManager: Requesting 1 new executor because tasks are backlogged (new desired total will be 2 for resource profile id: 0)
20/09/15 12:41:20.832 spark-dynamic-executor-allocation INFO ExecutorAllocationManager: Requesting 2 new executors because tasks are backlogged (new desired total will be 4 for resource profile id: 0)
```
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
The flakiness can't be reproduced locally, so it's hard to say whether it has been completely fixed by now. We need time to see the result.
Closes #29773 from Ngone51/fix-SPARK-32287.
Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>