### What changes were proposed in this pull request?
As per the title. Fixed two places where the documentation for the window operator contained errors.
### Why are the changes needed?
Helps people read the window operator code more easily in the future.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes#32585 from c21/minor-doc.
Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
The PR is proposed for **pandas APIs on Spark**, in order to separate the arithmetic operations shown below into data-type-based structures.
`__add__, __sub__, __mul__, __truediv__, __floordiv__, __pow__, __mod__,
__radd__, __rsub__, __rmul__, __rtruediv__, __rfloordiv__, __rpow__, __rmod__`
DataTypeOps and subclasses are introduced.
The existing behaviors of each arithmetic operation should be preserved.
### Why are the changes needed?
Currently, the same arithmetic operation for all data types is defined in one function, so it is difficult to extend or change the behavior based on data types.
Introducing DataTypeOps would be the foundation for [pandas APIs on Spark: Separate basic operations into data type based structures.](https://docs.google.com/document/d/12MS6xK0hETYmrcl5b9pX5lgV4FmGVfpmcSKq--_oQlc/edit?usp=sharing).
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Tests are introduced under pyspark.pandas.tests.data_type_ops. One test file per DataTypeOps class.
Closes#32469 from xinrong-databricks/datatypeop_arith.
Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
### What changes were proposed in this pull request?
AQE has an optimization where it attempts to reuse compatible exchanges, but it does not take into account whether the exchanges are columnar or not, resulting in incorrect reuse under some circumstances.
This PR simply changes the key used to lookup cached stages. It now uses the canonicalized form of the new query stage (potentially created by a plugin) rather than using the canonicalized form of the original exchange.
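A rough sketch of the keying idea, with hypothetical types and names (not the actual AQE stage-cache code): the cache key is derived from the stage that was actually created, which a plugin may have made columnar, rather than from the original exchange.
```scala
import scala.collection.mutable

// Hypothetical sketch: keying by the canonicalized plan of the newly created stage
// means a columnar stage can never be substituted for an incompatible row-based one.
case class QueryStage(canonicalizedPlan: String, supportsColumnar: Boolean)

val stageCache = mutable.Map.empty[String, QueryStage]

def reuseOrRegister(newStage: QueryStage): QueryStage =
  stageCache.getOrElseUpdate(newStage.canonicalizedPlan, newStage)
```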
### Why are the changes needed?
When using the [RAPIDS Accelerator for Apache Spark](https://github.com/NVIDIA/spark-rapids) we sometimes see a new query stage correctly create a row-based exchange and then Spark replaces it with a cached columnar exchange, which is not compatible, and this causes queries to fail.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
The patch has been tested with the query that highlighted this issue. I looked at writing unit tests for this but it would involve implementing a mock columnar exchange in the tests so would be quite a bit of work. If anyone has ideas on other ways to test this I am happy to hear them.
Closes#32195 from andygrove/SPARK-35093.
Authored-by: Andy Grove <andygrove73@gmail.com>
Signed-off-by: Thomas Graves <tgraves@apache.org>
### What changes were proposed in this pull request?
Updating column stats for Union operator stats estimation
### Why are the changes needed?
This is a follow-up PR to also update the null count in the Union operator stats estimation. https://github.com/apache/spark/pull/30334
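As an illustration only (hypothetical types, not the actual Catalyst estimation code), the null count of a Union output column is the sum of the null counts of the corresponding child columns, and unknown if any child lacks statistics:
```scala
// Hypothetical sketch of null-count propagation through Union.
case class ColumnStat(nullCount: Option[BigInt])

def unionNullCount(childColumnStats: Seq[ColumnStat]): Option[BigInt] =
  childColumnStats.foldLeft(Option(BigInt(0))) {
    case (Some(acc), ColumnStat(Some(n))) => Some(acc + n)
    case _                                => None // unknown if any child stat is missing
  }
```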
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Updated UTs, manual testing
Closes#32494 from shahidki31/shahid/updateNullCountForUnion.
Lead-authored-by: shahid <shahidki31@gmail.com>
Co-authored-by: Shahid <shahidki31@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
This PR adds `sentences`, a string function, which is present as of `2.0.0` but missing in `functions.{scala,py}`.
### Why are the changes needed?
This function can only be used from SQL for now.
It would be good to be able to use it from Scala/Python code as well as SQL.
### Does this PR introduce _any_ user-facing change?
Yes. Users can use this function from Scala and Python.
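For example, a usage sketch in Scala (it assumes a SparkSession named `spark` is in scope; the literal language/country arguments are illustrative):
```scala
import org.apache.spark.sql.functions.{lit, sentences}
import spark.implicits._

val df = Seq("Hi there! How are you today?").toDF("text")
// Splits the text into an array of sentences, each of which is an array of words.
df.select(sentences($"text", lit("en"), lit("US"))).show(truncate = false)
```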
### How was this patch tested?
New test.
Closes#32566 from sarutak/sentences-function.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
### What changes were proposed in this pull request?
Update histogram statistics for RANGE operator stats estimation
### Why are the changes needed?
If histogram optimization is enabled, these statistics can be used in various cost-based optimizations.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Added UTs. Manual test.
Closes#32498 from shahidki31/shahid/histogram.
Lead-authored-by: shahid <shahidki31@gmail.com>
Co-authored-by: Shahid <shahidki31@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
1. In HadoopMapReduceCommitProtocol, create the parent directory before renaming custom partition path staging files
2. In InMemoryCatalog and HiveExternalCatalog, create the new partition directory before renaming the old partition path
3. Check the return value of FileSystem#rename; if it is false, throw an exception to avoid silent data loss caused by a rename failure
4. Change DebugFilesystem#rename behavior to match HDFS's behavior (return false without renaming when the destination's parent directory does not exist)
### Why are the changes needed?
Depending on the FileSystem#rename implementation, when the destination directory does not exist, the file system may
1. return false without renaming the file or throwing an exception (e.g. HDFS), or
2. create the destination directory, rename the files, and return true (e.g. LocalFileSystem)
In the first case above, renames in HadoopMapReduceCommitProtocol for custom partition paths will fail silently if the destination partition path does not exist. Failed renames can happen when
1. dynamicPartitionOverwrite == true and the custom partition path directories are deleted by the job before the rename; or
2. the custom partition path directories do not exist before the job; or
3. something else goes wrong when the file system handles `rename`
The renames in InMemoryCatalog and HiveExternalCatalog for partition renaming have a similar issue.
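A minimal sketch of the defensive pattern described above (the helper name is illustrative, not the actual Spark code):
```scala
import java.io.IOException
import org.apache.hadoop.fs.{FileSystem, Path}

// Ensure the destination's parent exists, and never ignore a false return from rename.
def renameOrThrow(fs: FileSystem, src: Path, dst: Path): Unit = {
  val parent = dst.getParent
  if (parent != null && !fs.exists(parent)) {
    fs.mkdirs(parent)
  }
  if (!fs.rename(src, dst)) {
    throw new IOException(s"Failed to rename $src to $dst")
  }
}
```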
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Modified DebugFilesystem#rename, and added new unit tests.
Without the fix in the source code, five InsertSuite tests and one AlterTableRenamePartitionSuite test failed:
InsertSuite.SPARK-20236: dynamic partition overwrite with custom partition path (existing test with modified FS)
```
== Results ==
!== Correct Answer - 1 == == Spark Answer - 0 ==
struct<> struct<>
![2,1,1]
```
InsertSuite.SPARK-35106: insert overwrite with custom partition path
```
== Results ==
!== Correct Answer - 1 == == Spark Answer - 0 ==
struct<> struct<>
![2,1,1]
```
InsertSuite.SPARK-35106: dynamic partition overwrite with custom partition path
```
== Results ==
!== Correct Answer - 2 == == Spark Answer - 1 ==
!struct<> struct<i:int,part1:int,part2:int>
[1,1,1] [1,1,1]
![1,1,2]
```
InsertSuite.SPARK-35106: Throw exception when rename custom partition paths returns false
```
Expected exception org.apache.spark.SparkException to be thrown, but no exception was thrown
```
InsertSuite.SPARK-35106: Throw exception when rename dynamic partition paths returns false
```
Expected exception org.apache.spark.SparkException to be thrown, but no exception was thrown
```
AlterTableRenamePartitionSuite.ALTER TABLE .. RENAME PARTITION V1: multi part partition (existing test with modified FS)
```
== Results ==
!== Correct Answer - 1 == == Spark Answer - 0 ==
struct<> struct<>
![3,123,3]
```
Closes#32530 from YuzhouSun/SPARK-35106.
Authored-by: Yuzhou Sun <yuzhosun@amazon.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR fixes an issue that streaming queries with V2Relation can have redundant `ProjectExec` in its physical plan.
You can easily reproduce this issue with the following code.
```
import org.apache.spark.sql.streaming.Trigger
val query = spark.
readStream.
format("rate").
option("rowsPerSecond", 1000).
option("rampUpTime", "10s").
load().
selectExpr("timestamp", "100", "value").
writeStream.
format("console").
trigger(Trigger.ProcessingTime("5 seconds")).
// trigger(Trigger.Continuous("5 seconds")). // You can reproduce with continuous processing too.
outputMode("append").
start()
```
The plan tree is here.
![ss-before](https://user-images.githubusercontent.com/4736016/118454996-ec439800-b733-11eb-8cd8-ed8af73a91b8.png)
### Why are the changes needed?
For better performance.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
I ran the same code above and got the following plan tree.
![ss-after](https://user-images.githubusercontent.com/4736016/118455755-1bf2a000-b734-11eb-999e-4b8c19ad34d7.png)
Closes#32570 from sarutak/fix-redundant-projectexec.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
SPARK-35253 upgraded Janino from 3.0.16 to 3.1.4. In this version, `ClassBodyEvaluator` provides the `getBytecodes` method to get the mapping from `ClassFile#getThisClassName` to `ClassFile#toByteArray` directly, so we no longer need to obtain this variable via the reflection API.
The main purpose of this PR is therefore to simplify the way `bytecodes` is obtained from `ClassBodyEvaluator` in the `CodeGenerator#updateAndGetCompilationStats` method.
### Why are the changes needed?
Code simplification.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
- Pass the Jenkins or GitHub Action
- Manual test:
1. Define a code fragment to be tested, for example:
```
val codeBody = s"""
public java.lang.Object generate(Object[] references) {
return new TestMetricCode(references);
}
class TestMetricCode {
public TestMetricCode(Object[] references) {
}
public long sumOfSquares(long left, long right) {
return left * left + right * right;
}
}
"""
```
2. Create a `ClassBodyEvaluator` and `cook` the `codeBody` above; the process of creating the `ClassBodyEvaluator` can be extracted from the `CodeGenerator#doCompile` method.
3. Get `bytecodes` using the `ClassBodyEvaluator#getBytecodes` API (after this PR) and the reflection API (before this PR) respectively, then assert that they are the same. If the `bytecodes` have not changed, we can be sure that the metrics state will not change. Example test code:
```
import scala.collection.JavaConverters._
val bytecodesFromApi = evaluator.getBytecodes.asScala
val bytecodesFromReflectionApi = {
val scField = classOf[ClassBodyEvaluator].getDeclaredField("sc")
scField.setAccessible(true)
val compiler = scField.get(evaluator).asInstanceOf[SimpleCompiler]
val loader = compiler.getClassLoader.asInstanceOf[ByteArrayClassLoader]
val classesField = loader.getClass.getDeclaredField("classes")
classesField.setAccessible(true)
classesField.get(loader).asInstanceOf[java.util.Map[String, Array[Byte]]].asScala
}
assert(bytecodesFromApi == bytecodesFromReflectionApi)
```
Closes#32536 from LuciferYang/SPARK-35253-FOLLOWUP.
Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
Introduce new shared methods to `ShuffleBlockFetcherIteratorSuite` to replace copy-pasted code. Use modern, Scala-like Mockito `Answer` syntax.
### Why are the changes needed?
`ShuffleFetcherBlockIteratorSuite` has tons of duplicate code, like 0494dc90af/core/src/test/scala/org/apache/spark/storage/ShuffleBlockFetcherIteratorSuite.scala (L172-L185) . It's challenging to tell what the interesting parts are vs. what is just being set to some default/unused value.
Similarly, though not as bad, there are many calls like the following:
```
verify(transfer, times(1)).fetchBlocks(any(), any(), any(), any(), any(), any())
when(transfer.fetchBlocks(any(), any(), any(), any(), any(), any())).thenAnswer ...
```
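For reference, a sketch of the Scala-like Mockito `Answer` style referred to above, using a simplified interface rather than the real block transfer service:
```scala
import org.mockito.ArgumentMatchers.any
import org.mockito.Mockito.{mock, when}
import org.mockito.invocation.InvocationOnMock

// Simplified stand-in for the real block transfer service interface.
trait Transfer { def fetchBlocks(host: String, execId: String): String }

val transfer = mock(classOf[Transfer])
// A Scala lambda can serve as Mockito's Answer, avoiding a verbose anonymous class.
when(transfer.fetchBlocks(any[String](), any[String]()))
  .thenAnswer((invocation: InvocationOnMock) => {
    val host = invocation.getArguments()(0).asInstanceOf[String]
    s"fetched blocks from $host"
  })
```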
These changes result in about 10% reduction in both lines and characters in the file:
```bash
# Before
> wc core/src/test/scala/org/apache/spark/storage/ShuffleBlockFetcherIteratorSuite.scala
1063 3950 43201 core/src/test/scala/org/apache/spark/storage/ShuffleBlockFetcherIteratorSuite.scala
# After
> wc core/src/test/scala/org/apache/spark/storage/ShuffleBlockFetcherIteratorSuite.scala
928 3609 39053 core/src/test/scala/org/apache/spark/storage/ShuffleBlockFetcherIteratorSuite.scala
```
It also helps readability, e.g.:
```
val iterator = createShuffleBlockIteratorWithDefaults(
transfer,
blocksByAddress,
maxBytesInFlight = 1000L
)
```
Now I can clearly tell that `maxBytesInFlight` is the main parameter we're interested in here.
### Does this PR introduce _any_ user-facing change?
No, test only. There aren't even any behavior changes, just refactoring.
### How was this patch tested?
Unit tests pass.
Closes#32389 from xkrogen/xkrogen-spark-35263-refactor-shuffleblockfetcheriteratorsuite.
Authored-by: Erik Krogen <xkrogen@apache.org>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
### What changes were proposed in this pull request?
This PR upgrades the scalatestplus artifacts and scalacheck.
### Why are the changes needed?
The scalatestplus artifacts Spark uses are two years old, and these artifacts have since been renamed.
So, let's follow up.
Also, the latest releases seem to support Scala 3.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
GA passed on my repository.
Closes#32581 from sarutak/upgrade-scalatestplus.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Write out Seq of product objects which contain TreeNode, to avoid the cases described in https://issues.apache.org/jira/browse/SPARK-35411 where essential information is ignored and just written out as null values. This information is necessary to understand the query plans.
### Why are the changes needed?
Information like cteRelations in the With node and branches in the CaseWhen expression is necessary to understand the query plans, so it should be written out to the resulting JSON string.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
UT case added.
Closes#32557 from ivoson/plan-json-fix.
Authored-by: Tengfei Huang <tengfei.h@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
To pass the TPCDS-related plan stability tests in scala-2.13, this PR proposes to fix two things below;
- (1) Sorts elements in the predicate `InSet` and the source filter `In` for printing their nodes.
- (2) Formats nested collection elements (`Seq`, `Array`, and `Set`) recursively in `TreeNode.argString`.
As for (1), v2.12/v2.13 seem to print `Set` elements in a different order, so we need to sort them explicitly. As for (2), the `Seq` implementation differs between v2.12/v2.13, so we need to format nested `Seq` elements correctly to hide the name of the implementation (see the example below):
```
(74) Expand [codegen id : 20]
Input [5]: [sales#41, RETURNS#42, profit#43, channel#44, id#45]
-Arguments: [ArrayBuffer(sales#41, returns#42, ... <-- scala-2.12
+Arguments: [Vector(sales#41, returns#42, ... <-- scala-2.13
+Arguments: [[(sales#41, returns#42, ... <-- the proposed fix to hide the name of its implementation
```
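A rough sketch of the recursive-formatting idea for (2), with simplified types (not the actual `TreeNode.argString` code):
```scala
// Render any nested collection with a fixed "[...]" notation so the output does not
// depend on whether the runtime type is an ArrayBuffer (2.12) or a Vector (2.13).
def formatArg(arg: Any): String = arg match {
  case a: Array[_]    => a.map(formatArg).mkString("[", ", ", "]")
  case i: Iterable[_] => i.map(formatArg).mkString("[", ", ", "]")
  case other          => String.valueOf(other)
}

formatArg(Seq(Vector(1, 2), Set("a"), 3))  // "[[1, 2], [a], 3]"
```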
### Why are the changes needed?
To pass the tests in Scala v2.13.
### Does this PR introduce _any_ user-facing change?
Yes, this fix changes query explain strings.
### How was this patch tested?
Manually checked.
Closes#32577 from maropu/FixTPCDSTestIssueInScala213.
Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
When creating `Invoke` and `StaticInvoke` for `ScalarFunction`'s magic method, set `propagateNull` to false.
### Why are the changes needed?
When `propagateNull` is true (which is the default value), `Invoke` and `StaticInvoke` return null if any of the arguments is null. For scalar functions this is incorrect, as we should leave the null-handling logic to the function implementation instead.
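A hypothetical example of why this matters (not Spark code): a magic method that deliberately maps null to a default value would never see the null if the expression layer propagated it first.
```scala
// Hypothetical ScalarFunction-style magic method that handles null itself.
object ZeroIfNull {
  def invoke(x: java.lang.Long): Long = if (x == null) 0L else x.longValue()
}

// With propagateNull = true, the expression layer would short-circuit to null and
// this logic would never run; with propagateNull = false the function decides.
ZeroIfNull.invoke(null)  // 0L
```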
### Does this PR introduce _any_ user-facing change?
Yes. Null arguments are now properly handled by the magic method.
### How was this patch tested?
Added new tests.
Closes#32553 from sunchao/SPARK-35389.
Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
The following two things are done in this PR.
* Add a note about Jinja2 as a required dependency for the documentation build.
* Add the Jinja2 dependency for the documentation build to `spark-rm/Dockerfile`
### Why are the changes needed?
SPARK-35375 (#32509) confined the version of Jinja2 to <3.0.0.
So it's good to note this in `docs/README.md` and add the dependency to `spark-rm/Dockerfile`.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
I confirmed that `make html` succeeds under `python/docs` after installing the dependencies with the following command.
```
sudo pip install 'sphinx<3.1.0' mkdocs numpy pydata_sphinx_theme ipython nbsphinx numpydoc 'jinja2<3.0.0'
```
Closes#32573 from sarutak/required-module-for-python-doc.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
### What changes were proposed in this pull request?
To pass `subquery/scalar-subquery/scalar-subquery-select.sql` (`SQLQueryTestSuite`) in Scala v2.13, this PR proposes to change the aggregate expr of a test query in the file from `collect_set(...)` to `sort_array(collect_set(...))` because `collect_set` depends on the `mutable.HashSet` implementation and elements in the set are printed in a different order in Scala v2.12/v2.13.
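For illustration (column names are hypothetical), the equivalent pattern in the DataFrame API looks like this; wrapping `collect_set` in `sort_array` makes the result order deterministic across Scala versions:
```scala
import org.apache.spark.sql.functions.{collect_set, sort_array}
import spark.implicits._  // assumes a SparkSession named `spark` is in scope

val df = Seq(("a", 3), ("a", 1), ("a", 2)).toDF("k", "v")
df.groupBy("k")
  .agg(sort_array(collect_set($"v")))
  .show()  // always [1, 2, 3], regardless of the HashSet iteration order
```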
### Why are the changes needed?
To pass the test in Scala v2.13.
### Does this PR introduce _any_ user-facing change?
No, dev-only.
### How was this patch tested?
Manually checked.
Closes#32578 from maropu/FixSQLTestIssueInScala213.
Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
https://github.com/apache/spark/pull/30309 added a configuration (disabled by default) that simplifies error messages from Python UDFs by removing the internal stack trace from Python workers:
```python
from pyspark.sql.functions import udf; spark.range(10).select(udf(lambda x: x/0)("id")).collect()
```
**Before**
```
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/.../python/pyspark/sql/dataframe.py", line 427, in show
print(self._jdf.showString(n, 20, vertical))
File "/.../python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
File "/.../python/pyspark/sql/utils.py", line 127, in deco
raise_from(converted)
File "<string>", line 3, in raise_from
pyspark.sql.utils.PythonException:
An exception was thrown from Python worker in the executor:
Traceback (most recent call last):
File "/.../python/lib/pyspark.zip/pyspark/worker.py", line 605, in main
process()
File "/.../python/lib/pyspark.zip/pyspark/worker.py", line 597, in process
serializer.dump_stream(out_iter, outfile)
File "/.../python/lib/pyspark.zip/pyspark/serializers.py", line 223, in dump_stream
self.serializer.dump_stream(self._batched(iterator), stream)
File "/.../python/lib/pyspark.zip/pyspark/serializers.py", line 141, in dump_stream
for obj in iterator:
File "/.../python/lib/pyspark.zip/pyspark/serializers.py", line 212, in _batched
for item in iterator:
File "/.../python/lib/pyspark.zip/pyspark/worker.py", line 450, in mapper
result = tuple(f(*[a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs)
File "/.../python/lib/pyspark.zip/pyspark/worker.py", line 450, in <genexpr>
result = tuple(f(*[a[o] for o in arg_offsets]) for (arg_offsets, f) in udfs)
File "/.../python/lib/pyspark.zip/pyspark/worker.py", line 90, in <lambda>
return lambda *a: f(*a)
File "/.../python/lib/pyspark.zip/pyspark/util.py", line 107, in wrapper
return f(*args, **kwargs)
File "<stdin>", line 1, in <lambda>
ZeroDivisionError: division by zero
```
**After**
```
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/.../python/pyspark/sql/dataframe.py", line 427, in show
print(self._jdf.showString(n, 20, vertical))
File "/.../python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py", line 1305, in __call__
File "/.../python/pyspark/sql/utils.py", line 127, in deco
raise_from(converted)
File "<string>", line 3, in raise_from
pyspark.sql.utils.PythonException:
An exception was thrown from Python worker in the executor:
Traceback (most recent call last):
File "<stdin>", line 1, in <lambda>
ZeroDivisionError: division by zero
```
Note that the traceback (`return f(*args, **kwargs)`) is almost always the same - I would say more than 99% of the time. For the remaining 1% of cases, we can guide developers to enable this configuration for further debugging.
In Databricks, it has been enabled for around 6 months, and I have had zero negative feedback on it.
### Why are the changes needed?
To show simplified exception messages to end users.
### Does this PR introduce _any_ user-facing change?
Yes, it will hide the internal Python worker traceback.
### How was this patch tested?
Existing test cases should cover this.
Closes#32569 from HyukjinKwon/SPARK-35419.
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Fix a test failure under Scala 2.13 by making the test `ScalarFunction` implementation `StrLenMagic` public.
### Why are the changes needed?
A few tests fail when using Scala 2.13 with an error message like the following:
```
[info] Cause: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 35, Column 121: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 35, Column 121: No applicable constructor/method found for actual parameters "org.apache.spark.unsafe.types.UTF8String"; candidates are: "public int org.apache.spark.sql.connector.DataSourceV2FunctionSuite$StrLenMagic$.invoke(org.apache.spark.unsafe.types.UTF8String)"
[info] at org.apache.spark.sql.errors.QueryExecutionErrors$.compilerError(QueryExecutionErrors.scala:387)
[info] at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1415)
[info] at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1501)
```
This seems to be caused by the fact that `StrLenMagic` is declared with `private` scope. After removing the `private` keyword the tests now pass.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
```
$ dev/change-scala-version.sh 2.13
$ build/sbt "sql/testOnly *.DataSourceV2FunctionSuite" -Pscala-2.13
```
Closes#32575 from sunchao/SPARK-34981-follow-up.
Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Fixes `mypy` errors and enables `mypy` check for pandas-on-Spark.
### Why are the changes needed?
The `mypy` check for pandas-on-Spark was disabled during the initial porting.
It should be enabled again; otherwise we will miss type checking errors.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
The enabled `mypy` check and existing unit tests should pass.
Closes#32540 from ueshin/issues/SPARK-34941/pandas_mypy.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
### What changes were proposed in this pull request?
This PR fixes the following bug:
```
set spark.sql.legacy.charVarcharAsString=true;
create table chartb01(a char(3));
insert into chartb01 select 'aaaaa';
```
Here we expect the data in table chartb01 to be 'aaa', but the insert fails.
### Why are the changes needed?
Improve backward compatibility
```
spark-sql>
> create table tchar01(col char(2)) using parquet;
Time taken: 0.767 seconds
spark-sql>
> insert into tchar01 select 'aaa';
ERROR | Executor task launch worker for task 0.0 in stage 0.0 (TID 0) | Aborting task | org.apache.spark.util.Utils.logError(Logging.scala:94)
java.lang.RuntimeException: Exceeds char/varchar type length limitation: 2
at org.apache.spark.sql.catalyst.util.CharVarcharCodegenUtils.trimTrailingSpaces(CharVarcharCodegenUtils.java:31)
at org.apache.spark.sql.catalyst.util.CharVarcharCodegenUtils.charTypeWriteSideCheck(CharVarcharCodegenUtils.java:44)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.project_doConsume_0$(Unknown Source)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage1.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:755)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$executeTask$1(FileFormatWriter.scala:279)
at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1500)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.executeTask(FileFormatWriter.scala:288)
at org.apache.spark.sql.execution.datasources.FileFormatWriter$.$anonfun$write$15(FileFormatWriter.scala:212)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
at org.apache.spark.scheduler.Task.run(Task.scala:131)
at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:497)
at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1466)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:500)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
```
### Does this PR introduce _any_ user-facing change?
No (the legacy config is false by default).
### How was this patch tested?
Added unit tests.
Closes#32501 from fhygh/master.
Authored-by: fhygh <283452027@qq.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Spark doesn't support aggregate functions with mixed outer and local references. This PR applies this check earlier to fail with a clear error message instead of some weird ones, and simplifies the related code in `SubExprUtils.getOuterReferences`. This PR also refines the error message a bit.
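A hypothetical query illustrating the unsupported pattern (it assumes tables `t1(a)` and `t2(b)` exist): inside the subquery, `sum` mixes the outer reference `t1.a` with the local reference `t2.b`, and with this change the analysis fails early with a clear message.
```scala
// Illustrative query only; fails analysis with a clear error after this change.
spark.sql("""
  SELECT *
  FROM t1
  WHERE t1.a > (SELECT sum(t1.a + t2.b) FROM t2)
""")
```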
### Why are the changes needed?
A better error message.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Updated tests.
Closes#32503 from cloud-fan/try.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Fix the incorrect statement that shuffle state is no longer needed in the event of executor failure, and document that it is needed in the case of a flaky app causing occasional executor failures.
See the StackOverflow [discussion](https://stackoverflow.com/questions/67466878/can-spark-with-external-shuffle-service-use-saved-shuffle-files-in-the-event-of/67507439#67507439).
### Why are the changes needed?
To fix the documentation and guide users as to additional use case for the Shuffle Service.
### Does this PR introduce _any_ user-facing change?
Documentation only.
### How was this patch tested?
N/A.
Closes#32538 from chrisheaththomas/shuffle-service-and-executor-failure.
Authored-by: Chris Thomas <chrisheaththomas@hotmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
This PR replaces `toStringHelper`, an API which breaks in Guava 27.
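A minimal sketch of the kind of replacement involved (the class and field names are illustrative, and the exact Guava call being replaced may differ from this example):
```scala
class HostPort(host: String, port: Int) {
  // Before (relies on Guava's Objects.toStringHelper, removed in newer Guava):
  //   Objects.toStringHelper(this).add("host", host).add("port", port).toString
  // After: plain Scala string interpolation with no Guava dependency.
  override def toString: String = s"HostPort(host=$host, port=$port)"
}
```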
### Why are the changes needed?
SPARK-30272 (#26911) removed usages which break in Guava 27, but `toStringHelper` was introduced again.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
The build finished successfully with the following command.
```
build/sbt -Dguava.version=27.0-jre -Phive -Phive-thriftserver -Pyarn -Pmesos -Pkubernetes -Phadoop-cloud -Pdocker-integration-tests -Pkubernetes-integration-tests -Pkinesis-asl -Pspark-ganglia-lgpl package
```
Closes#32567 from sarutak/remove-old-guava-usage.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Introduction: this PR is a part of SPARK-10816 (`EventTime based sessionization (session window)`). Please refer #31937 to see the overall view of the code change. (Note that code diff could be diverged a bit.)
### What changes were proposed in this pull request?
This PR introduces UpdatingSessionsIterator, which analyzes neighbor elements and adjusts session information on each element.
UpdatingSessionsIterator calculates and updates the session window for each element in the given iterator, so that elements in the same session window share the same session spec. Downstream can then apply aggregation to finally merge the elements bound to the same session window.
UpdatingSessionsIterator works on the precondition that the given iterator is sorted by "group keys + start time of session window", and the iterator still retains this sort characteristic.
UpdatingSessionsIterator copies the elements so that it can safely update each element, and it also buffers the elements bound to the same session window. Due to these overheads, MergingSessionsIterator, which will be introduced via SPARK-34889, should be used whenever possible.
This PR also introduces UpdatingSessionsExec, the physical node that leverages UpdatingSessionsIterator to sort the input rows and update session information on them.
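A rough sketch of the gap-based sessionization idea (not the actual UpdatingSessionsIterator code), assuming the input is already sorted by (key, time) as stated above:
```scala
case class Event(key: String, time: Long)

def sessionize(sorted: Seq[Event], gap: Long): Seq[(Event, (Long, Long))] = {
  val out = scala.collection.mutable.ArrayBuffer.empty[(Event, (Long, Long))]
  var buffer = List.empty[Event]  // most recent event first

  def flush(): Unit = if (buffer.nonEmpty) {
    val events = buffer.reverse
    val spec = (events.head.time, events.last.time + gap)  // one spec per session
    events.foreach(e => out += ((e, spec)))                 // every event shares the spec
    buffer = Nil
  }

  sorted.foreach { e =>
    val sameSession =
      buffer.nonEmpty && buffer.head.key == e.key && e.time <= buffer.head.time + gap
    if (!sameSession) flush()   // new key or gap exceeded: close the current session
    buffer = e :: buffer
  }
  flush()
  out.toSeq
}
```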
### Why are the changes needed?
This is one of the required parts for implementing SPARK-10816.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
New test suite added.
Closes#31986 from HeartSaVioR/SPARK-34888-SPARK-10816-PR-31570-part-1.
Lead-authored-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
Co-authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
### What changes were proposed in this pull request?
Provide a clearer error message tied to the user's Python code if incorrect parameters are passed to `DataFrame.show`, rather than the message about a missing JVM method that the user is not calling directly.
```
py4j.Py4JException: Method showString([class java.lang.Boolean, class java.lang.Integer, class java.lang.Boolean]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
at py4j.Gateway.invoke(Gateway.java:274)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:238)
at java.lang.Thread.run(Thread.java:748
```
### Why are the changes needed?
For faster debugging through an actionable error message.
### Does this PR introduce _any_ user-facing change?
No change for correct parameters, but different error messages for parameters that trigger an exception.
### How was this patch tested?
- unit test
- manually in PySpark REPL
Closes#32555 from gerashegalov/df_show_validation.
Authored-by: Gera Shegalov <gera@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR aims to add a new configuration, `spark.kubernetes.driver.reusePersistentVolumeClaim`, to reuse driver-owned `PersistentVolumeClaims` of the **deleted** executor pods.
Note also that `driver-owned PersistentVolumeClaims` is controlled by `spark.kubernetes.driver.ownPersistentVolumeClaim` which is recently added.
### Why are the changes needed?
PVC creation takes some time. This feature can reduce it by reusing existing PVCs.
For example, we can start the `Pi` app with two executors with PVCs.
```
$ k logs -f pi | grep ExecutorPodsAllocator
21/05/16 23:36:32 INFO ExecutorPodsAllocator: Going to request 2 executors from Kubernetes for ResourceProfile Id: 0, target: 2 running: 0.
21/05/16 23:36:32 INFO ExecutorPodsAllocator: Found 0 reusable PVCs from 0 PVCs
21/05/16 23:36:32 INFO ExecutorPodsAllocator: Trying to create PersistentVolumeClaim pi-exec-1-pvc-0 with StorageClass scaleio
21/05/16 23:36:33 INFO ExecutorPodsAllocator: Trying to create PersistentVolumeClaim pi-exec-2-pvc-0 with StorageClass scaleio
```
After killing one executor, Spark tries to look up reusable PVCs, but the dead executor's PVC may not be returned yet because K8s works asynchronously. In this case, Spark creates a new PVC as a normal operation.
```
21/05/16 23:38:51 INFO ExecutorPodsAllocator: Going to request 1 executors from Kubernetes for ResourceProfile Id: 0, target: 2 running: 1.
21/05/16 23:38:51 INFO ExecutorPodsAllocator: Found 0 reusable PVCs from 2 PVCs
21/05/16 23:38:51 INFO ExecutorPodsAllocator: Trying to create PersistentVolumeClaim pi-exec-3-pvc-0 with StorageClass scaleio
```
After killing another executor, Spark found one reusable PVC, `pi-exec-1-pvc-0`, and reused it.
```
21/05/16 23:39:18 INFO ExecutorPodsAllocator: Going to request 1 executors from Kubernetes for ResourceProfile Id: 0, target: 2 running: 1.
21/05/16 23:39:18 INFO ExecutorPodsAllocator: Found 1 reusable PVCs from 3 PVCs
21/05/16 23:39:18 INFO ExecutorPodsAllocator: Reuse PersistentVolumeClaim pi-exec-1-pvc-0
```
In this case, we can easily notice the remounted PVC because `ClaimName`, `pi-exec-1-pvc-0`, doesn't have the prefix of pod name, `pi-exec-4`.
```
$ k describe pod pi-exec-4 | grep pi-exec-1-pvc-0
ClaimName: pi-exec-1-pvc-0
```
### Does this PR introduce _any_ user-facing change?
Yes, but this is a new feature which is disabled by default via the new conf.
### How was this patch tested?
Pass the CIs with the newly added test case.
K8S IT test also passed.
```
KubernetesSuite:
- Run SparkPi with no resources
- Run SparkPi with a very long application name.
- Use SparkLauncher.NO_RESOURCE
- Run SparkPi with a master URL without a scheme.
- Run SparkPi with an argument.
- Run SparkPi with custom labels, annotations, and environment variables.
- All pods have the same service account by default
- Run extraJVMOptions check on driver
- Run SparkRemoteFileTest using a remote data file
- Verify logging configuration is picked from the provided SPARK_CONF_DIR/log4j.properties
- Run SparkPi with env and mount secrets.
- Run PySpark on simple pi.py example
- Run PySpark to test a pyfiles example
- Run PySpark with memory customization
- Run in client mode.
- Start pod creation from template
- Launcher client dependencies
- SPARK-33615: Launcher client archives
- SPARK-33748: Launcher python client respecting PYSPARK_PYTHON
- SPARK-33748: Launcher python client respecting spark.pyspark.python and spark.pyspark.driver.python
- Launcher python client dependencies using a zip file
- Test basic decommissioning
- Test basic decommissioning with shuffle cleanup
- Test decommissioning with dynamic allocation & shuffle cleanups
- Test decommissioning timeouts
- Run SparkR on simple dataframe.R example
Run completed in 17 minutes, 7 seconds.
Total number of tests run: 26
Suites: completed 2, aborted 0
Tests: succeeded 26, failed 0, canceled 0, ignored 0, pending 0
All tests passed.
...
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 24:14 min
[INFO] Finished at: 2021-05-16T17:24:40-07:00
[INFO] ------------------------------------------------------------------------
```
Closes#32564 from dongjoon-hyun/SPARK-35416.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This fixes the compilation error due to the logical conflicts between https://github.com/apache/spark/pull/31776 and https://github.com/apache/spark/pull/29642 .
### Why are the changes needed?
To recover compilation.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Closes#32568 from wangyum/HOT-FIX.
Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This PR aims to upgrade SBT to 1.5.2 for better Scala 2.13.x support.
### Why are the changes needed?
SBT 1.5.2 Release Note: https://github.com/sbt/sbt/releases/tag/v1.5.2
- Fixes ConcurrentModificationException while compiling Scala 2.13.4 and Java sources zinc
- Uses -Duser.home instead of $HOME to download launcher JAR
- Fixes -client by making it the same as --client
- Fixes metabuild ClassLoader missing util-interface
- Fixes sbt new leaving behind target directory
- Fixes "zip END header not found" error during pushRemoteCache
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass the CIs.
Closes#32565 from dongjoon-hyun/SPARK-35417.
Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR proposes to use the SHA of the latest commit (`2a5078a782192ddb6efbcead8de9973d6ab4f069`) when checking out `databricks/tpcds-kit`. This can prevent the test workflow from breaking accidentally if the repository changes drastically.
### Why are the changes needed?
For better test workflow.
### Does this PR introduce _any_ user-facing change?
No, dev-only.
### How was this patch tested?
GA passed.
Closes#32561 from maropu/UseRefInCheckout.
Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
To fix the bug below in groupBy of year-month/day-time intervals, this PR proposes to make `HashMapGenerator` handle the two types for hash-aggregates:
```
scala> Seq(java.time.Duration.ofDays(1)).toDF("a").groupBy("a").count().show()
scala.MatchError: DayTimeIntervalType (of class org.apache.spark.sql.types.DayTimeIntervalType$)
at org.apache.spark.sql.execution.aggregate.HashMapGenerator.genComputeHash(HashMapGenerator.scala:159)
at org.apache.spark.sql.execution.aggregate.HashMapGenerator.$anonfun$generateHashFunction$1(HashMapGenerator.scala:102)
at scala.collection.TraversableLike.$anonfun$map$1(TraversableLike.scala:238)
at scala.collection.immutable.List.foreach(List.scala:392)
at scala.collection.TraversableLike.map(TraversableLike.scala:238)
at scala.collection.TraversableLike.map$(TraversableLike.scala:231)
at scala.collection.immutable.List.map(List.scala:298)
at org.apache.spark.sql.execution.aggregate.HashMapGenerator.genHashForKeys$1(HashMapGenerator.scala:99)
at org.apache.spark.sql.execution.aggregate.HashMapGenerator.generateHashFunction(HashMapGenerator.scala:111)
```
### Why are the changes needed?
Bugfix.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Added a unit test.
Closes#32560 from maropu/FixIntervalIssue.
Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
This is a followup from the discussion in https://github.com/apache/spark/pull/32495#discussion_r632283178 . The hardcoded function name `findNextJoinRows` is not a real problem now, as we always do code generation for SMJ's children separately. But this change makes it future-proof in case this assumption changes.
### Why are the changes needed?
Fix the potential reliability issue.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing unit tests.
Closes#32548 from c21/smj-followup.
Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
`OriginalType` and `DecimalMetadata` have been marked as `Deprecated` in newer Parquet code.
Apache Parquet suggests replacing `OriginalType` with `LogicalTypeAnnotation` and `DecimalMetadata` with `DecimalLogicalTypeAnnotation`, so the main change of this PR is to clean up these deprecated usages in the Parquet-related code.
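A hedged sketch of the migration direction (not the exact Spark code; it assumes the standard Parquet `LogicalTypeAnnotation` API):
```scala
import org.apache.parquet.schema.LogicalTypeAnnotation.DecimalLogicalTypeAnnotation
import org.apache.parquet.schema.PrimitiveType

// Instead of the deprecated getOriginalType / getDecimalMetadata, read the
// logical type annotation and match on its concrete subtype.
def decimalPrecisionScale(t: PrimitiveType): Option[(Int, Int)] =
  t.getLogicalTypeAnnotation match {
    case d: DecimalLogicalTypeAnnotation => Some((d.getPrecision, d.getScale))
    case _                               => None
  }
```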
### Why are the changes needed?
Cleanup deprecated api usage.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass the Jenkins or GitHub Action
Closes#31776 from LuciferYang/cleanup-parquet-dep-api.
Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Fix log info in BroadcastExchangeExec.scala
### Why are the changes needed?
The log message `s"Cannot broadcast the table that is larger than 8GB: ${dataSize >> 30} GB"` is not accurate, because the 8GB figure is not accurate.
### Does this PR introduce _any_ user-facing change?
Yes, the log message changes.
### How was this patch tested?
No new tests.
Closes#32544 from LittleCuteBug/SPARK-32484.
Authored-by: QuangHuyViettel <quanghuynguyen236@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Add required imports to Pyspark ML examples in CrossValidator, TrainValidationSplit
### Why are the changes needed?
The examples pass doctests because of previous imports, but as they appear in the PySpark documentation they are incomplete. The additional imports are required to make the examples work.
### Does this PR introduce _any_ user-facing change?
No, docs only change.
### How was this patch tested?
Existing tests.
Closes#32554 from srowen/TuningImports.
Authored-by: Sean Owen <srowen@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
Move hash map lookup operation out of `InvokeLike.invoke` since it doesn't depend on the input.
### Why are the changes needed?
We shouldn't need to look up the hash map for every input row evaluated by `InvokeLike.invoke` since the lookup doesn't depend on the input. This could speed up performance a bit.
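A small sketch of the hoisting pattern (names are hypothetical, not the actual `InvokeLike` code): the lookup is resolved once and reused for every row instead of being repeated inside the per-row call.
```scala
class MethodInvoker(methodName: String) {
  // Resolved once, lazily, rather than on every call to eval().
  private lazy val resolvedMethod = classOf[String].getMethod(methodName)

  def eval(input: String): Any = resolvedMethod.invoke(input)  // per-row work only
}

new MethodInvoker("length").eval("spark")  // 5
```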
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing tests.
Closes#32532 from sunchao/SPARK-35384-follow-up.
Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
### What changes were proposed in this pull request?
Add a common function `getWorkspaceFilePath` (which prefixes paths with the Spark home) to `SparkFunctionSuite`, and apply the function to the places it was extracted from.
### Why are the changes needed?
Spark SQL has test suites that read resources when running tests. The way of getting the path of these resources is common across different suites, so we can extract it into a function to ease code maintenance.
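A minimal sketch of what such a helper might look like (it assumes the Spark home is available via the SPARK_HOME environment variable; the real helper may resolve it differently):
```scala
import java.nio.file.{Path, Paths}

def getWorkspaceFilePath(first: String, more: String*): Path = {
  val sparkHome = sys.env.getOrElse("SPARK_HOME", ".")
  Paths.get(sparkHome, (first +: more): _*)
}

// e.g. getWorkspaceFilePath("sql", "core", "src", "test", "resources")
```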
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass existing tests.
Closes#32315 from Ngone51/extract-common-file-path.
Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Make these threads easier to identify in thread dumps.
### Why are the changes needed?
Make these threads easier to identify in thread dumps.
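For example (the actual thread names used in this PR may differ), `java.util.Timer` accepts an explicit name that then shows up in thread dumps:
```scala
// A named, daemon Timer is easy to spot in a thread dump.
val timer = new java.util.Timer("driver-timeout-timer", true)
```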
### Does this PR introduce _any_ user-facing change?
Yes. Driver thread dumps will show the timers with pretty names.
### How was this patch tested?
Verified locally.
Closes#32549 from yaooqinn/SPARK-35404.
Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Refine comment in `CacheManager`.
### Why are the changes needed?
Avoid misleading developers.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Not needed.
Closes#32543 from ulysses-you/SPARK-35332-FOLLOWUP.
Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Kent Yao <yao@apache.org>
### What changes were proposed in this pull request?
Add toc tag on monitoring.md
### Why are the changes needed?
Fix the doc.
### Does this PR introduce _any_ user-facing change?
Yes, the table of contents of the monitoring page will be shown on the official doc site.
### How was this patch tested?
Passed the GA doc build.
Closes#32545 from yaooqinn/minor.
Authored-by: Kent Yao <yao@apache.org>
Signed-off-by: Kent Yao <yao@apache.org>
### What changes were proposed in this pull request?
Generally, we would expect that x = y implies hash(x) = hash(y). However, +0.0 and -0.0 hash to different values for floating point types.
```
scala> spark.sql("select hash(cast('0.0' as double)), hash(cast('-0.0' as double))").show
+-------------------------+--------------------------+
|hash(CAST(0.0 AS DOUBLE))|hash(CAST(-0.0 AS DOUBLE))|
+-------------------------+--------------------------+
| -1670924195| -853646085|
+-------------------------+--------------------------+
scala> spark.sql("select cast('0.0' as double) == cast('-0.0' as double)").show
+--------------------------------------------+
|(CAST(0.0 AS DOUBLE) = CAST(-0.0 AS DOUBLE))|
+--------------------------------------------+
| true|
+--------------------------------------------+
```
Here is an extract from IEEE 754:
> The two zeros are distinguishable arithmetically only by either division-by-zero (producing appropriately signed infinities) or else by the CopySign function recommended by IEEE 754/854. Infinities, SNaNs, NaNs and Subnormal numbers necessitate four more special cases
From this, I deduce that the hash function must produce the same result for 0 and -0.
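A minimal sketch of the usual fix for this class of issue (normalizing -0.0 to 0.0 before it reaches the hash function); the actual change lives in Spark's hash expressions:
```scala
def normalizeZero(d: Double): Double = if (d == 0.0d) 0.0d else d  // -0.0 == 0.0 is true

// Both values now feed identical bits to the hash function.
java.lang.Double.doubleToLongBits(normalizeZero(-0.0)) ==
  java.lang.Double.doubleToLongBits(normalizeZero(0.0))  // true
```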
### Why are the changes needed?
It is a correctness issue
### Does this PR introduce _any_ user-facing change?
This change only affects the hash function applied to the -0.0 value for float and double types.
### How was this patch tested?
Unit testing and manual testing
Closes#32496 from planga82/feature/spark35207_hashnegativezero.
Authored-by: Pablo Langa <soypab@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR is a followup of https://github.com/apache/spark/pull/32436, which broke the JavaScript linter. There was a logical conflict: the linter was added after the last successful test run in that PR.
```
added 118 packages in 1.482s
/__w/spark/spark/core/src/main/resources/org/apache/spark/ui/static/executorspage.js
34:41 error 'type' is defined but never used. Allowed unused args must match /^_ignored_.*/u no-unused-vars
34:47 error 'row' is defined but never used. Allowed unused args must match /^_ignored_.*/u no-unused-vars
35:1 error Expected indentation of 2 spaces but found 4 indent
36:1 error Expected indentation of 4 spaces but found 7 indent
37:1 error Expected indentation of 2 spaces but found 4 indent
38:1 error Expected indentation of 4 spaces but found 7 indent
39:1 error Expected indentation of 2 spaces but found 4 indent
556:1 error Expected indentation of 14 spaces but found 16 indent
557:1 error Expected indentation of 14 spaces but found 16 indent
```
### Why are the changes needed?
To recover the build
### Does this PR introduce _any_ user-facing change?
No, dev-only.
### How was this patch tested?
Manually tested:
```bash
./dev/lint-js
lint-js checks passed.
```
Closes#32541 from HyukjinKwon/SPARK-34764-followup.
Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
### What changes were proposed in this pull request?
In this PR I'm adding Structured Streaming Web UI state information documentation.
### Why are the changes needed?
Missing documentation.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
```
cd docs/
SKIP_API=1 bundle exec jekyll build
```
Manual webpage check.
Closes#32433 from gaborgsomogyi/SPARK-35311.
Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
### What changes were proposed in this pull request?
Adds the exec loss reason to the Spark web UI and, in doing so, also fixes the Kubernetes integration to pass the exec loss reason into core.
UI change:
![image](https://user-images.githubusercontent.com/59893/117045762-b975ba80-acc4-11eb-9679-8edab3cfadc2.png)
### Why are the changes needed?
Debugging Spark jobs is *hard*, making it clearer why executors have exited could help.
### Does this PR introduce _any_ user-facing change?
Yes, a new column on the executor page.
### How was this patch tested?
K8s unit test updated to validate exec loss reasons are passed through regardless of exec alive state, manual testing to validate the UI.
Closes#32436 from holdenk/SPARK-34764-propegate-reason-for-exec-loss.
Lead-authored-by: Holden Karau <hkarau@apple.com>
Co-authored-by: Holden Karau <holden@pigscanfly.ca>
Signed-off-by: Holden Karau <hkarau@apple.com>
### What changes were proposed in this pull request?
This patch replaces `sys.err` usages with explicit exception types.
### Why are the changes needed?
Motivated by the previous comment https://github.com/apache/spark/pull/32519#discussion_r630787080, it sounds better to replace `sys.err` usages with explicit exception types.
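A hypothetical example of the replacement pattern (not one of the actual call sites): prefer a specific exception type over `sys.error`, which only throws a generic `RuntimeException`.
```scala
def parseMode(mode: String): String = mode match {
  case "append" | "overwrite" => mode
  // before: case other => sys.error(s"Unknown mode: $other")
  case other => throw new IllegalArgumentException(s"Unknown mode: $other")
}
```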
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes#32535 from viirya/replace-sys-err.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>