### What changes were proposed in this pull request?
This PR proposes:
1. To introduce an `InheritableThread` class that works identically to `threading.Thread` but can inherit the inheritable attributes of a JVM thread, such as `InheritableThreadLocal`.
This was a problem with pinned thread mode; see also https://github.com/apache/spark/pull/24898. Now it works as below:
```python
import pyspark
spark.sparkContext.setLocalProperty("a", "hi")
def print_prop():
    print(spark.sparkContext.getLocalProperty("a"))
pyspark.InheritableThread(target=print_prop).start()
```
```
hi
```
2. Also, it adds a resource leak fix into `InheritableThread`. Py4J leaks threads and does not close the connection from Python to the JVM. `InheritableThread` manually closes the connections when PVM garbage collection happens, so JVM threads finish safely. I verified this manually by profiling, but there is also another easy way to verify it:
```bash
PYSPARK_PIN_THREAD=true ./bin/pyspark
```
```python
>>> from threading import Thread
>>> Thread(target=lambda: spark.range(1000).collect()).start()
>>> Thread(target=lambda: spark.range(1000).collect()).start()
>>> Thread(target=lambda: spark.range(1000).collect()).start()
>>> spark._jvm._gateway_client.deque
deque([<py4j.clientserver.ClientServerConnection object at 0x119f7aba8>, <py4j.clientserver.ClientServerConnection object at 0x119fc9b70>, <py4j.clientserver.ClientServerConnection object at 0x119fc9e10>, <py4j.clientserver.ClientServerConnection object at 0x11a015358>, <py4j.clientserver.ClientServerConnection object at 0x119fc00f0>])
>>> Thread(target=lambda: spark.range(1000).collect()).start()
>>> spark._jvm._gateway_client.deque
deque([<py4j.clientserver.ClientServerConnection object at 0x119f7aba8>, <py4j.clientserver.ClientServerConnection object at 0x119fc9b70>, <py4j.clientserver.ClientServerConnection object at 0x119fc9e10>, <py4j.clientserver.ClientServerConnection object at 0x11a015358>, <py4j.clientserver.ClientServerConnection object at 0x119fc08d0>, <py4j.clientserver.ClientServerConnection object at 0x119fc00f0>])
```
This issue is fixed now.
3. Because we now have a fix for this issue, it also proposes to deprecate `collectWithJobGroup`, which was a temporary workaround added to avoid the leak.
### Why are the changes needed?
To support pinned thread mode properly, without a resource leak and with proper inheritance of local properties.
### Does this PR introduce _any_ user-facing change?
Yes, it adds a new `InheritableThread` class for pinned thread mode.
### How was this patch tested?
Manually tested as described above, and a unit test was added as well.
Closes#28968 from HyukjinKwon/SPARK-32010.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR aims to recover the Java 11 build in GitHub Actions.
### Why are the changes needed?
This test coverage was removed before; now it's time to recover it.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass the GitHub Action.
Closes#29295 from dongjoon-hyun/SPARK-32248.
Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This PR aims to remove `javax.ws.rs.NotFoundException` from two problematic `import` statements. All the other use cases are correct.
### Why are the changes needed?
In `StagesResource` and `OneApplicationResource`, there exist two `NotFoundException`s.
- javax.ws.rs.NotFoundException
- org.apache.spark.status.api.v1.NotFoundException
To use `org.apache.spark.status.api.v1.NotFoundException` correctly, we should not import `javax.ws.rs.NotFoundException`. The wrong import causes UT failures in the Scala 2.13 environment.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
- Scala 2.12: Pass the GitHub Action or Jenkins.
- Scala 2.13: Do the following manually.
```
$ dev/change-scala-version.sh 2.13
$ build/mvn test -pl core --am -Pscala-2.13 -Dtest=none -DwildcardSuites=org.apache.spark.deploy.history.HistoryServerSuite
```
**BEFORE**
```
*** 4 TESTS FAILED ***
```
**AFTER**
```
*** 1 TEST FAILED ***
```
Closes#29293 from dongjoon-hyun/SPARK-32487.
Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
Upgrade the Codehaus Maven build helper plugin to allow people to specify a time during the build, to avoid snapshot artifacts with different version strings.
### Why are the changes needed?
During snapshot builds, Maven may assign different versions to different artifacts based on the time each individual sub-module starts building.
The timestamp is used as part of the version string when running `mvn deploy` on a snapshot build. This results in different sub-modules having different version strings.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Manual build while specifying the current time, ensured the time is consistent in the sub components.
Open question: ideally I'd like to backport this as well since it's sort of a bug fix, and while it does change a dependency version, it's not one that is propagated. I'd like to hear folks' thoughts on this.
Closes#29274 from holdenk/SPARK-32397-snapshot-artifact-timestamp-differences.
Authored-by: Holden Karau <hkarau@apple.com>
Signed-off-by: DB Tsai <d_tsai@apple.com>
### What changes were proposed in this pull request?
This PR adds abstract classes for shuffle and broadcast, so that users can provide their columnar implementations.
This PR updates several places to use the abstract exchange classes, and also updates `AdaptiveSparkPlanExec` so that the columnar rules can see exchange nodes.
This is an alternative of https://github.com/apache/spark/pull/29134 .
Closes https://github.com/apache/spark/pull/29134
### Why are the changes needed?
To allow columnar exchanges.
### Does this PR introduce _any_ user-facing change?
no
### How was this patch tested?
new tests
Closes#29262 from cloud-fan/columnar.
Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Thomas Graves <tgraves@apache.org>
### What changes were proposed in this pull request?
Document the stage level scheduling feature.
### Why are the changes needed?
Document the stage level scheduling feature.
### Does this PR introduce _any_ user-facing change?
Documentation.
### How was this patch tested?
n/a docs only
Closes#29292 from tgravescs/SPARK-30322.
Authored-by: Thomas Graves <tgraves@nvidia.com>
Signed-off-by: Thomas Graves <tgraves@apache.org>
### What changes were proposed in this pull request?
This PR aims to make `ResourceAllocator.availableAddrs` deterministic.
### Why are the changes needed?
Currently, this function returns results non-deterministically due to the underlying `HashMap`. So, the test case creates a list `[0, 1, 2]` initially, but ends up comparing against `[2, 1, 0]`.
Not only does this happen in 3.0.0, but it also causes UT failures in the Scala 2.13 environment.
### Does this PR introduce _any_ user-facing change?
Yes, but this fixes the non-deterministic behavior.
### How was this patch tested?
- Scala 2.12: This should pass the UT with the modified test case.
- Scala 2.13: This can be tested like the following (at least `JsonProtocolSuite`)
```
$ dev/change-scala-version.sh 2.13
$ build/mvn test -pl core --am -Pscala-2.13 -Dtest=none -DwildcardSuites=org.apache.spark.deploy.JsonProtocolSuite
```
**BEFORE**
```
*** 2 TESTS FAILED ***
```
**AFTER**
```
All tests passed.
```
Closes#29281 from dongjoon-hyun/SPARK-32476.
Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
In this PR, I propose to support pushed-down filters in the Avro datasource V1 and V2.
1. Added new SQL config `spark.sql.avro.filterPushdown.enabled` to control filters pushdown to Avro datasource. It is on by default.
2. Renamed `CSVFilters` to `OrderedFilters`.
3. `OrderedFilters` is used in `AvroFileFormat` (DSv1) and in `AvroPartitionReaderFactory` (DSv2).
4. Modified `AvroDeserializer` to return `None` from the `deserialize` method when pushdown filters return `false` (a usage sketch follows this list).
### Why are the changes needed?
The changes improve performance on synthetic benchmarks up to **2** times on JDK 11:
```
OpenJDK 64-Bit Server VM 11.0.7+10-post-Ubuntu-2ubuntu218.04 on Linux 4.15.0-1063-aws
Intel(R) Xeon(R) CPU E5-2670 v2 2.50GHz
Filters pushdown: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
w/o filters 9614 9669 54 0.1 9614.1 1.0X
pushdown disabled 10077 10141 66 0.1 10077.2 1.0X
w/ filters 4681 4713 29 0.2 4681.5 2.1X
```
### Does this PR introduce any user-facing change?
No
### How was this patch tested?
- Added UT to `AvroCatalystDataConversionSuite` and `AvroSuite`
- Re-running `AvroReadBenchmark` using Amazon EC2:
| Item | Description |
| ---- | ----|
| Region | us-west-2 (Oregon) |
| Instance | r3.xlarge (spot instance) |
| AMI | ami-06f2f779464715dc5 (ubuntu/images/hvm-ssd/ubuntu-bionic-18.04-amd64-server-20190722.1) |
| Java | OpenJDK8/11 installed by `sudo add-apt-repository ppa:openjdk-r/ppa` & `sudo apt install openjdk-11-jdk` |
and `./dev/run-benchmarks`:
```python
#!/usr/bin/env python3
import os
from sparktestsupport.shellutils import run_cmd
benchmarks = [
    ['avro/test', 'org.apache.spark.sql.execution.benchmark.AvroReadBenchmark']
]
print('Set SPARK_GENERATE_BENCHMARK_FILES=1')
os.environ['SPARK_GENERATE_BENCHMARK_FILES'] = '1'
for b in benchmarks:
    print("Run benchmark: %s" % b[1])
    run_cmd(['build/sbt', '%s:runMain %s' % (b[0], b[1])])
```
Closes#29145 from MaxGekk/avro-filters-pushdown.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
### What changes were proposed in this pull request?
Add training summary to MultilayerPerceptronClassificationModel...
### Why are the changes needed?
So that users can get the training process status, such as the loss value of each iteration and the total iteration number.
### Does this PR introduce _any_ user-facing change?
Yes
MultilayerPerceptronClassificationModel.summary
MultilayerPerceptronClassificationModel.evaluate
### How was this patch tested?
new tests
Closes#29250 from huaxingao/mlp_summary.
Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
This PR aims to make `JsonProtocol.accumulablesToJson` deterministic.
### Why are the changes needed?
Currently, `JsonProtocol.accumulablesToJson` is non-deterministic, so `JsonProtocolSuite` itself also uses mixed-order test cases in terms of `"Accumulables": [ ... ]`.
Not only is this non-deterministic, but it also causes a UT failure in `JsonProtocolSuite` in Scala 2.13.
### Does this PR introduce _any_ user-facing change?
Yes. However, this is a fix of non-deterministic behavior.
### How was this patch tested?
- Scala 2.12: Pass the GitHub Action or Jenkins.
- Scala 2.13: Do the following.
```
$ dev/change-scala-version.sh 2.13
$ build/mvn test -pl core --am -Pscala-2.13 -Dtest=none -DwildcardSuites=org.apache.spark.util.JsonProtocolSuite
```
**BEFORE**
```
*** 1 TEST FAILED ***
```
**AFTER**
```
All tests passed.
```
Closes#29282 from dongjoon-hyun/SPARK-32477.
Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This PR changes the order between `ExecutorPlugin` initialization and starting the heartbeat thread in Executor.
### Why are the changes needed?
In the current master, the heartbeat thread in an executor starts after plugin initialization, so if the initialization takes a long time, heartbeats are not sent to the driver and the executor will be removed from the cluster.
### Does this PR introduce _any_ user-facing change?
Yes. Plugins for executors will be allowed to take a long time for initialization.
### How was this patch tested?
New testcase.
Closes#29002 from sarutak/fix-heartbeat-issue.
Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Thomas Graves <tgraves@apache.org>
### What changes were proposed in this pull request?
`spark.kryo.registrator` in 3.0 has a regression problem. Since [SPARK-12080](https://issues.apache.org/jira/browse/SPARK-12080), it has supported multiple user registrators via
```scala
private val userRegistrators = conf.get("spark.kryo.registrator", "")
  .split(',').map(_.trim)
  .filter(!_.isEmpty)
```
But it doesn't work in 3.0. Fix it with `toSequence` in `Kryo.scala`.
### Why are the changes needed?
In previous Spark versions (2.x), multiple user registrators were supported via
```scala
private val userRegistrators = conf.get("spark.kryo.registrator", "")
  .split(',').map(_.trim)
  .filter(!_.isEmpty)
```
But it doesn't work in 3.0, so it should be considered a regression.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Existing unit tests.
Closes#29123 from LantaoJin/SPARK-32283.
Authored-by: LantaoJin <jinlantao@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR proposes to migrate the following function related commands to use `UnresolvedFunc` to resolve function identifier:
- DROP FUNCTION
- DESCRIBE FUNCTION
- SHOW FUNCTIONS
`DropFunctionStatement`, `DescribeFunctionStatement` and `ShowFunctionsStatement` logical plans are replaced with `DropFunction`, `DescribeFunction` and `ShowFunctions` logical plans respectively, and each contains `UnresolvedFunc` as its child so that it can be resolved in `Analyzer`.
### Why are the changes needed?
Migrating to the new resolution framework, which resolves `UnresolvedFunc` in `Analyzer`.
### Does this PR introduce _any_ user-facing change?
The exception message thrown when a catalog is resolved to v2 has been unified to:
`function is only supported in v1 catalog`
Previously, it printed out the command used. E.g.,:
`CREATE FUNCTION is only supported in v1 catalog`
### How was this patch tested?
Updated existing tests.
Closes#29198 from imback82/function_framework.
Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
1. Describe the JSON option `allowNonNumericNumbers`, which is used on read (a read example follows this list).
2. Add new test cases for allowed JSON field values: NaN, +INF, +Infinity, Infinity, -INF and -Infinity
### Why are the changes needed?
To improve the UX of Spark SQL and to provide users with full info about the supported option.
### Does this PR introduce _any_ user-facing change?
Yes, in PySpark.
### How was this patch tested?
Added new test to `JsonParsingOptionsSuite`
Closes#29275 from MaxGekk/allowNonNumericNumbers-doc.
Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Updates to tests to use correctly sized `getInt` or `getLong` calls.
### Why are the changes needed?
The reads were incorrectly sized (i.e. `putLong` paired with `getInt` and `putInt` paired with `getLong`). This causes test failures on big-endian systems.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Tests were run on a big-endian system (s390x). This change is unlikely to have any practical effect on little-endian systems.
Closes#29258 from mundaym/fix-row.
Authored-by: Michael Munday <mike.munday@ibm.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
Rewrite the BLAS native acceleration enabling guide to be clearer and more complete.
### Why are the changes needed?
The documentation for enabling BLAS native acceleration in the ML guide (https://spark.apache.org/docs/latest/ml-guide.html#dependencies) is incomplete and unclear to users.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
N/A
Closes#29139 from xwu99/blas-doc.
Lead-authored-by: Xiaochang Wu <xiaochang.wu@intel.com>
Co-authored-by: Wu, Xiaochang <xiaochang.wu@intel.com>
Signed-off-by: Huaxin Gao <huaxing@us.ibm.com>
### What changes were proposed in this pull request?
This PR supports `WrappedArray` as `customCollectionCls` in `MapObjects`.
### Why are the changes needed?
This helps fix the regression caused by SPARK-31826. The following test passes in branch-3.0 but fails in the master branch:
```scala
test("WrappedArray") {
val myUdf = udf((a: WrappedArray[Int]) =>
WrappedArray.make[Int](Array(a.head + 99)))
checkAnswer(Seq(Array(1))
.toDF("col")
.select(myUdf(Column("col"))),
Row(ArrayBuffer(100)))
}
```
In SPARK-31826, we've changed the catalyst-to-scala converter from `CatalystTypeConverters` to `ExpressionEncoder.deserializer`. However, `CatalystTypeConverters` supports `WrappedArray` while `ExpressionEncoder.deserializer` doesn't.
### Does this PR introduce _any_ user-facing change?
No, SPARK-31826 is merged into master and branch-3.1, which haven't been released.
### How was this patch tested?
Added a new test for `WrappedArray` in `UDFSuite`; Also updated `ObjectExpressionsSuite` for `MapObjects`.
Closes#29261 from Ngone51/fix-wrappedarray.
Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Normally, a null-aware anti join will be planned into a BroadcastNestedLoopJoin, which is very time consuming, for instance in TPCH Query 16.
```
select
p_brand,
p_type,
p_size,
count(distinct ps_suppkey) as supplier_cnt
from
partsupp,
part
where
p_partkey = ps_partkey
and p_brand <> 'Brand#45'
and p_type not like 'MEDIUM POLISHED%'
and p_size in (49, 14, 23, 45, 19, 3, 36, 9)
and ps_suppkey not in (
select
s_suppkey
from
supplier
where
s_comment like '%Customer%Complaints%'
)
group by
p_brand,
p_type,
p_size
order by
supplier_cnt desc,
p_brand,
p_type,
p_size
```
In the above query, the `not in` subquery will be planned into a LeftAnti join with condition `Or((ps_suppkey=s_suppkey), IsNull(ps_suppkey=s_suppkey))`.
Inside BroadcastNestedLoopJoinExec this performs O(M\*N) work, BUT if there is only a single column in the NAAJ, we can always turn the build side into a HashSet, and the streamed side then just needs to look up in the HashSet, so the calculation is optimized to O(M).
But this optimization only targets the single-column null-aware anti join case, because multi-column support is much more complicated; we might be able to support multiple columns in the future. A conceptual sketch of the single-column approach is shown after the figure below.
After applying this patch, TPCH Query 16's runtime decreases from 41 minutes to 30 seconds.
The semantics of null-aware anti join are:
![image](https://user-images.githubusercontent.com/17242071/88077041-66a39a00-cbad-11ea-8fb6-c235c4d219b4.png)
### Why are the changes needed?
TPCH is a common benchmark for distributed compute engines. All of the other 21 queries work fine on Spark except for Query 16; applying this patch will make Spark more competitive among these popular engines. BTW, this patch has restricted rules and only applies to the single-column NAAJ case, which is safe enough.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
1. SQLQueryTestSuite with NOT IN SQL queries, adding a CONFIG_DIM with spark.sql.optimizeNullAwareAntiJoin on and off.
2. Added a case in org.apache.spark.sql.JoinSuite.
3. Added a case in org.apache.spark.sql.SubquerySuite.
4. Compared performance before and after applying this patch against TPCH Query 16.
5. Config combinations against the e2e test with the following:
```
Map(
"spark.sql.optimizeNullAwareAntiJoin" -> "true",
"spark.sql.adaptive.enabled" -> "false",
"spark.sql.codegen.wholeStage" -> "false"
),
Map(
"sspark.sql.optimizeNullAwareAntiJoin" -> "true",
"spark.sql.adaptive.enabled" -> "false",
"spark.sql.codegen.wholeStage" -> "true"
),
Map(
"spark.sql.optimizeNullAwareAntiJoin" -> "true",
"spark.sql.adaptive.enabled" -> "true",
"spark.sql.codegen.wholeStage" -> "false"
),
Map(
"spark.sql.optimizeNullAwareAntiJoin" -> "true",
"spark.sql.adaptive.enabled" -> "true",
"spark.sql.codegen.wholeStage" -> "true"
)
```
Closes#29104 from leanken/leanken-SPARK-32290.
Authored-by: xuewei.linxuewei <xuewei.linxuewei@alibaba-inc.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
Return an empty list instead of `None` when calling `df.head()` on an empty DataFrame.
### Why are the changes needed?
`df.head()` and `df.head(1)` are inconsistent when df is empty.
### Does this PR introduce _any_ user-facing change?
Yes. If a user relies on `df.head()` returning `None`, code like `if df.head() is None:` will be broken.
### How was this patch tested?
Closes#29214 from tianshizz/SPARK-31525.
Authored-by: Tianshi Zhu <zhutianshirea@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Fixes spacing in an error message
### Why are the changes needed?
Makes error messages easier to read
### Does this PR introduce _any_ user-facing change?
Yes, it changes the error message
### How was this patch tested?
This patch doesn't affect any logic, so existing tests should cover it
Closes#29264 from hauntsaninja/patch-1.
Authored-by: Shantanu <12621235+hauntsaninja@users.noreply.github.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR is intended to solve schema pruning not working with window functions, as described in SPARK-32059. It also solves schema pruning not working with `Sort`, and generalizes to `Project->Filter->[any node that can be pruned]`.
### Why are the changes needed?
This is needed because of performance issues when querying nested structures with window functions as well as sorting.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Introduced two tests: 1) optimizer planning level 2) end-to-end tests with SQL queries.
Closes#28898 from frankyin-factual/master.
Authored-by: Frank Yin <frank@factual.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
### What changes were proposed in this pull request?
logParam `thresholds` in DT/GBT/FM/LR/MLP
### Why are the changes needed?
param `thresholds` is logged in NB/RF, but not in other ProbabilisticClassifier
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
existing testsuites
Closes#29257 from zhengruifeng/instr.logParams_add_thresholds.
Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Huaxin Gao <huaxing@us.ibm.com>
### What changes were proposed in this pull request?
This PR aims to use `command -v` in non-Window operating systems instead of executing the given command.
### Why are the changes needed?
1. `command` is POSIX-compatible
- **POSIX.1-2017**: https://pubs.opengroup.org/onlinepubs/9699919799/utilities/command.html
2. `command` is faster and safer than the direct execution
- `command` doesn't invoke another process.
```scala
scala> sys.process.Process("ls").run().exitValue()
LICENSE
NOTICE
bin
doc
lib
man
res1: Int = 0
```
3. The existing way behaves inconsistently.
- `rm` cannot be checked.
**AS-IS**
```scala
scala> sys.process.Process("rm").run().exitValue()
usage: rm [-f | -i] [-dPRrvW] file ...
unlink file
res0: Int = 64
```
**TO-BE**
```
Welcome to Scala 2.13.3 (OpenJDK 64-Bit Server VM, Java 1.8.0_262).
Type in expressions for evaluation. Or try :help.
scala> sys.process.Process(Seq("sh", "-c", s"command -v ls")).run().exitValue()
/bin/ls
val res1: Int = 0
```
4. The existing logic is already broken in the Scala 2.13 environment because it hangs like the following.
```scala
$ bin/scala
Welcome to Scala 2.13.3 (OpenJDK 64-Bit Server VM, Java 1.8.0_262).
Type in expressions for evaluation. Or try :help.
scala> sys.process.Process("cat").run().exitValue() // hang here.
```
### Does this PR introduce _any_ user-facing change?
No. Although this is inside the `main` source directory, it is used for testing purposes.
```
$ git grep testCommandAvailable | grep -v 'def testCommandAvailable'
core/src/test/scala/org/apache/spark/rdd/PipedRDDSuite.scala: assume(TestUtils.testCommandAvailable("cat"))
core/src/test/scala/org/apache/spark/rdd/PipedRDDSuite.scala: assume(TestUtils.testCommandAvailable("wc"))
core/src/test/scala/org/apache/spark/rdd/PipedRDDSuite.scala: assume(TestUtils.testCommandAvailable("cat"))
core/src/test/scala/org/apache/spark/rdd/PipedRDDSuite.scala: assume(TestUtils.testCommandAvailable("cat"))
core/src/test/scala/org/apache/spark/rdd/PipedRDDSuite.scala: assume(TestUtils.testCommandAvailable("cat"))
core/src/test/scala/org/apache/spark/rdd/PipedRDDSuite.scala: assume(TestUtils.testCommandAvailable(envCommand))
core/src/test/scala/org/apache/spark/rdd/PipedRDDSuite.scala: assume(!TestUtils.testCommandAvailable("some_nonexistent_command"))
core/src/test/scala/org/apache/spark/rdd/PipedRDDSuite.scala: assume(TestUtils.testCommandAvailable("cat"))
core/src/test/scala/org/apache/spark/rdd/PipedRDDSuite.scala: assume(TestUtils.testCommandAvailable("cat"))
core/src/test/scala/org/apache/spark/rdd/PipedRDDSuite.scala: assume(TestUtils.testCommandAvailable(envCommand))
sql/core/src/test/scala/org/apache/spark/sql/IntegratedUDFTestUtils.scala: private lazy val isPythonAvailable: Boolean = TestUtils.testCommandAvailable(pythonExec)
sql/core/src/test/scala/org/apache/spark/sql/IntegratedUDFTestUtils.scala: if (TestUtils.testCommandAvailable(pythonExec)) {
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveQuerySuite.scala: skip = !TestUtils.testCommandAvailable("/bin/bash"))
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveQuerySuite.scala: skip = !TestUtils.testCommandAvailable("/bin/bash"))
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveQuerySuite.scala: skip = !TestUtils.testCommandAvailable("/bin/bash"))
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveQuerySuite.scala: skip = !TestUtils.testCommandAvailable("/bin/bash"))
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveQuerySuite.scala: skip = !TestUtils.testCommandAvailable("/bin/bash"))
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveQuerySuite.scala: skip = !TestUtils.testCommandAvailable("/bin/bash"))
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveQuerySuite.scala: assume(TestUtils.testCommandAvailable("/bin/bash"))
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveQuerySuite.scala: skip = !TestUtils.testCommandAvailable("/bin/bash"))
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveQuerySuite.scala: skip = !TestUtils.testCommandAvailable("/bin/bash"))
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveScriptTransformationSuite.scala: assume(TestUtils.testCommandAvailable("/bin/bash"))
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveScriptTransformationSuite.scala: assume(TestUtils.testCommandAvailable("/bin/bash"))
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveScriptTransformationSuite.scala: assume(TestUtils.testCommandAvailable("/bin/bash"))
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveScriptTransformationSuite.scala: assume(TestUtils.testCommandAvailable("/bin/bash"))
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveScriptTransformationSuite.scala: assume(TestUtils.testCommandAvailable("/bin/bash"))
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveScriptTransformationSuite.scala: assume(TestUtils.testCommandAvailable("python"))
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveScriptTransformationSuite.scala: assume(TestUtils.testCommandAvailable("/bin/bash"))
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveScriptTransformationSuite.scala: assume(TestUtils.testCommandAvailable("/bin/bash"))
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala: assume(TestUtils.testCommandAvailable("/bin/bash"))
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala: assume(TestUtils.testCommandAvailable("echo | sed"))
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala: assume(TestUtils.testCommandAvailable("/bin/bash"))
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala: assume(TestUtils.testCommandAvailable("/bin/bash"))
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala: assume(TestUtils.testCommandAvailable("/bin/bash"))
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/SQLQuerySuite.scala: assume(TestUtils.testCommandAvailable("/bin/bash"))
```
### How was this patch tested?
- **Scala 2.12**: Pass the Jenkins with the existing tests and one modified test.
- **Scala 2.13**: Do the following manually. It should pass instead of `hang`.
```
$ dev/change-scala-version.sh 2.13
$ build/mvn test -pl core --am -Pscala-2.13 -Dtest=none -DwildcardSuites=org.apache.spark.rdd.PipedRDDSuite
...
Tests: succeeded 12, failed 0, canceled 0, ignored 0, pending 0
All tests passed.
```
Closes#29241 from dongjoon-hyun/SPARK-32443.
Lead-authored-by: HyukjinKwon <gurwls223@apache.org>
Co-authored-by: Dongjoon Hyun <dongjoon@apache.org>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
When using the `Seconds.toMicros` API to convert epoch seconds to microseconds, overflow is silently clamped, as documented in its Javadoc:
```scala
/**
* Equivalent to
* {@link #convert(long, TimeUnit) MICROSECONDS.convert(duration, this)}.
* @param duration the duration
* @return the converted duration,
* or {@code Long.MIN_VALUE} if conversion would negatively
* overflow, or {@code Long.MAX_VALUE} if it would positively overflow.
*/
```
This PR changes it to `Math.multiplyExact(epochSeconds, MICROS_PER_SECOND)`.
### Why are the changes needed?
Fix a silent data change between 3.x and 2.x:
```
~/Downloads/spark/spark-3.1.0-SNAPSHOT-bin-20200722 bin/spark-sql -S -e "select to_timestamp('300000', 'y');"
+294247-01-10 12:00:54.775807
```
```
kentyaohulk ~/Downloads/spark/spark-2.4.5-bin-hadoop2.7 bin/spark-sql -S -e "select to_timestamp('300000', 'y');"
284550-10-19 15:58:1010.448384
```
### Does this PR introduce _any_ user-facing change?
Yes, we will raise an `ArithmeticException` instead of giving the wrong answer on overflow.
### How was this patch tested?
add unit test
Closes#29220 from yaooqinn/SPARK-32424.
Authored-by: Kent Yao <yaooqinn@hotmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
`HashedRelation` has two separate code paths for unique-key lookup and non-unique-key lookup. E.g. in its subclass [`UnsafeHashedRelation`](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/HashedRelation.scala#L144-L177), unique-key lookup is more efficient as it does not have, e.g., the extra `Iterator[UnsafeRow].hasNext()/next()` overhead per row.
`BroadcastHashJoinExec` already handles unique keys vs non-unique keys separately in the [code-gen path](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/joins/BroadcastHashJoinExec.scala#L289-L321). But the non-codegen path for broadcast hash join and shuffled hash join does not separate them yet, so this PR adds the support there.
### Why are the changes needed?
Shuffled hash join and non-codegen broadcast hash join still rely on this code path for execution, so this PR will help save CPU when executing these two types of join. Adding codegen for shuffled hash join is a different topic and I will add it in https://issues.apache.org/jira/browse/SPARK-32421 .
Ran the same query as [`JoinBenchmark`](https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/execution/benchmark/JoinBenchmark.scala#L153-L167) with this feature enabled and disabled, and verified a 20% wall-clock time improvement (the control and test group order was also switched to verify that the improvement is not noise).
```
Running benchmark: shuffle hash join
Running case: shuffle hash join unique key SHJ off
Stopped after 5 iterations, 4039 ms
Running case: shuffle hash join unique key SHJ on
Stopped after 5 iterations, 2898 ms
Java HotSpot(TM) 64-Bit Server VM 1.8.0_181-b13 on Mac OS X 10.15.4
Intel(R) Core(TM) i9-9980HK CPU 2.40GHz
shuffle hash join: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
shuffle hash join unique key SHJ off 707 808 81 5.9 168.6 1.0X
shuffle hash join unique key SHJ on 547 580 50 7.7 130.4 1.3X
```
```
Running benchmark: shuffle hash join
Running case: shuffle hash join unique key SHJ on
Stopped after 5 iterations, 3333 ms
Running case: shuffle hash join unique key SHJ off
Stopped after 5 iterations, 4268 ms
Java HotSpot(TM) 64-Bit Server VM 1.8.0_181-b13 on Mac OS X 10.15.4
Intel(R) Core(TM) i9-9980HK CPU 2.40GHz
shuffle hash join: Best Time(ms) Avg Time(ms) Stdev(ms) Rate(M/s) Per Row(ns) Relative
------------------------------------------------------------------------------------------------------------------------
shuffle hash join unique key SHJ on 565 667 60 7.4 134.8 1.0X
shuffle hash join unique key SHJ off 774 854 85 5.4 184.4 0.7X
```
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
* Added test in `OuterJoinSuite` to cover left outer and right outer join.
* Added test in `ExistenceJoinSuite` to cover left semi join, and existence join.
* [Existing `joinSuite` already covered inner join.](https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/JoinSuite.scala#L182)
* [Existing `ExistenceJoinSuite` already covered left anti join, and existence join.](https://github.com/apache/spark/blob/master/sql/core/src/test/scala/org/apache/spark/sql/execution/joins/ExistenceJoinSuite.scala#L228)
Closes#29216 from c21/unique-key.
Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR is basically a followup of SPARK-26132 and SPARK-32434. You can't define an environment variable within an `if` block and use it within that block. See also https://superuser.com/questions/78496/variables-in-batch-file-not-being-set-when-inside-if
### Why are the changes needed?
For Windows users to use Spark and fix the build in AppVeyor.
### Does this PR introduce _any_ user-facing change?
No, it's only in unreleased branches.
### How was this patch tested?
Manually tested on a local Windows machine, and via the AppVeyor build at https://github.com/HyukjinKwon/spark/pull/13. See https://ci.appveyor.com/project/HyukjinKwon/spark/builds/34316409
Closes#29254 from HyukjinKwon/SPARK-32434.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Support setting off-heap memory in `ExecutorResourceRequests`.
### Why are the changes needed?
Support stage level scheduling
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Added UT in `ResourceProfileSuite` and `DAGSchedulerSuite`
Closes#28972 from warrenzhu25/30794.
Authored-by: Warren Zhu <zhonzh@microsoft.com>
Signed-off-by: Thomas Graves <tgraves@apache.org>
### What changes were proposed in this pull request?
This PR removes the manual port of `heapq3.py` introduced in SPARK-3073. The main reason for it was to support Python 2.6 and 2.7, because Python 2's `heapq.merge()` doesn't support `key` and `reverse`.
See
- https://docs.python.org/2/library/heapq.html#heapq.merge in Python 2
- https://docs.python.org/3.8/library/heapq.html#heapq.merge in Python 3
Since we dropped Python 2 in SPARK-32138, we can remove this.
### Why are the changes needed?
To remove unnecessary code. Also, we can leverage bug fixes made to `heapq` in Python 3.x.
### Does this PR introduce _any_ user-facing change?
No, dev-only.
### How was this patch tested?
Existing tests should cover. I locally ran and verified:
```bash
./python/run-tests --python-executable=python3 --testname="pyspark.tests.test_shuffle"
./python/run-tests --python-executable=python3 --testname="pyspark.shuffle ExternalSorter"
./python/run-tests --python-executable=python3 --testname="pyspark.tests.test_rdd RDDTests.test_external_group_by_key"
```
Closes#29229 from HyukjinKwon/SPARK-32435.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR proposes to redesign the PySpark documentation.
I made a demo site to make it easier to review: https://hyukjin-spark.readthedocs.io/en/stable/reference/index.html.
Here is the initial draft for the final PySpark docs shape: https://hyukjin-spark.readthedocs.io/en/latest/index.html.
In more details, this PR proposes:
1. Use [pydata_sphinx_theme](https://github.com/pandas-dev/pydata-sphinx-theme) theme - [pandas](https://pandas.pydata.org/docs/) and [Koalas](https://koalas.readthedocs.io/en/latest/) use this theme. The CSS overwrite is ported from Koalas. The colours in the CSS were actually chosen by designers to use in Spark.
2. Use the Sphinx option to separate `source` and `build` directories as the documentation pages will likely grow.
3. Port current API documentation into the new style. It mimics Koalas and pandas to use the theme most effectively.
One disadvantage of this approach is that you have to list APIs or classes explicitly; however, I think this isn't a big issue in PySpark since we're being conservative about adding APIs. I also intentionally listed classes only, instead of functions, in ML and MLlib to make it relatively easier to manage.
### Why are the changes needed?
Often I hear complaints from users that the current PySpark documentation is pretty messy to read - https://spark.apache.org/docs/latest/api/python/index.html - compared to other projects such as [pandas](https://pandas.pydata.org/docs/) and [Koalas](https://koalas.readthedocs.io/en/latest/).
It would be nicer if we made it more organised, instead of just listing all classes, methods and attributes, to make it easier to navigate.
Also, the documentation has been there from almost the very first version of PySpark. Maybe it's time to update it.
### Does this PR introduce _any_ user-facing change?
Yes, PySpark API documentation will be redesigned.
### How was this patch tested?
Manually tested, and the demo site was made to show.
Closes#29188 from HyukjinKwon/SPARK-32179.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
The PR https://github.com/apache/spark/pull/29045 added a helper method. This is a follow-up PR to update the description of that helper method.
### Why are the changes needed?
For better readability and understanding of the code
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
Since the only change is updating the description, I just ran the Spark shell.
Closes#29232 from SaurabhChawla100/SPARK-32234-Desc.
Authored-by: SaurabhChawla <s.saurabhtim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?
This PR proposes to set the minimum Arrow version as 1.0.0 to minimise the maintenance overhead and keep the minimal version up to date.
Other required changes to support 1.0.0 were already made in SPARK-32451.
### Why are the changes needed?
On the R side, people rather aggressively encourage using the latest version, and SparkR vectorization is very experimental (it was added in Spark 3.0).
Also, we're technically not testing old Arrow versions in SparkR for now.
### Does this PR introduce _any_ user-facing change?
Yes, users wouldn't be able to use SparkR with old Arrow.
### How was this patch tested?
GitHub Actions and AppVeyor are already testing them.
Closes#29253 from HyukjinKwon/SPARK-32452.
Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
Two different versions are used for the same artifacts, `exec-maven-plugin` and `scalatest-maven-plugin`. This PR aims to use the same versions for `exec-maven-plugin` and `scalatest-maven-plugin`. In addition, this PR removes `scala-maven-plugin.version` from `K8s` integration suite because it's unused.
### Why are the changes needed?
This will prevent the mistake of upgrading only one place and forgetting the others.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass the Jenkins K8S IT.
Closes#29248 from dongjoon-hyun/SPARK-32448.
Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This patch upgrades pycodestyle from v2.4.0 to v2.6.0. The changes at each release:
2.5.0: https://pycodestyle.pycqa.org/en/latest/developer.html#id3
2.6.0a1: https://pycodestyle.pycqa.org/en/latest/developer.html#a1-2020-04-23
2.6.0: https://pycodestyle.pycqa.org/en/latest/developer.html#id2
Changes: Dropped Python 2.6 and 3.3 support, added Python 3.7 and 3.8 support...
### Why are the changes needed?
Including bug fixes and newer Python version support.
### Does this PR introduce _any_ user-facing change?
No, dev only.
### How was this patch tested?
Ran `dev/lint-python` locally.
Closes#29249 from viirya/upgrade-pycodestyle.
Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
### What changes were proposed in this pull request?
This PR aims to use `python3` instead of `python` inside `bin/pyspark`, `bin/find-spark-home` and `bin/find-spark-home.cmd` script.
```
$ git diff master --stat
bin/find-spark-home | 4 ++--
bin/find-spark-home.cmd | 4 ++--
bin/pyspark | 4 ++--
```
### Why are the changes needed?
According to [PEP 394](https://www.python.org/dev/peps/pep-0394/), we have four different cases for `python`, while `python3` will always be there.
```
- Distributors may choose to set the behavior of the python command as follows:
python2,
python3,
not provide python command,
allow python to be configurable by an end user or a system administrator.
```
Moreover, these scripts already depend on `find_spark_home.py` which is using `#!/usr/bin/env python3`.
```
FIND_SPARK_HOME_PYTHON_SCRIPT="$(cd "$(dirname "$0")"; pwd)/find_spark_home.py"
```
### Does this PR introduce _any_ user-facing change?
No. Apache Spark 3.1 already drops Python 2.7 via SPARK-32138 .
### How was this patch tested?
Pass the Jenkins or GitHub Action.
Closes#29246 from dongjoon-hyun/SPARK-FIND-SPARK-HOME.
Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
…istently print the metrics on driver's stdout
### What changes were proposed in this pull request?
Call collect on the RDD before calling foreach, so that the result is sent to the driver node and printed on that node's stdout.
### Why are the changes needed?
Some RDDs in this example (e.g., precision, recall) call println without calling collect.
If the job runs in local mode, the data is sent to the driver node and the metrics are printed on the driver's stdout.
However, if the job runs in cluster mode, the metrics are printed on the executor's stdout.
This is inconsistent with the other metrics that have nothing to do with RDDs (e.g., auPRC, auROC), since those metrics always output their results on the driver's stdout.
All of the metrics should output their results on the driver's stdout.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
This is example code. It doesn't have any tests.
Closes#29222 from titsuki/SPARK-32428.
Authored-by: Itsuki Toyota <titsuki@cpan.org>
Signed-off-by: Sean Owen <srowen@gmail.com>
### What changes were proposed in this pull request?
This PR aims to upgrade `json4s` from 3.6.6 to 3.7.0-M5 for Scala 2.13 support in Apache Spark 3.1.0 in December. We will upgrade to the latest `json4s` around November.
### Why are the changes needed?
`json4s` starts to support Scala 2.13 since v3.7.0-M4.
- https://github.com/json4s/json4s/issues/660
- b013af8e75
Old `json4s` causes many UT failures with `NoSuchMethodException`.
```scala
Cause: java.lang.NoSuchMethodException: scala.collection.immutable.Seq$.apply(scala.collection.Seq)
at java.lang.Class.getMethod(Class.java:1786)
```
The following is one example.
```scala
$ dev/change-scala-version.sh 2.13
$ build/mvn test -pl core --am -Pscala-2.13 -Dtest=none -DwildcardSuites=org.apache.spark.executor.CoarseGrainedExecutorBackendSuite
...
Tests: succeeded 4, failed 9, canceled 0, ignored 0, pending 0
*** 9 TESTS FAILED ***
```
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
1. **Scala 2.12**: Pass the Jenkins or GitHub Action with the existing tests.
2. **Scala 2.13**: Do the following manually at least.
```scala
$ dev/change-scala-version.sh 2.13
$ build/mvn test -pl core --am -Pscala-2.13 -Dtest=none -DwildcardSuites=org.apache.spark.executor.CoarseGrainedExecutorBackendSuite
...
Tests: succeeded 13, failed 0, canceled 0, ignored 0, pending 0
All tests passed.
```
Closes#29239 from dongjoon-hyun/SPARK-32441.
Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This PR aims to increase the memory parameter in `BlockManagerSuite`'s worker decommission test cases.
### Why are the changes needed?
Scala 2.13 generates different Java objects, and this affects Spark's `SizeEstimator/SizeTracker/SizeTrackingVector`. This causes UT failures like the following. If we decrease the values, those test cases fail in Scala 2.12, too.
```
$ dev/change-scala-version.sh 2.13
$ build/mvn test -pl core --am -Pscala-2.13 -Dtest=none -DwildcardSuites=org.apache.spark.storage.BlockManagerSuite
...
- test decommission block manager should not be part of peers *** FAILED ***
0 did not equal 2 (BlockManagerSuite.scala:1869)
- test decommissionRddCacheBlocks should offload all cached blocks *** FAILED ***
0 did not equal 2 (BlockManagerSuite.scala:1884)
...
Tests: succeeded 81, failed 2, canceled 0, ignored 0, pending 0
*** 2 TESTS FAILED ***
```
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
```
$ dev/change-scala-version.sh 2.13
$ build/mvn test -pl core --am -Pscala-2.13 -Dtest=none -DwildcardSuites=org.apache.spark.storage.BlockManagerSuite
...
Tests: succeeded 83, failed 0, canceled 0, ignored 0, pending 0
All tests passed.
```
Closes#29238 from dongjoon-hyun/SPARK-32440.
Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
Since Scala 2.13, `HashMap` is slated to become final in the future, and `.withDefault` is recommended. This PR aims to use `HashMap.withDefaultValue` instead of overriding the default manually in the test case.
- https://www.scala-lang.org/api/current/scala/collection/mutable/HashMap.html
```scala
deprecatedInheritance(message =
"HashMap wil be made final; use .withDefault for the common use case of computing a default value",
since = "2.13.0")
```
### Why are the changes needed?
In Scala 2.13, the existing code causes a failure because the default value function doesn't work correctly.
```
$ dev/change-scala-version.sh 2.13
$ build/mvn test -pl core --am -Pscala-2.13 -Dtest=none -DwildcardSuites=org.apache.spark.rdd.RDDSuite
- aggregate *** FAILED ***
org.apache.spark.SparkException: Job aborted due to stage failure:
Task 0 in stage 61.0 failed 1 times, most recent failure: Lost task 0.0 in stage 61.0 (TID 198, localhost, executor driver):
java.util.NoSuchElementException: key not found: a
```
### Does this PR introduce _any_ user-facing change?
No. This is a test case change.
### How was this patch tested?
1. **Scala 2.12:** Pass the Jenkins or GitHub with the existing tests.
2. **Scala 2.13**: Manually do the following.
```
$ dev/change-scala-version.sh 2.13
$ build/mvn test -pl core --am -Pscala-2.13 -Dtest=none -DwildcardSuites=org.apache.spark.rdd.RDDSuite
...
Tests: succeeded 72, failed 0, canceled 0, ignored 0, pending 0
All tests passed.
```
Closes#29235 from dongjoon-hyun/SPARK-32438.
Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This PR aims to initialize `numNonEmptyBlocks` in `HighlyCompressedMapStatus.readExternal`.
In Scala 2.12, this is initialized to `-1` via the following.
```scala
protected def this() = this(null, -1, null, -1, null, -1) // For deserialization only
```
### Why are the changes needed?
In Scala 2.13, this causes several UT failures because `HighlyCompressedMapStatus.readExternal` doesn't initialize this field. The following is one example.
- org.apache.spark.scheduler.MapStatusSuite
```
MapStatusSuite:
- compressSize
- decompressSize
*** RUN ABORTED ***
java.lang.NoSuchFieldError: numNonEmptyBlocks
at org.apache.spark.scheduler.HighlyCompressedMapStatus.<init>(MapStatus.scala:181)
at org.apache.spark.scheduler.HighlyCompressedMapStatus$.apply(MapStatus.scala:281)
at org.apache.spark.scheduler.MapStatus$.apply(MapStatus.scala:73)
at org.apache.spark.scheduler.MapStatusSuite.$anonfun$new$8(MapStatusSuite.scala:64)
at scala.runtime.java8.JFunction1$mcVD$sp.apply(JFunction1$mcVD$sp.scala:18)
at scala.collection.immutable.List.foreach(List.scala:333)
at org.apache.spark.scheduler.MapStatusSuite.$anonfun$new$7(MapStatusSuite.scala:61)
at scala.runtime.java8.JFunction1$mcVJ$sp.apply(JFunction1$mcVJ$sp.scala:18)
at scala.collection.immutable.List.foreach(List.scala:333)
at org.apache.spark.scheduler.MapStatusSuite.$anonfun$new$6(MapStatusSuite.scala:60)
...
```
### Does this PR introduce _any_ user-facing change?
No. This is a private class.
### How was this patch tested?
1. Pass the GitHub Action or Jenkins with the existing tests.
2. Test with Scala-2.13 with `MapStatusSuite`.
```
$ dev/change-scala-version.sh 2.13
$ build/mvn test -pl core --am -Pscala-2.13 -Dtest=none -DwildcardSuites=org.apache.spark.scheduler.MapStatusSuite
...
MapStatusSuite:
- compressSize
- decompressSize
- MapStatus should never report non-empty blocks' sizes as 0
- large tasks should use org.apache.spark.scheduler.HighlyCompressedMapStatus
- HighlyCompressedMapStatus: estimated size should be the average non-empty block size
- SPARK-22540: ensure HighlyCompressedMapStatus calculates correct avgSize
- RoaringBitmap: runOptimize succeeded
- RoaringBitmap: runOptimize failed
- Blocks which are bigger than SHUFFLE_ACCURATE_BLOCK_THRESHOLD should not be underestimated.
- SPARK-21133 HighlyCompressedMapStatus#writeExternal throws NPE
Run completed in 7 seconds, 971 milliseconds.
Total number of tests run: 10
Suites: completed 2, aborted 0
Tests: succeeded 10, failed 0, canceled 0, ignored 0, pending 0
All tests passed.
```
Closes#29231 from dongjoon-hyun/SPARK-32436.
Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
This PR aims to speed up `MapStatus` deserialization by 5~18% with the latest RoaringBitmap `0.9.0` and new APIs. Note that we focus on `deserialization` time because `serialization` occurs once while `deserialization` occurs many times.
### Why are the changes needed?
The current version is too old. We had better upgrade it to get the performance improvement and bug fixes.
Although `MapStatusesSerDeserBenchmark` is synthetic, the benchmark result is updated with this patch.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Pass the Jenkins or GitHub Action.
Closes#29233 from dongjoon-hyun/SPARK-ROAR.
Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
### What changes were proposed in this pull request?
When submitting SQL with variables, the SQL displayed by the UI does not have the variables substituted.
### Why are the changes needed?
To see the final executed SQL in the UI.
### Does this PR introduce _any_ user-facing change?
No
### How was this patch tested?
manual test
Closes#29221 from cxzl25/SPARK-32426.
Authored-by: sychen <sychen@ctrip.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>