Commit graph

31284 commits

Author SHA1 Message Date
Angerszhuuuu e356f6aa11 [SPARK-36741][SQL] ArrayDistinct should handle duplicated Double.NaN and Float.NaN
### What changes were proposed in this pull request?
For query
```
select array_distinct(array(cast('nan' as double), cast('nan' as double)))
```
This returns [NaN, NaN], but it should return [NaN].
The issue is caused by the fact that `OpenHashSet` can't handle `Double.NaN` and `Float.NaN` either.
This PR fixes the issue based on https://github.com/apache/spark/pull/33955

### Why are the changes needed?
Fix a bug.

### Does this PR introduce _any_ user-facing change?
`ArrayDistinct` will no longer return duplicated `NaN` values.

### How was this patch tested?
Added UT

Closes #33993 from AngersZhuuuu/SPARK-36741.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-09-17 20:48:17 +08:00
Leona Yoda 1312a87365 [SPARK-36778][SQL] Support ILIKE API on Scala(dataframe)
### What changes were proposed in this pull request?

Support ILIKE (case insensitive LIKE) API on Scala.

### Why are the changes needed?

The ILIKE statement on the SQL interface is supported by SPARK-36674.
This PR adds the Scala (DataFrame) API for it.

### Does this PR introduce _any_ user-facing change?

Yes. Users can call `ilike` from the DataFrame API, as sketched below.
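For illustration, a minimal sketch of calling the new API from the Scala DataFrame DSL; the sample data and pattern are hypothetical, and this assumes `ilike` is exposed on `Column` analogously to `like`:

```scala
import spark.implicits._

// Hypothetical data; `ilike` is assumed to behave like `like` but case-insensitively.
val df = Seq("Jane Doe", "JANE DOE", "John Smith").toDF("subject")
df.filter($"subject".ilike("jane%")).show()
// Both "Jane Doe" and "JANE DOE" match because the comparison ignores case.
```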

### How was this patch tested?

unit tests.

Closes #34027 from yoda-mon/scala-ilike.

Authored-by: Leona Yoda <yodal@oss.nttdata.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
2021-09-17 14:37:10 +03:00
Wenchen Fan 4145498826 [SPARK-36789][SQL] Use the correct constant type as the null value holder in array functions
### What changes were proposed in this pull request?

In array functions, we use constant 0 as the placeholder when adding a null value to an array buffer. This PR makes sure the constant 0 matches the type of the array element.

### Why are the changes needed?

Fix a potential bug. Somehow we can hit this bug sometimes after https://github.com/apache/spark/pull/33955 .

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

existing tests

Closes #34029 from cloud-fan/minor.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-09-17 16:49:54 +09:00
Cheng Su 4a34db9a17 [SPARK-32709][SQL] Support writing Hive bucketed table (Parquet/ORC format with Hive hash)
### What changes were proposed in this pull request?

This is a re-work of https://github.com/apache/spark/pull/30003. Here we add support for writing Hive bucketed tables with the Parquet/ORC file format (data source v1 write path and Hive hash as the hash function). Support for Hive's other file formats will be added in a follow-up PR.

The changes are mostly on:

* `HiveMetastoreCatalog.scala`: When converting hive table relation to data source relation, pass bucket info (BucketSpec) and other hive related info as options into `HadoopFsRelation` and `LogicalRelation`, which can be later accessed by `FileFormatWriter` to customize bucket id and file name.

* `FileFormatWriter.scala`: Use `HiveHash` for `bucketIdExpression` if it's writing to a Hive bucketed table. In addition, the Spark output file name should follow the Hive/Presto/Trino bucketed file naming convention. We introduce another parameter, `bucketFileNamePrefix`, which leads to a subsequent change in `FileFormatDataWriter`.

* `HadoopMapReduceCommitProtocol`: Implement the new file name APIs introduced in https://github.com/apache/spark/pull/33012, and change its sub-class `PathOutputCommitProtocol`, to make Hive bucketed table writing work with all commit protocols (including the S3A commit protocol).

### Why are the changes needed?

To make Spark write bucketed tables that are compatible with other SQL engines. Currently a Spark bucketed table cannot be leveraged by other SQL engines like Hive and Presto, because it uses a different hash function (Spark murmur3hash) and a different file name scheme. With this PR, a Spark-written Hive bucketed table can be efficiently read by Presto and Hive to do bucket filter pruning, join, group-by, etc. This has been blocking several companies (confirmed with Facebook, Lyft, etc.) from migrating bucketing workloads from Hive to Spark.

### Does this PR introduce _any_ user-facing change?

Yes. Any Hive bucketed table (with Parquet/ORC format) written by Spark is properly bucketed and can be efficiently processed by Hive and Presto/Trino, as sketched below.
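For illustration, a hedged sketch of the kind of write this enables (Hive support must be enabled; the table name and data are hypothetical):

```scala
// Create a Hive bucketed table using Hive DDL (Parquet format, 8 buckets).
spark.sql("""
  CREATE TABLE hive_bucketed_demo (key INT, value STRING)
  CLUSTERED BY (key) INTO 8 BUCKETS
  STORED AS PARQUET
""")

// After this PR, the written files use HiveHash for the bucket id and follow the
// Hive/Presto/Trino bucket file naming convention, so those engines can apply
// bucket pruning when reading the table.
spark.sql(
  "INSERT INTO hive_bucketed_demo SELECT id AS key, CAST(id AS STRING) AS value FROM range(100)")
```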

### How was this patch tested?

* Added a unit test in BucketedWriteWithHiveSupportSuite.scala, to verify that bucket file names and each row in each bucket are written properly.
* Tested by Lyft Spark team (Shashank Pedamallu) to read Spark-written bucketed table from Trino, Spark and Hive.

Closes #33432 from c21/hive-bucket-v1.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-09-17 14:28:51 +08:00
Hyukjin Kwon 917d7dad4d [SPARK-36788][SQL] Change log level of AQE for non-supported plans from warning to debug
### What changes were proposed in this pull request?

This PR suppresses the warnings for plans where AQE is not supported. Currently we show warnings such as:

```
org.apache.spark.sql.execution.adaptive.InsertAdaptiveSparkPlan: spark.sql.adaptive.enabled is enabled but is not supported for query: Sort [a#324881 DESC NULLS FIRST], true, 23
+- Scan ExistingRDD[a#324881]
```

for every plan for which AQE is not supported.

### Why are the changes needed?

It's too noisy now. Below is an example from a `SortSuite` run:

```
14:51:40.675 WARN org.apache.spark.sql.execution.adaptive.InsertAdaptiveSparkPlan: spark.sql.adaptive.enabled is enabled but is not supported for query: Sort [a#324881 DESC NULLS FIRST], true, 23
+- Scan ExistingRDD[a#324881]
.
[info] - sorting on DayTimeIntervalType(0,1) with nullable=true, sortOrder=List('a DESC NULLS FIRST) (785 milliseconds)
14:51:41.416 WARN org.apache.spark.sql.execution.adaptive.InsertAdaptiveSparkPlan: spark.sql.adaptive.enabled is enabled but is not supported for query: ReferenceSort [a#324884 ASC NULLS FIRST], true
+- Scan ExistingRDD[a#324884]
.
14:51:41.467 WARN org.apache.spark.sql.execution.adaptive.InsertAdaptiveSparkPlan: spark.sql.adaptive.enabled is enabled but is not supported for query: Sort [a#324884 ASC NULLS FIRST], true, 23
+- Scan ExistingRDD[a#324884]
.
[info] - sorting on DayTimeIntervalType(0,1) with nullable=false, sortOrder=List('a ASC NULLS FIRST) (796 milliseconds)
14:51:42.210 WARN org.apache.spark.sql.execution.adaptive.InsertAdaptiveSparkPlan: spark.sql.adaptive.enabled is enabled but is not supported for query: ReferenceSort [a#324887 ASC NULLS LAST], true
+- Scan ExistingRDD[a#324887]
.
14:51:42.259 WARN org.apache.spark.sql.execution.adaptive.InsertAdaptiveSparkPlan: spark.sql.adaptive.enabled is enabled but is not supported for query: Sort [a#324887 ASC NULLS LAST], true, 23
+- Scan ExistingRDD[a#324887]
.
[info] - sorting on DayTimeIntervalType(0,1) with nullable=false, sortOrder=List('a ASC NULLS LAST) (797 milliseconds)
14:51:43.009 WARN org.apache.spark.sql.execution.adaptive.InsertAdaptiveSparkPlan: spark.sql.adaptive.enabled is enabled but is not supported for query: ReferenceSort [a#324890 DESC NULLS LAST], true
+- Scan ExistingRDD[a#324890]
.
14:51:43.061 WARN org.apache.spark.sql.execution.adaptive.InsertAdaptiveSparkPlan: spark.sql.adaptive.enabled is enabled but is not supported for query: Sort [a#324890 DESC NULLS LAST], true, 23
+- Scan ExistingRDD[a#324890]
.
[info] - sorting on DayTimeIntervalType(0,1) with nullable=false, sortOrder=List('a DESC NULLS LAST) (848 milliseconds)
14:51:43.857 WARN org.apache.spark.sql.execution.adaptive.InsertAdaptiveSparkPlan: spark.sql.adaptive.enabled is enabled but is not supported for query: ReferenceSort [a#324893 DESC NULLS FIRST], true
+- Scan ExistingRDD[a#324893]
.
14:51:43.903 WARN org.apache.spark.sql.execution.adaptive.InsertAdaptiveSparkPlan: spark.sql.adaptive.enabled is enabled but is not supported for query: Sort [a#324893 DESC NULLS FIRST], true, 23
+- Scan ExistingRDD[a#324893]
.
[info] - sorting on DayTimeIntervalType(0,1) with nullable=false, sortOrder=List('a DESC NULLS FIRST) (827 milliseconds)
14:51:44.682 WARN org.apache.spark.sql.execution.adaptive.InsertAdaptiveSparkPlan: spark.sql.adaptive.enabled is enabled but is not supported for query: ReferenceSort [a#324896 ASC NULLS FIRST], true
+- Scan ExistingRDD[a#324896]
.
14:51:44.748 WARN org.apache.spark.sql.execution.adaptive.InsertAdaptiveSparkPlan: spark.sql.adaptive.enabled is enabled but is not supported for query: Sort [a#324896 ASC NULLS FIRST], true, 23
+- Scan ExistingRDD[a#324896]
.
[info] - sorting on YearMonthIntervalType(0,1) with nullable=true, sortOrder=List('a ASC NULLS FIRST) (565 milliseconds)
14:51:45.248 WARN org.apache.spark.sql.execution.adaptive.InsertAdaptiveSparkPlan: spark.sql.adaptive.enabled is enabled but is not supported for query: ReferenceSort [a#324899 ASC NULLS LAST], true
+- Scan ExistingRDD[a#324899]
.
14:51:45.312 WARN org.apache.spark.sql.execution.adaptive.InsertAdaptiveSparkPlan: spark.sql.adaptive.enabled is enabled but is not supported for query: Sort [a#324899 ASC NULLS LAST], true, 23
+- Scan ExistingRDD[a#324899]
.
[info] - sorting on YearMonthIntervalType(0,1) with nullable=true, sortOrder=List('a ASC NULLS LAST) (591 milliseconds)
14:51:45.841 WARN org.apache.spark.sql.execution.adaptive.InsertAdaptiveSparkPlan: spark.sql.adaptive.enabled is enabled but is not supported for query: ReferenceSort [a#324902 DESC NULLS LAST], true
+- Scan ExistingRDD[a#324902]
.
14:51:45.905 WARN org.apache.spark.sql.execution.adaptive.InsertAdaptiveSparkPlan: spark.sql.adaptive.enabled is enabled but is not supported for query: Sort [a#324902 DESC NULLS LAST], true, 23
+- Scan ExistingRDD[a#324902]
.
```

### Does this PR introduce _any_ user-facing change?

Yes, it will show fewer warnings to users. Note that AQE is enabled by default from Spark 3.2; see SPARK-33679.

### How was this patch tested?

Manually tested via unit tests.

Closes #34026 from HyukjinKwon/minor-log-level.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-09-17 12:01:43 +09:00
Wenchen Fan dfd5237c0c [SPARK-36783][SQL] ScanOperation should not push Filter through nondeterministic Project
### What changes were proposed in this pull request?

`ScanOperation` collects adjacent Projects and Filters. The caller side always assumes that the collected Filters should run before the collected Projects, which means `ScanOperation` effectively pushes Filter through Project.

Following `PushPredicateThroughNonJoin`, we should not push Filter through nondeterministic Project. This PR fixes `ScanOperation` to follow this rule.
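A hedged illustration of the semantics being protected, using a hypothetical query:

```scala
import org.apache.spark.sql.functions.rand
import spark.implicits._

// rand() is nondeterministic. The filter below must be evaluated on the values
// produced by the projection; if it were collected to run before the projection,
// rand() would effectively be evaluated again and the surviving rows would no
// longer correspond to the `r` values returned.
val df = spark.range(100).select(rand().as("r")).filter($"r" > 0.5)
```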

### Why are the changes needed?

Fix a bug that violates the semantic of nondeterministic expressions.

### Does this PR introduce _any_ user-facing change?

Most likely no change, but in some cases, this is a correctness bug fix which changes the query result.

### How was this patch tested?

existing tests

Closes #34023 from cloud-fan/scan.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-09-17 10:51:15 +08:00
BelodengKlaus 3712502de4 [SPARK-36773][SQL][TEST] Fixed unit test to check the compression for parquet
### What changes were proposed in this pull request?
Change the unit test for parquet compression

### Why are the changes needed?
To check the compression for parquet

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
change unit test

Closes #34012 from BelodengKlaus/spark36773.

Authored-by: BelodengKlaus <jp.xiong520@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-09-17 11:25:09 +09:00
dgd-contributor 8f895e9e96 [SPARK-36779][PYTHON] Fix when list of data type tuples has len = 1
### What changes were proposed in this pull request?

Fix the case when the list of data type tuples has length 1.

### Why are the changes needed?
Fix the case when the list of data type tuples has length 1.

``` python
>>> ps.DataFrame[("a", int), [int]]
typing.Tuple[pyspark.pandas.typedef.typehints.IndexNameType, int]

>>> ps.DataFrame[("a", int), [("b", int)]]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/dgd/spark/python/pyspark/pandas/frame.py", line 11998, in __class_getitem__
    return create_tuple_for_frame_type(params)
  File "/Users/dgd/spark/python/pyspark/pandas/typedef/typehints.py", line 685, in create_tuple_for_frame_type
    return Tuple[extract_types(params)]
  File "/Users/dgd/spark/python/pyspark/pandas/typedef/typehints.py", line 755, in extract_types
    return (index_type,) + extract_types(data_types)
  File "/Users/dgd/spark/python/pyspark/pandas/typedef/typehints.py", line 770, in extract_types
    raise TypeError(
TypeError: Type hints should be specified as one of:
  - DataFrame[type, type, ...]
  - DataFrame[name: type, name: type, ...]
  - DataFrame[dtypes instance]
  - DataFrame[zip(names, types)]
  - DataFrame[index_type, [type, ...]]
  - DataFrame[(index_name, index_type), [(name, type), ...]]
  - DataFrame[dtype instance, dtypes instance]
  - DataFrame[(index_name, index_type), zip(names, types)]
However, got [('b', <class 'int'>)].
```

### Does this PR introduce _any_ user-facing change?

After:
``` python
>>> ps.DataFrame[("a", int), [("b", int)]]
typing.Tuple[pyspark.pandas.typedef.typehints.IndexNameType, pyspark.pandas.typedef.typehints.NameType]

```

### How was this patch tested?
existing tests

Closes #34019 from dgd-contributor/fix_when_list_of_tuple_data_type_have_len=1.

Authored-by: dgd-contributor <dgd_contributor@viettel.com.vn>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-09-17 09:27:46 +09:00
Josh Rosen 3ae6e6775b [SPARK-36774][CORE][TESTS] Move SparkSubmitTestUtils to core module and use it in SparkSubmitSuite
### What changes were proposed in this pull request?

This PR refactors test code in order to improve the debugability of `SparkSubmitSuite`.

The `sql/hive` module contains a `SparkSubmitTestUtils` helper class which launches `spark-submit` and captures its output in order to display better error messages when tests fail. This helper is currently used by `HiveSparkSubmitSuite` and `HiveExternalCatalogVersionsSuite`, but isn't used by `SparkSubmitSuite`.

In this PR, I moved `SparkSubmitTestUtils` and `ProcessTestUtils` into the `core` module and updated `SparkSubmitSuite`, `BufferHolderSparkSubmitSuite`, and `WholestageCodegenSparkSubmitSuite` to use the relocated helper classes. This required me to change `SparkSubmitTestUtils` to make its timeouts configurable and to generalize its method for locating the `spark-submit` binary.

### Why are the changes needed?

Previously, `SparkSubmitSuite` tests would fail with messages like:

```
[info] - launch simple application with spark-submit *** FAILED *** (1 second, 832 milliseconds)
[info]   Process returned with exit code 101. See the log4j logs for more detail. (SparkSubmitSuite.scala:1551)
[info]   org.scalatest.exceptions.TestFailedException:
[info]   at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
```

which require the Spark developer to hunt in log4j logs in order to view the logs from the failed `spark-submit` command.

After this change, those tests will fail with detailed error messages that include the text of the failed command plus timestamped logs captured from the failed process:

```
[info] - launch simple application with spark-submit *** FAILED *** (2 seconds, 800 milliseconds)
[info]   spark-submit returned with exit code 101.
[info]   Command line: '/Users/joshrosen/oss-spark/bin/spark-submit' '--class' 'invalidClassName' '--name' 'testApp' '--master' 'local' '--conf' 'spark.ui.enabled=false' '--conf' 'spark.master.rest.enabled=false' 'file:/Users/joshrosen/oss-spark/target/tmp/spark-0a8a0c93-3aaf-435d-9cf3-b97abd318d91/testJar-1631768004882.jar'
[info]
[info]   2021-09-15 21:53:26.041 - stderr> SLF4J: Class path contains multiple SLF4J bindings.
[info]   2021-09-15 21:53:26.042 - stderr> SLF4J: Found binding in [jar:file:/Users/joshrosen/oss-spark/assembly/target/scala-2.12/jars/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
[info]   2021-09-15 21:53:26.042 - stderr> SLF4J: Found binding in [jar:file:/Users/joshrosen/.m2/repository/org/slf4j/slf4j-log4j12/1.7.30/slf4j-log4j12-1.7.30.jar!/org/slf4j/impl/StaticLoggerBinder.class]
[info]   2021-09-15 21:53:26.042 - stderr> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
[info]   2021-09-15 21:53:26.042 - stderr> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
[info]   2021-09-15 21:53:26.619 - stderr> Error: Failed to load class invalidClassName. (SparkSubmitTestUtils.scala:97)
[info]   org.scalatest.exceptions.TestFailedException:
[info]   at org.scalatest.Assertions.newAssertionFailedException(Assertions.scala:472)
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

I manually ran the affected test suites.

Closes #34013 from JoshRosen/SPARK-36774-move-SparkSubmitTestUtils-to-core.

Authored-by: Josh Rosen <joshrosen@databricks.com>
Signed-off-by: Josh Rosen <joshrosen@databricks.com>
2021-09-16 14:28:47 -07:00
Liang-Chi Hsieh f1f2ec3704 [SPARK-36735][SQL][FOLLOWUP] Fix indentation of DynamicPartitionPruningSuite
### What changes were proposed in this pull request?

As a follow-up of #33975, this fixes a few indentation issues in DynamicPartitionPruningSuite.

### Why are the changes needed?

Fix wrong indentation.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing tests.

Closes #34016 from viirya/fix-style.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-09-16 08:30:00 -07:00
Dongjoon Hyun adbea252db [SPARK-36759][BUILD][FOLLOWUP] Update version in scala-2.12 profile and doc
### What changes were proposed in this pull request?

This is a follow-up to fix the leftover during switching the Scala version.

### Why are the changes needed?

This should be consistent.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

This is not tested by UT; we need to check manually. There are no more occurrences of `2.12.14` (the grep hits below are regex false positives in unrelated files).
```
$ git grep 2.12.14
R/pkg/tests/fulltests/test_sparkSQL.R:               c(as.Date("2012-12-14"), as.Date("2013-12-15"), as.Date("2014-12-16")))
data/mllib/ridge-data/lpsa.data:3.5307626,0.987291634724086 -0.36279314978779 -0.922212414640967 0.232904453212813 -0.522940888712441 1.79270085261407 0.342627053981254 1.26288870310799
sql/hive/src/test/resources/data/files/over10k:-3|454|65705|4294967468|62.12|14.32|true|mike white|2013-03-01 09:11:58.703087|40.18|joggying
```

Closes #34020 from dongjoon-hyun/SPARK-36759-2.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-09-16 05:10:54 -07:00
Huaxin Gao fb11c466ae [SPARK-36587][SQL][FOLLOWUP] Remove unused CreateNamespaceStatement
### What changes were proposed in this pull request?
remove `CreateNamespaceStatement`

### Why are the changes needed?
remove unused code

### Does this PR introduce _any_ user-facing change?
no

### How was this patch tested?
existing tests

Closes #34015 from huaxingao/rm_create_ns_stmt.

Authored-by: Huaxin Gao <huaxin_gao@apple.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-09-16 19:56:45 +08:00
Kousuke Saruta 89a9456b13 [SPARK-36777][INFRA] Move Java 17 on GitHub Actions from EA to LTS release
### What changes were proposed in this pull request?

This PR aims to move Java 17 on GA from early access release to LTS release.

### Why are the changes needed?

Java 17 LTS was released a few days ago and it's available on GA.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

GA itself.

Closes #34017 from sarutak/ga-java17.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Yuming Wang <yumwang@ebay.com>
2021-09-16 18:04:35 +08:00
Thejdeep Gudivada 23f4a650ea [SPARK-36433][WEBUI] Fix log message in WebUI
### What changes were proposed in this pull request?

This fixes the info log message output when starting a WebUI server

### Why are the changes needed?
This is needed so the user can find the correct URL of the started service.
```
21/08/05 14:33:30 INFO HistoryServer: Bound HistoryServer to 0.0.0.0, and started at http://tgudivad-mn1.test.biz:18080
```

### Does this PR introduce _any_ user-facing change?
Yes, fixes the URL displayed in the logs when starting the service.

### How was this patch tested?
Tested by running an instance of HistoryServer

Closes #33659 from thejdeep/SPARK-36433.

Authored-by: Thejdeep Gudivada <tgudivada@linkedin.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-09-16 17:26:56 +08:00
Gengliang Wang ff7705ad2a [SPARK-36775][DOCS] Add documentation for ANSI store assignment rules
### What changes were proposed in this pull request?

Add documentation for the ANSI store assignment rules, covering:
- the valid source/target type combinations
- the runtime error raised on numeric overflow (a short sketch follows)
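A minimal sketch of the overflow case being documented, assuming the ANSI store assignment policy is enabled; the table name is hypothetical:

```scala
spark.conf.set("spark.sql.storeAssignmentPolicy", "ANSI")
spark.sql("CREATE TABLE ansi_store_demo (id TINYINT) USING parquet")

// INT -> TINYINT is a valid ANSI store assignment, but 1000 overflows TINYINT,
// so under the ANSI policy this insert fails at runtime instead of writing a
// silently wrapped or NULL value.
spark.sql("INSERT INTO ansi_store_demo VALUES (1000)")
```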

### Why are the changes needed?

Better docs.
### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Build docs and preview:
![image](https://user-images.githubusercontent.com/1097932/133554600-8c80c0a9-8753-4c01-94d0-994d8082e319.png)

Closes #34014 from gengliangwang/addStoreAssignDoc.

Authored-by: Gengliang Wang <gengliang@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-09-16 15:50:40 +08:00
Dongjoon Hyun c217797297 [SPARK-36732][SQL][BUILD] Upgrade ORC to 1.6.11
### What changes were proposed in this pull request?

This PR aims to upgrade Apache ORC to 1.6.11 to bring the latest bug fixes.

### Why are the changes needed?

Apache ORC 1.6.11 has the following fixes.
- https://issues.apache.org/jira/projects/ORC/versions/12350499

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

Closes #33971 from dongjoon-hyun/SPARK-36732.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-09-15 23:36:26 -07:00
Yannis Sismanis afd406e4d0 [SPARK-36745][SQL] ExtractEquiJoinKeys should return the original predicates on join keys
### What changes were proposed in this pull request?

This PR updates `ExtractEquiJoinKeys` to return an extra field for the join condition with join keys.

### Why are the changes needed?

Sometimes we need to restore the original join condition. Before this PR, we need to build `EqualTo` expressions with the join keys, which is not always the original join condition. E.g. `EqualNullSafe(a, b)` will become `EqualTo(Coalesce(a, lit), Coalesce(b, lit))`. After this PR, we can simply use the new returned field.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing tests.

Closes #33985 from YannisSismanis/SPARK-36475-fix.

Authored-by: Yannis Sismanis <yannis.sismanis@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-09-16 13:16:16 +08:00
Liang-Chi Hsieh bbb33af2e4 [SPARK-36735][SQL] Adjust overhead of cached relation for DPP
### What changes were proposed in this pull request?

This patch proposes to adjust the current overhead of cached relation for DPP.

### Why are the changes needed?

Currently we calculate whether there is a benefit from pruning with DPP by simply summing up the sizes of all scan relations as the overhead. However, for a cached relation, the overhead should be different from that of a non-cached relation. This PR proposes to use an adjusted overhead for cached relations with DPP.

### Does this PR introduce _any_ user-facing change?

Yes.

### How was this patch tested?

Added unit test.

Closes #33975 from viirya/reduce-cache-overhead.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-09-15 14:00:45 -07:00
Dongjoon Hyun 16f1f71ba5 [SPARK-36759][BUILD] Upgrade Scala to 2.12.15
### What changes were proposed in this pull request?

This PR aims to upgrade Scala to 2.12.15 to support Java 17/18 better.

### Why are the changes needed?

Scala 2.12.15 improves compatibility with JDK 17 and 18:

https://github.com/scala/scala/releases/tag/v2.12.15

- Avoids IllegalArgumentException in JDK 17+ for lambda deserialization
- Upgrades to ASM 9.2, for JDK 18 support in optimizer

### Does this PR introduce _any_ user-facing change?

Yes, this is a Scala version change.

### How was this patch tested?

Pass the CIs

Closes #33999 from dongjoon-hyun/SPARK-36759.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-09-15 13:43:25 -07:00
Chao Sun a927b0836b [SPARK-36726] Upgrade Parquet to 1.12.1
### What changes were proposed in this pull request?

Upgrade Apache Parquet to 1.12.1

### Why are the changes needed?

Parquet 1.12.1 contains the following bug fixes:
- PARQUET-2064: Make Range public accessible in RowRanges
- PARQUET-2022: ZstdDecompressorStream should close `zstdInputStream`
- PARQUET-2052: Integer overflow when writing huge binary using dictionary encoding
- PARQUET-1633: Fix integer overflow
- PARQUET-2054: fix TCP leaking when calling ParquetFileWriter.appendFile
- PARQUET-2072: Do Not Determine Both Min/Max for Binary Stats
- PARQUET-2073: Fix estimate remaining row count in ColumnWriteStoreBase
- PARQUET-2078: Failed to read parquet file after writing with the same

In particular PARQUET-2078 is a blocker for the upcoming Apache Spark 3.2.0 release.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Existing tests + a new test for the issue in SPARK-36696

Closes #33969 from sunchao/upgrade-parquet-12.1.

Authored-by: Chao Sun <sunchao@apple.com>
Signed-off-by: DB Tsai <d_tsai@apple.com>
2021-09-15 19:17:34 +00:00
dgd-contributor c15072cc73 [SPARK-36722][PYTHON] Fix Series.update with another in same frame
### What changes were proposed in this pull request?
Fix Series.update with another Series in the same frame.

Also add a test for updating a Series in a different frame.

### Why are the changes needed?
Fix Series.update with another Series in the same frame.

Pandas behavior:
``` python
>>> pdf = pd.DataFrame(
...     {"a": [None, 2, 3, 4, 5, 6, 7, 8, None], "b": [None, 5, None, 3, 2, 1, None, 0, 0]},
... )
>>> pdf
     a    b
0  NaN  NaN
1  2.0  5.0
2  3.0  NaN
3  4.0  3.0
4  5.0  2.0
5  6.0  1.0
6  7.0  NaN
7  8.0  0.0
8  NaN  0.0
>>> pdf.a.update(pdf.b)
>>> pdf
     a    b
0  NaN  NaN
1  5.0  5.0
2  3.0  NaN
3  3.0  3.0
4  2.0  2.0
5  1.0  1.0
6  7.0  NaN
7  0.0  0.0
8  0.0  0.0
```

### Does this PR introduce _any_ user-facing change?
Before
```python
>>> psdf = ps.DataFrame(
...     {"a": [None, 2, 3, 4, 5, 6, 7, 8, None], "b": [None, 5, None, 3, 2, 1, None, 0, 0]},
... )

>>> psdf.a.update(psdf.b)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/dgd/spark/python/pyspark/pandas/series.py", line 4551, in update
    combined = combine_frames(self._psdf, other._psdf, how="leftouter")
  File "/Users/dgd/spark/python/pyspark/pandas/utils.py", line 141, in combine_frames
    assert not same_anchor(
AssertionError: We don't need to combine. `this` and `that` are same.
>>>
```

After
```python
>>> psdf = ps.DataFrame(
...     {"a": [None, 2, 3, 4, 5, 6, 7, 8, None], "b": [None, 5, None, 3, 2, 1, None, 0, 0]},
... )

>>> psdf.a.update(psdf.b)
>>> psdf
     a    b
0  NaN  NaN
1  5.0  5.0
2  3.0  NaN
3  3.0  3.0
4  2.0  2.0
5  1.0  1.0
6  7.0  NaN
7  0.0  0.0
8  0.0  0.0
>>>
```

### How was this patch tested?
unit tests

Closes #33968 from dgd-contributor/SPARK-36722_fix_update_same_anchor.

Authored-by: dgd-contributor <dgd_contributor@viettel.com.vn>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-09-15 11:08:01 -07:00
Angerszhuuuu b665782f0d [SPARK-36755][SQL] ArraysOverlap should handle duplicated Double.NaN and Float.NaN
### What changes were proposed in this pull request?
For query
```
select arrays_overlap(array(cast('nan' as double), 1d), array(cast('nan' as double)))
```
This returns false, but it should return true.
The issue is caused by the fact that `scala.mutable.HashSet` can't handle `Double.NaN` and `Float.NaN`.

### Why are the changes needed?
Fix a bug.

### Does this PR introduce _any_ user-facing change?
After this change, `arrays_overlap` treats equal `NaN` values as overlapping.

### How was this patch tested?
Added UT

Closes #34006 from AngersZhuuuu/SPARK-36755.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-09-15 22:31:46 +08:00
Angerszhuuuu 638085953f [SPARK-36702][SQL][FOLLOWUP] ArrayUnion handle duplicated Double.NaN and Float.NaN
### What changes were proposed in this pull request?
According to https://github.com/apache/spark/pull/33955#discussion_r708570515, use a normalized NaN.

### Why are the changes needed?
Use a normalized NaN for duplicated NaN values.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existing UTs

Closes #34003 from AngersZhuuuu/SPARK-36702-FOLLOWUP.

Authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-09-15 22:04:09 +08:00
Leona Yoda 0666f5c003 [SPARK-36751][SQL][PYTHON][R] Add bit/octet_length APIs to Scala, Python and R
### What changes were proposed in this pull request?

octet_length: calculates the byte length of strings.
bit_length: calculates the bit length of strings.
These two string-related functions are currently only implemented in Spark SQL, not in the Scala, Python, or R APIs.
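As a hedged sketch of the added Scala API (assuming the functions are exposed as `org.apache.spark.sql.functions.octet_length`/`bit_length`, with made-up sample data):

```scala
import org.apache.spark.sql.functions.{bit_length, octet_length}
import spark.implicits._

// "abc" is 3 bytes / 24 bits, while the 3-character multi-byte string below
// occupies 9 bytes / 72 bits in UTF-8 -- the case these APIs are meant to help with.
val df = Seq("abc", "あいう").toDF("s")
df.select(octet_length($"s"), bit_length($"s")).show()
```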

### Why are the changes needed?

These functions would be useful for users working with multi-byte characters who mainly use Scala, Python, or R.

### Does this PR introduce _any_ user-facing change?

Yes. Users can call the octet_length/bit_length APIs from Scala (DataFrame), Python, and R.

### How was this patch tested?

unit tests

Closes #33992 from yoda-mon/add-bit-octet-length.

Authored-by: Leona Yoda <yodal@oss.nttdata.com>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
2021-09-15 16:27:13 +09:00
Yuto Akutsu 5a9d4c17de [SPARK-36660][SQL][FOLLOW-UP] Add cot to pyspark.sql.rst
### What changes were proposed in this pull request?

Added cot to pyspark.sql.rst (follow-up)

### Why are the changes needed?

[My previous PR](https://github.com/apache/spark/pull/33906) was missing it.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

manual check

Closes #34002 from yutoacts/SPARK-36660.

Authored-by: Yuto Akutsu <yuto.akutsu@oss.nttdata.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-09-15 13:05:44 +09:00
Kousuke Saruta e43b9e8520 [SPARK-36733][SQL] Fix a perf issue in SchemaPruning when a struct has many fields
### What changes were proposed in this pull request?

This PR fixes a perf issue in `SchemaPruning` when a struct has many fields (e.g. >10K fields).
The root cause is that `SchemaPruning.sortLeftFieldsByRight` performs an O(N * M) search:
```
val filteredRightFieldNames = rightStruct.fieldNames
  .filter(name => leftStruct.fieldNames.exists(resolver(_, name)))
```

To fix this issue, this PR proposes to use a `HashMap` so that the lookup takes constant time.
This PR also adds `case _ if left == right => left` to the method as a short-circuit.

### Why are the changes needed?

To fix a perf issue.

### Does this PR introduce _any_ user-facing change?

No. The logic should be identical.

### How was this patch tested?

I confirmed that the following micro benchmark finishes within a few seconds.
```
import org.apache.spark.sql.catalyst.expressions.SchemaPruning
import org.apache.spark.sql.types._

var struct1 = new StructType()
(1 to 50000).foreach { i =>
  struct1 = struct1.add(new StructField(i + "", IntegerType))
}

var struct2 = new StructType()
(50001 to 100000).foreach { i =>
  struct2 = struct2.add(new StructField(i + "", IntegerType))
}

SchemaPruning.sortLeftFieldsByRight(struct1, struct2)
SchemaPruning.sortLeftFieldsByRight(struct2, struct2)
```

The correctness should be checked by existing tests.

Closes #33981 from sarutak/improve-schemapruning-performance.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-09-15 10:33:58 +09:00
Hyukjin Kwon 0aaf86b520 [SPARK-36709][PYTHON] Support new syntax for specifying index type and name in pandas API on Spark
### What changes were proposed in this pull request?

This PR proposes new syntax to specify the index type and name in pandas API on Spark. This is a base work for SPARK-36707.

More specifically, users now can use the type hints when typing as below:

```
pd.DataFrame[int, [int, int]]
pd.DataFrame[pdf.index.dtype, pdf.dtypes]
pd.DataFrame[("index", int), [("id", int), ("A", int)]]
pd.DataFrame[(pdf.index.name, pdf.index.dtype), zip(pdf.columns, pdf.dtypes)]
```

Note that the types of `[("id", int), ("A", int)]` or  `("index", int)` are matched to how you provide a compound NumPy type (see also https://numpy.org/doc/stable/user/basics.rec.html#introduction).

Therefore, the syntax will be:

**Without index:**

```
pd.DataFrame[type, type, ...]
pd.DataFrame[name: type, name: type, ...]
pd.DataFrame[dtypes instance]
pd.DataFrame[zip(names, types)]
```

(New) **With index:**

```
pd.DataFrame[index_type, [type, ...]]
pd.DataFrame[(index_name, index_type), [(name, type), ...]]
pd.DataFrame[dtype instance, dtypes instance]
pd.DataFrame[(index_name, index_type), zip(names, types)]
```

### Why are the changes needed?

Currently, there is no way to specify the type hint for the index type - the type hints are converted to the return type of pandas UDFs internally. Therefore, we always attach a default index, which degrades performance:

```python
>>> def transform(pdf) -> pd.DataFrame[int, int]:
...     pdf['A'] = pdf.id + 1
...     return pdf
...
>>> ks.range(5).koalas.apply_batch(transform)
```

```
   c0  c1
0   0   1
1   1   2
2   2   3
3   3   4
4   4   5
```

The [default index](https://koalas.readthedocs.io/en/latest/user_guide/options.html#default-index-type) (for the first column that looks unnamed) is attached when the type hint is specified. For better performance, we should have a way to work around this; see also https://github.com/apache/spark/pull/33954#issuecomment-917742920 and [Specify the index column in conversion from Spark DataFrame to Koalas DataFrame](https://koalas.readthedocs.io/en/latest/user_guide/best_practices.html#specify-the-index-column-in-conversion-from-spark-dataframe-to-koalas-dataframe).

Note that this still remains as experimental because Python itself yet doesn't support such kind of typing out of the box. Once pandas completes typing support like NumPy did in `numpy.typing`, we should implement Koalas typing package, and migrate to it with leveraging pandas' typing way.

### Does this PR introduce _any_ user-facing change?

No, this PR does not yet affect any user-facing behavior in theory.

### How was this patch tested?

Unittests were added.

Closes #33954 from HyukjinKwon/SPARK-36709.

Authored-by: Hyukjin Kwon <gurwls223@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-09-15 10:13:33 +09:00
Kevin Su 3e5d3d1cfe [SPARK-34943][BUILD] Upgrade flake8 to 3.8.0 or above in Jenkins
### What changes were proposed in this pull request?

Upgrade flake8 to 3.8.0 or above in Jenkins

### Why are the changes needed?

In flake8 < 3.8.0, an F401 error occurs for imports in if statements when TYPE_CHECKING is True. However, TYPE_CHECKING is always False at runtime, so there is no need to treat it as an error in static analysis.

Since this behavior is fixed in flake8 >= 3.8.0, we should upgrade the flake8 installed in Jenkins to 3.8.0 or above. Otherwise, F401 errors occur for several lines in pandas-on-Spark code that uses TYPE_CHECKING.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Pass the CI

Closes #32749 from pingsutw/SPARK-34943.

Lead-authored-by: Kevin Su <pingsutw@gmail.com>
Co-authored-by: Hyukjin Kwon <gurwls223@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-09-15 09:24:50 +09:00
Dongjoon Hyun d730ef24fe [SPARK-36712][BUILD][FOLLOWUP] Improve the regex to avoid breaking pom.xml
### What changes were proposed in this pull request?

This PR aims to fix the regex to avoid breaking `pom.xml`.

### Why are the changes needed?

**BEFORE**
```
$ dev/change-scala-version.sh 2.12
$ git diff | head -n10
diff --git a/core/pom.xml b/core/pom.xml
index dbde22f2bf..6ed368353b 100644
--- a/core/pom.xml
+++ b/core/pom.xml
@@ -35,7 +35,7 @@
   </properties>

   <dependencies>
-    <!--<!--
```

**AFTER**
Since the default Scala version is `2.12`, the following `no-op` is the correct behavior which is consistent with the previous behavior.
```
$ dev/change-scala-version.sh 2.12
$ git diff
```

### Does this PR introduce _any_ user-facing change?

No. This is a dev only change.

### How was this patch tested?

Manually.

Closes #33996 from dongjoon-hyun/SPARK-36712.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-09-14 16:26:50 -07:00
yangjie01 119ddd7e95 [SPARK-36737][BUILD][CORE][SQL][SS] Upgrade Apache commons-io to 2.11.0 and revert change of SPARK-36456
### What changes were proposed in this pull request?
SPARK-36456 changed the code to use `JavaUtils.closeQuietly` instead of `IOUtils.closeQuietly`, but the two methods differ slightly in default behavior: both swallow IOException, but the former logs it as ERROR while the latter doesn't log by default.

The Apache commons-io community decided to retain the `IOUtils.closeQuietly` method in the [new version](75f20dca72/src/main/java/org/apache/commons/io/IOUtils.java (L465-L467)) and removed the deprecated annotation; the change has been released in version 2.11.0.

So this PR upgrades Apache commons-io to 2.11.0 and reverts the change of SPARK-36456 to maintain the original behavior (don't print an error log).

### Why are the changes needed?

1. Upgrade Apache commons-io to 2.11.0 to use the non-deprecated `closeQuietly` API. Other changes related to Apache commons-io are detailed in [commons-io/changes-report](https://commons.apache.org/proper/commons-io/changes-report.html#a2.11.0).

2. Revert the change of SPARK-36456 to maintain the original `IOUtils.closeQuietly` behavior (don't print an error log).

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Pass the Jenkins or GitHub Action

Closes #33977 from LuciferYang/upgrade-commons-io.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Jungtaek Lim <kabhwan.opensource@gmail.com>
2021-09-14 21:16:58 +09:00
Angerszhuuuu f71f37755d [SPARK-36702][SQL] ArrayUnion should handle duplicated Double.NaN and Float.NaN
### What changes were proposed in this pull request?
For query
```
select array_union(array(cast('nan' as double), cast('nan' as double)), array())
```
This returns [NaN, NaN], but it should return [NaN].
The issue is caused by the fact that `OpenHashSet` can't handle `Double.NaN` and `Float.NaN` either.
In this PR we add a wrapper for `OpenHashSet` that can handle `null`, `Double.NaN`, and `Float.NaN` together.

### Why are the changes needed?
Fix a bug.

### Does this PR introduce _any_ user-facing change?
`ArrayUnion` will no longer return duplicated `NaN` values.

### How was this patch tested?
Added UT

Closes #33955 from AngersZhuuuu/SPARK-36702-WrapOpenHashSet.

Lead-authored-by: Angerszhuuuu <angers.zhu@gmail.com>
Co-authored-by: AngersZhuuuu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-09-14 18:25:47 +08:00
Minchu Yang 2d7dc7c7ce [SPARK-36705][FOLLOW-UP] Fix unnecessary logWarning when PUSH_BASED_SHUFFLE_ENABLED is set to false
### What changes were proposed in this pull request?

Only log the warning when `PUSH_BASED_SHUFFLE_ENABLED` is set to true and `canDoPushBasedShuffle` is false.

### Why are the changes needed?

Currently, this warning is still printed even when `PUSH_BASED_SHUFFLE_ENABLED` is set to false, which is unnecessary.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Passed existing UT.

Closes #33984 from rmcyang/SPARK-36705-follow-up.

Authored-by: Minchu Yang <minyang@minyang-mn3.linkedin.biz>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
2021-09-13 23:23:33 -05:00
Xinrong Meng 1ed671cec6 [SPARK-36748][PYTHON] Introduce the 'compute.isin_limit' option
### What changes were proposed in this pull request?
Introduce the 'compute.isin_limit' option, with the default value of 80.

### Why are the changes needed?
`Column.isin(list)` doesn't perform well when the given `list` is large, as reported in https://issues.apache.org/jira/browse/SPARK-33383.
Thus, 'compute.isin_limit' is introduced to constrain the usage of `Column.isin(list)` in the code base.
If the length of the `list` is above `'compute.isin_limit'`, a broadcast join is used instead for better performance.

#### Why is the default value 80?
After reproducing the benchmark mentioned in https://issues.apache.org/jira/browse/SPARK-33383,

| length of filtering list | isin time /ms| broadcast DF time / ms|
| :---:   | :-: | :-: |
| 200 | 69411 | 39296 |
| 100 | 43074 | 40087 |
| 80 | 35592 | 40350 |
| 50 | 28134 | 37847 |

We can see that when the length of the filtering list is <= 80, the `isin` approach performs better than the `broadcast DF` approach.

### Does this PR introduce _any_ user-facing change?
Users may read/write the value of `'compute.isin_limit'` as follows
```py
>>> ps.get_option('compute.isin_limit')
80

>>> ps.set_option('compute.isin_limit', 10)
>>> ps.get_option('compute.isin_limit')
10

>>> ps.set_option('compute.isin_limit', -1)
...
ValueError: 'compute.isin_limit' should be greater than or equal to 0.

>>> ps.reset_option('compute.isin_limit')
>>> ps.get_option('compute.isin_limit')
80
```

### How was this patch tested?
Manual test.

Closes #33982 from xinrong-databricks/new_option.

Authored-by: Xinrong Meng <xinrong.meng@databricks.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-09-14 11:37:35 +09:00
Fu Chen 52c5ff20ca [SPARK-36715][SQL] InferFiltersFromGenerate should not infer filter for udf
### What changes were proposed in this pull request?

Fix an `InferFiltersFromGenerate` bug: `InferFiltersFromGenerate` should not infer a filter for a Generate when its children contain an expression that is an instance of `org.apache.spark.sql.catalyst.expressions.UserDefinedExpression`.
Before this PR, the following case throws an exception.

```scala
spark.udf.register("vec", (i: Int) => (0 until i).toArray)
sql("select explode(vec(8)) as c1").show
```

```
Once strategy's idempotence is broken for batch Infer Filters
 GlobalLimit 21                                                        GlobalLimit 21
 +- LocalLimit 21                                                      +- LocalLimit 21
    +- Project [cast(c1#3 as string) AS c1#12]                            +- Project [cast(c1#3 as string) AS c1#12]
       +- Generate explode(vec(8)), false, [c1#3]                            +- Generate explode(vec(8)), false, [c1#3]
          +- Filter ((size(vec(8), true) > 0) AND isnotnull(vec(8)))            +- Filter ((size(vec(8), true) > 0) AND isnotnull(vec(8)))
!            +- OneRowRelation                                                     +- Filter ((size(vec(8), true) > 0) AND isnotnull(vec(8)))
!                                                                                     +- OneRowRelation

java.lang.RuntimeException:
Once strategy's idempotence is broken for batch Infer Filters
 GlobalLimit 21                                                        GlobalLimit 21
 +- LocalLimit 21                                                      +- LocalLimit 21
    +- Project [cast(c1#3 as string) AS c1#12]                            +- Project [cast(c1#3 as string) AS c1#12]
       +- Generate explode(vec(8)), false, [c1#3]                            +- Generate explode(vec(8)), false, [c1#3]
          +- Filter ((size(vec(8), true) > 0) AND isnotnull(vec(8)))            +- Filter ((size(vec(8), true) > 0) AND isnotnull(vec(8)))
!            +- OneRowRelation                                                     +- Filter ((size(vec(8), true) > 0) AND isnotnull(vec(8)))
!                                                                                     +- OneRowRelation

	at org.apache.spark.sql.errors.QueryExecutionErrors$.onceStrategyIdempotenceIsBrokenForBatchError(QueryExecutionErrors.scala:1200)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor.checkBatchIdempotence(RuleExecutor.scala:168)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1(RuleExecutor.scala:254)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$execute$1$adapted(RuleExecutor.scala:200)
	at scala.collection.immutable.List.foreach(List.scala:431)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor.execute(RuleExecutor.scala:200)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor.$anonfun$executeAndTrack$1(RuleExecutor.scala:179)
	at org.apache.spark.sql.catalyst.QueryPlanningTracker$.withTracker(QueryPlanningTracker.scala:88)
	at org.apache.spark.sql.catalyst.rules.RuleExecutor.executeAndTrack(RuleExecutor.scala:179)
	at org.apache.spark.sql.execution.QueryExecution.$anonfun$optimizedPlan$1(QueryExecution.scala:138)
	at org.apache.spark.sql.catalyst.QueryPlanningTracker.measurePhase(QueryPlanningTracker.scala:111)
	at org.apache.spark.sql.execution.QueryExecution.$anonfun$executePhase$1(QueryExecution.scala:196)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
	at org.apache.spark.sql.execution.QueryExecution.executePhase(QueryExecution.scala:196)
	at org.apache.spark.sql.execution.QueryExecution.optimizedPlan$lzycompute(QueryExecution.scala:134)
	at org.apache.spark.sql.execution.QueryExecution.optimizedPlan(QueryExecution.scala:130)
	at org.apache.spark.sql.execution.QueryExecution.assertOptimized(QueryExecution.scala:148)
	at org.apache.spark.sql.execution.QueryExecution.$anonfun$executedPlan$1(QueryExecution.scala:166)
	at org.apache.spark.sql.execution.QueryExecution.withCteMap(QueryExecution.scala:73)
	at org.apache.spark.sql.execution.QueryExecution.executedPlan$lzycompute(QueryExecution.scala:163)
	at org.apache.spark.sql.execution.QueryExecution.executedPlan(QueryExecution.scala:163)
	at org.apache.spark.sql.execution.QueryExecution.simpleString(QueryExecution.scala:214)
	at org.apache.spark.sql.execution.QueryExecution.org$apache$spark$sql$execution$QueryExecution$$explainString(QueryExecution.scala:259)
	at org.apache.spark.sql.execution.QueryExecution.explainString(QueryExecution.scala:228)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$5(SQLExecution.scala:98)
	at org.apache.spark.sql.execution.SQLExecution$.withSQLConfPropagated(SQLExecution.scala:163)
	at org.apache.spark.sql.execution.SQLExecution$.$anonfun$withNewExecutionId$1(SQLExecution.scala:90)
	at org.apache.spark.sql.SparkSession.withActive(SparkSession.scala:775)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:64)
	at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3731)
	at org.apache.spark.sql.Dataset.head(Dataset.scala:2755)
	at org.apache.spark.sql.Dataset.take(Dataset.scala:2962)
	at org.apache.spark.sql.Dataset.getRows(Dataset.scala:288)
	at org.apache.spark.sql.Dataset.showString(Dataset.scala:327)
	at org.apache.spark.sql.Dataset.show(Dataset.scala:807)
```

### Does this PR introduce _any_ user-facing change?

No, this is only a bug fix.

### How was this patch tested?

Unit test.

Closes #33956 from cfmcgrady/SPARK-36715.

Authored-by: Fu Chen <cfmcgrady@gmail.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-09-14 09:26:11 +09:00
Leona Yoda a440025f08 [SPARK-36739][DOCS][PYTHON] Add apache license headers to makefiles
### What changes were proposed in this pull request?

Add apache license headers to makefiles of PySpark documents.

### Why are the changes needed?

The makefiles of the PySpark documentation do not have Apache license headers, while the other files do.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

`make html`

Closes #33979 from yoda-mon/add-license-header-makefiles.

Authored-by: Leona Yoda <yodal@oss.nttdata.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-09-14 09:16:05 +09:00
dgd-contributor f8657d1924 [SPARK-36653][PYTHON] Implement Series.__xor__ and Series.__rxor__
### What changes were proposed in this pull request?
Implement Series.\_\_xor__ and Series.\_\_rxor__

### Why are the changes needed?
Follow pandas

### Does this PR introduce _any_ user-facing change?
Yes, user can use
``` python
psdf = ps.DataFrame([[11, 11], [1, 2]])
psdf[0] ^ psdf[1]
```

### How was this patch tested?
unit tests

Closes #33911 from dgd-contributor/SPARK-36653_Implement_Series._xor_.

Authored-by: dgd-contributor <dgd_contributor@viettel.com.vn>
Signed-off-by: Takuya UESHIN <ueshin@databricks.com>
2021-09-13 15:09:22 -07:00
Minchu Yang 999473b1a5 [SPARK-36705][SHUFFLE] Disable push based shuffle when IO encryption is enabled or serializer is not relocatable
### What changes were proposed in this pull request?

Disable push-based shuffle when IO encryption is enabled or serializer does not support relocation of serialized objects.

### Why are the changes needed?

Push based shuffle is not compatible with IO encryption or non-relocatable serialization.
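A hedged sketch of the conflicting configuration this change guards against, using the standard config keys to the best of my knowledge:

```scala
import org.apache.spark.SparkConf

// With both settings below, push-based shuffle is now disabled (with a warning)
// rather than attempted, because shuffle block push is incompatible with IO encryption.
val conf = new SparkConf()
  .set("spark.shuffle.push.enabled", "true")  // request push-based shuffle
  .set("spark.io.encryption.enabled", "true") // IO encryption conflicts with it
```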

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added some tests to check whether push-based shuffle can be disabled successfully when IO encryption is enabled or a serializer that does not support relocation of serialized object is used.

Closes #33976 from rmcyang/SPARK-36705.

Authored-by: Minchu Yang <minyang@minyang-mn3.linkedin.biz>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
2021-09-13 16:14:35 -05:00
Lukas Rytz 1a62e6a2c1 [SPARK-36712][BUILD] Make scala-parallel-collections in 2.13 POM a direct dependency (not in maven profile)
As [reported on `dev@spark.apache.org`](https://lists.apache.org/thread.html/r84cff66217de438f1389899e6d6891b573780159cd45463acf3657aa%40%3Cdev.spark.apache.org%3E), the published POMs when building with Scala 2.13 have the `scala-parallel-collections` dependency only in the `scala-2.13` profile of the pom.

### What changes were proposed in this pull request?

This PR proposes to work around this by un-commenting the `scala-parallel-collections` dependency when switching to 2.13 using the `change-scala-version.sh` script.

I included an upgrade to scala-parallel-collections version 1.0.3; the changes compared to 0.2.0 are minor:
  - removed OSGi metadata
  - renamed some internal inner classes
  - added `Automatic-Module-Name`

### Why are the changes needed?

According to the posts, this solves issues for developers that write unit tests for their applications.

Stephen Coy suggested to use the https://www.mojohaus.org/flatten-maven-plugin. While this sounds like a more principled solution, it is possibly too risky to do at this specific point in time?

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Locally

Closes #33948 from lrytz/parCollDep.

Authored-by: Lukas Rytz <lukas.rytz@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-09-13 11:06:50 -05:00
Max Gekk bd62ad9982 [SPARK-36736][SQL] Support ILIKE (ALL | ANY | SOME) - case insensitive LIKE
### What changes were proposed in this pull request?
In the PR, I propose to support a case-insensitive variant of the `LIKE (ALL | ANY | SOME)` expression - `ILIKE`. In this way, Spark users can match strings against patterns in a case-insensitive manner. For example:
```sql
spark-sql> create table ilike_example(subject varchar(20));
spark-sql> insert into ilike_example values
         > ('jane doe'),
         > ('Jane Doe'),
         > ('JANE DOE'),
         > ('John Doe'),
         > ('John Smith');
spark-sql> select *
         > from ilike_example
         > where subject ilike any ('jane%', '%SMITH')
         > order by subject;
JANE DOE
Jane Doe
John Smith
jane doe
```

The syntax of `ILIKE` is similar to `LIKE`:
```
str NOT? ILIKE (ANY | SOME | ALL) (pattern+)
```

### Why are the changes needed?
1. To improve user experience with Spark SQL. No need to use `lower(col_name)` in where clauses.
2. To make migration from other popular DBMSs to Spark SQL easier. The DBMSs below support `ilike` in SQL:
    - [Snowflake](https://docs.snowflake.com/en/sql-reference/functions/ilike.html#ilike)
    - [PostgreSQL](https://www.postgresql.org/docs/12/functions-matching.html)
    - [CockroachDB](https://www.cockroachlabs.com/docs/stable/functions-and-operators.html)

### Does this PR introduce _any_ user-facing change?
No, it doesn't. The PR **extends** existing APIs.

### How was this patch tested?
1. By running of expression examples via:
```
$ build/sbt "sql/test:testOnly org.apache.spark.sql.expressions.ExpressionInfoSuite"
```
2. Added new test to test parsing of `ILIKE`:
```
$ build/sbt "test:testOnly *.ExpressionParserSuite"
```
3. Via existing test suites:
```
$ build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite -- -z ilike-any.sql"
$ build/sbt "sql/testOnly org.apache.spark.sql.SQLQueryTestSuite -- -z ilike-all.sql"
```

Closes #33966 from MaxGekk/ilike-any.

Authored-by: Max Gekk <max.gekk@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-09-13 22:51:49 +08:00
Kousuke Saruta e858cd568a [SPARK-36724][SQL] Support timestamp_ntz as a type of time column for SessionWindow
### What changes were proposed in this pull request?

This PR proposes to support `timestamp_ntz` as a type of time column for `SessionWindow`, like `TimeWindow` does.
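A hedged sketch of a session window over a `TIMESTAMP_NTZ` time column; the data, column names, and the CAST-based construction of the NTZ value are illustrative assumptions:

```scala
import org.apache.spark.sql.functions.{count, session_window}
import spark.implicits._

// Build a tiny DataFrame whose time column has the TIMESTAMP_NTZ type.
val events = spark.sql(
  "SELECT CAST('2021-09-13 10:00:00' AS TIMESTAMP_NTZ) AS ts, 'user-a' AS userId")

// With this change, session_window accepts the timestamp_ntz column just as it
// accepts a regular timestamp column.
events
  .groupBy(session_window($"ts", "5 minutes"), $"userId")
  .agg(count("*"))
  .show(truncate = false)
```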

### Why are the changes needed?

For better usability.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New test.

Closes #33965 from sarutak/session-window-ntz.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Gengliang Wang <gengliang@apache.org>
2021-09-13 21:47:43 +08:00
Yuto Akutsu 3747cfdb40 [SPARK-36738][SQL][DOC] Fixed the wrong documentation on Cot API
### What changes were proposed in this pull request?

Fixed wrong documentation on Cot API

### Why are the changes needed?

[Doc](https://spark.apache.org/docs/latest/api/sql/index.html#cot) says `1/java.lang.Math.cot` but it should be `1/java.lang.Math.tan`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual check.

Closes #33978 from yutoacts/SPARK-36738.

Authored-by: Yuto Akutsu <yuto.akutsu@nttdata.com>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
2021-09-13 21:51:29 +09:00
ulysses-you 4a6b2b9fc8 [SPARK-33832][SQL] Support optimize skewed join even if introduce extra shuffle
### What changes were proposed in this pull request?

- move the rule `OptimizeSkewedJoin` from stage optimization phase to stage preparation phase.
- run the rule `EnsureRequirements` one more time after the `OptimizeSkewedJoin` rule in the stage preparation phase.
- add `SkewJoinAwareCost` to support estimating the cost of a skewed join
- add a new config to decide whether to force-optimize skewed joins
- in `OptimizeSkewedJoin`, we generate 2 physical plans, one with skew join optimization and one without. Then we use the cost evaluator w.r.t. the force-skew-join flag and pick the plan with the lower cost.

### Why are the changes needed?

In general, a skewed join has a bigger impact on performance than one more shuffle, so it makes sense to force skewed-join optimization even if it introduces an extra shuffle.

A common case:
```
HashAggregate
  SortMergJoin
    Sort
      Exchange
    Sort
      Exchange
```
and after this PR, the plan looks like:
```
HashAggregate
  Exchange
    SortMergJoin (isSkew=true)
      Sort
        Exchange
      Sort
        Exchange
```

Note that the newly introduced shuffle can also be optimized by AQE.

### Does this PR introduce _any_ user-facing change?

Yes, a new config.
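A hedged sketch of opting in to the new behavior; the exact key of the new flag is assumed here to be `spark.sql.adaptive.forceOptimizeSkewedJoin` and should be verified against the PR:

```scala
// AQE and skew-join handling must be on; the third flag (name assumed, see above)
// asks Spark to apply the skew-join optimization even if it adds an extra shuffle.
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")
spark.conf.set("spark.sql.adaptive.forceOptimizeSkewedJoin", "true")
```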

### How was this patch tested?

* Add a new test
* Pass the existing test `SPARK-30524: Do not optimize skew join if introduce additional shuffle`
* Pass the existing test `SPARK-33551: Do not use custom shuffle reader for repartition`

Closes #32816 from ulysses-you/support-extra-shuffle.

Authored-by: ulysses-you <ulyssesyou18@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2021-09-13 17:21:27 +08:00
Kousuke Saruta e1e19619b7 [SPARK-36729][BUILD] Upgrade Netty from 4.1.63 to 4.1.68
### What changes were proposed in this pull request?

This PR upgrades Netty from `4.1.63` to `4.1.68`.

All the changes from `4.1.64` to `4.1.68` are as follows.

* 4.1.64 and 4.1.65
  * https://netty.io/news/2021/05/19/4-1-65-Final.html
* 4.1.66
  * https://netty.io/news/2021/07/16/4-1-66-Final.html
* 4.1.67
  * https://netty.io/news/2021/08/16/4-1-67-Final.html
* 4.1.68
  * https://netty.io/news/2021/09/09/4-1-68-Final.html

### Why are the changes needed?

Recently Netty `4.1.68` was released, which includes official M1 Mac support.
* Add support for mac m1
  * https://github.com/netty/netty/pull/11666

`4.1.65` also includes a critical bug fix that might affect Spark.
* JNI classloader deadlock with latest JDK version
  * https://github.com/netty/netty/issues/11209

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

CIs.

Closes #33970 from sarutak/upgrade-netty-4.1.68.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-09-12 10:07:27 -07:00
yangjie01 0e1157df06 [SPARK-36636][CORE][TEST] LocalSparkCluster change to use tmp workdir in test to avoid directory name collision
### What changes were proposed in this pull request?
As described in SPARK-36636, if test cases with the config `local-cluster[n, c, m]` are run back to back within one second, a workdir name collision occurs, because the app id currently uses the format `app-yyyyMMddHHmmss-0000` and the test workdir name is derived from it. The related logs are as follows:

```
java.io.IOException: Failed to create directory /spark-mine/work/app-20210908074432-0000/1
	at org.apache.spark.deploy.worker.Worker$$anonfun$receive$1.applyOrElse(Worker.scala:578)
	at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
	at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
	at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
	at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
	at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
21/09/08 22:44:32.266 dispatcher-event-loop-0 INFO Worker: Asked to launch executor app-20210908074432-0000/0 for test
21/09/08 22:44:32.266 dispatcher-event-loop-0 ERROR Worker: Failed to launch executor app-20210908074432-0000/0 for test.
java.io.IOException: Failed to create directory /spark-mine/work/app-20210908074432-0000/0
	at org.apache.spark.deploy.worker.Worker$$anonfun$receive$1.applyOrElse(Worker.scala:578)
	at org.apache.spark.rpc.netty.Inbox.$anonfun$process$1(Inbox.scala:115)
	at org.apache.spark.rpc.netty.Inbox.safelyCall(Inbox.scala:213)
	at org.apache.spark.rpc.netty.Inbox.process(Inbox.scala:100)
	at org.apache.spark.rpc.netty.MessageLoop.org$apache$spark$rpc$netty$MessageLoop$$receiveLoop(MessageLoop.scala:75)
	at org.apache.spark.rpc.netty.MessageLoop$$anon$1.run(MessageLoop.scala:41)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
```

Since the default value of `spark.deploy.maxExecutorRetries` is 10, the test failure will occur when 5 consecutive cases with `local-cluster[3, 1, 1024]` are completed within 1 second:

1. case 1: use worker directories: `/app-202109102324-0000/0`, `/app-202109102324-0000/1`, `/app-202109102324-0000/2`
2. case 2: retry 3 times then use worker directories: `/app-202109102324-0000/3`, `/app-202109102324-0000/4`, `/app-202109102324-0000/5`
3. case 3: retry 6 times then use worker directories: `/app-202109102324-0000/6`, `/app-202109102324-0000/7`, `/app-202109102324-0000/8`
4. case 4: retry 9 times then use worker directories: `/app-202109102324-0000/9`, `/app-202109102324-0000/10`, `/app-202109102324-0000/11`
5. case 5: retry more than **10** times then **failed**

To avoid this issue, this PR changes the tests with the config `local-cluster[n, c, m]` to use a temporary workdir.
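
A minimal sketch of the idea (not the exact patch; the wiring into `LocalSparkCluster` is omitted):

```
import java.nio.file.Files

// Create a unique temporary work directory per test run, so that two runs that
// happen to share the same app id (app-yyyyMMddHHmmss-0000) cannot collide on disk.
val tmpWorkDir = Files.createTempDirectory("spark-test-workdir-").toFile
tmpWorkDir.deleteOnExit()
// The workers of the local cluster are then started with this directory as their workdir.
```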

### Why are the changes needed?
Avoid UT failures caused by consecutive workdir name collisions.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

- Pass GA or Jenkins Tests.
- Manual test: `build/mvn clean install -Pscala-2.13 -pl core -am` or `build/mvn clean install -pl core -am`; with Scala 2.13 it is easier to reproduce this problem.

**Before**

The test failure logs are as follows, and the failures occur randomly:
```
- SPARK-33084: Add jar support Ivy URI -- test exclude param when transitive=true *** FAILED ***
  org.apache.spark.SparkException: Only one SparkContext should be running in this JVM (see SPARK-2243).The currently running SparkContext was created at:
org.apache.spark.SparkContextSuite.$anonfun$new$134(SparkContextSuite.scala:1101)
org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
org.scalatest.Transformer.apply(Transformer.scala:22)
org.scalatest.Transformer.apply(Transformer.scala:20)
org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226)
org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:190)
org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:224)
org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:236)
org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:236)
org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:218)
org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:62)
org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:234)
org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:227)
org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:62)
org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:269)
org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413)
scala.collection.immutable.List.foreach(List.scala:333)
  at org.apache.spark.SparkContext$.$anonfun$assertNoOtherContextIsRunning$2(SparkContext.scala:2647)
  at scala.Option.foreach(Option.scala:437)
  at org.apache.spark.SparkContext$.assertNoOtherContextIsRunning(SparkContext.scala:2644)
  at org.apache.spark.SparkContext$.markPartiallyConstructed(SparkContext.scala:2734)
  at org.apache.spark.SparkContext.<init>(SparkContext.scala:95)
  at org.apache.spark.SparkContextSuite.$anonfun$new$138(SparkContextSuite.scala:1109)
  at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
  at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
  at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
  at org.scalatest.Transformer.apply(Transformer.scala:22)
  ...
- SPARK-33084: Add jar support Ivy URI -- test different version *** FAILED ***
  org.apache.spark.SparkException: Only one SparkContext should be running in this JVM (see SPARK-2243).The currently running SparkContext was created at:
org.apache.spark.SparkContextSuite.$anonfun$new$134(SparkContextSuite.scala:1101)
org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
org.scalatest.Transformer.apply(Transformer.scala:22)
org.scalatest.Transformer.apply(Transformer.scala:20)
org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226)
org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:190)
org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:224)
org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:236)
org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:236)
org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:218)
org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:62)
org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:234)
org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:227)
org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:62)
org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:269)
org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413)
scala.collection.immutable.List.foreach(List.scala:333)
  at org.apache.spark.SparkContext$.$anonfun$assertNoOtherContextIsRunning$2(SparkContext.scala:2647)
  at scala.Option.foreach(Option.scala:437)
  at org.apache.spark.SparkContext$.assertNoOtherContextIsRunning(SparkContext.scala:2644)
  at org.apache.spark.SparkContext$.markPartiallyConstructed(SparkContext.scala:2734)
  at org.apache.spark.SparkContext.<init>(SparkContext.scala:95)
  at org.apache.spark.SparkContextSuite.$anonfun$new$142(SparkContextSuite.scala:1118)
  at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
  at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
  at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
  at org.scalatest.Transformer.apply(Transformer.scala:22)
  ...
- SPARK-33084: Add jar support Ivy URI -- test invalid param *** FAILED ***
  org.apache.spark.SparkException: Only one SparkContext should be running in this JVM (see SPARK-2243).The currently running SparkContext was created at:
org.apache.spark.SparkContextSuite.$anonfun$new$134(SparkContextSuite.scala:1101)
org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
org.scalatest.Transformer.apply(Transformer.scala:22)
org.scalatest.Transformer.apply(Transformer.scala:20)
org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226)
org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:190)
org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:224)
org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:236)
org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:236)
org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:218)
org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:62)
org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:234)
org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:227)
org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:62)
org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:269)
org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413)
scala.collection.immutable.List.foreach(List.scala:333)
  at org.apache.spark.SparkContext$.$anonfun$assertNoOtherContextIsRunning$2(SparkContext.scala:2647)
  at scala.Option.foreach(Option.scala:437)
  at org.apache.spark.SparkContext$.assertNoOtherContextIsRunning(SparkContext.scala:2644)
  at org.apache.spark.SparkContext$.markPartiallyConstructed(SparkContext.scala:2734)
  at org.apache.spark.SparkContext.<init>(SparkContext.scala:95)
  at org.apache.spark.SparkContextSuite.$anonfun$new$146(SparkContextSuite.scala:1129)
  at org.apache.spark.SparkFunSuite.withLogAppender(SparkFunSuite.scala:235)
  at org.apache.spark.SparkContextSuite.$anonfun$new$145(SparkContextSuite.scala:1127)
  at scala.runtime.java8.JFunction0$mcV$sp.apply(JFunction0$mcV$sp.scala:18)
  at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
  ...
- SPARK-33084: Add jar support Ivy URI -- test multiple transitive params *** FAILED ***
  org.apache.spark.SparkException: Only one SparkContext should be running in this JVM (see SPARK-2243).The currently running SparkContext was created at:
org.apache.spark.SparkContextSuite.$anonfun$new$134(SparkContextSuite.scala:1101)
org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
org.scalatest.Transformer.apply(Transformer.scala:22)
org.scalatest.Transformer.apply(Transformer.scala:20)
org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226)
org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:190)
org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:224)
org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:236)
org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:236)
org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:218)
org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:62)
org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:234)
org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:227)
org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:62)
org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:269)
org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413)
scala.collection.immutable.List.foreach(List.scala:333)
  at org.apache.spark.SparkContext$.$anonfun$assertNoOtherContextIsRunning$2(SparkContext.scala:2647)
  at scala.Option.foreach(Option.scala:437)
  at org.apache.spark.SparkContext$.assertNoOtherContextIsRunning(SparkContext.scala:2644)
  at org.apache.spark.SparkContext$.markPartiallyConstructed(SparkContext.scala:2734)
  at org.apache.spark.SparkContext.<init>(SparkContext.scala:95)
  at org.apache.spark.SparkContextSuite.$anonfun$new$149(SparkContextSuite.scala:1140)
  at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
  at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
  at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
  at org.scalatest.Transformer.apply(Transformer.scala:22)
  ...
- SPARK-33084: Add jar support Ivy URI -- test param key case sensitive *** FAILED ***
  org.apache.spark.SparkException: Only one SparkContext should be running in this JVM (see SPARK-2243).The currently running SparkContext was created at:
org.apache.spark.SparkContextSuite.$anonfun$new$134(SparkContextSuite.scala:1101)
org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
org.scalatest.Transformer.apply(Transformer.scala:22)
org.scalatest.Transformer.apply(Transformer.scala:20)
org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226)
org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:190)
org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:224)
org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:236)
org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:236)
org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:218)
org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:62)
org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:234)
org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:227)
org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:62)
org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:269)
org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413)
scala.collection.immutable.List.foreach(List.scala:333)
  at org.apache.spark.SparkContext$.$anonfun$assertNoOtherContextIsRunning$2(SparkContext.scala:2647)
  at scala.Option.foreach(Option.scala:437)
  at org.apache.spark.SparkContext$.assertNoOtherContextIsRunning(SparkContext.scala:2644)
  at org.apache.spark.SparkContext$.markPartiallyConstructed(SparkContext.scala:2734)
  at org.apache.spark.SparkContext.<init>(SparkContext.scala:95)
  at org.apache.spark.SparkContextSuite.$anonfun$new$154(SparkContextSuite.scala:1155)
  at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
  at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
  at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
  at org.scalatest.Transformer.apply(Transformer.scala:22)
  ...
- SPARK-33084: Add jar support Ivy URI -- test transitive value case insensitive *** FAILED ***
  org.apache.spark.SparkException: Only one SparkContext should be running in this JVM (see SPARK-2243).The currently running SparkContext was created at:
org.apache.spark.SparkContextSuite.$anonfun$new$134(SparkContextSuite.scala:1101)
org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
org.scalatest.Transformer.apply(Transformer.scala:22)
org.scalatest.Transformer.apply(Transformer.scala:20)
org.scalatest.funsuite.AnyFunSuiteLike$$anon$1.apply(AnyFunSuiteLike.scala:226)
org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:190)
org.scalatest.funsuite.AnyFunSuiteLike.invokeWithFixture$1(AnyFunSuiteLike.scala:224)
org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTest$1(AnyFunSuiteLike.scala:236)
org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
org.scalatest.funsuite.AnyFunSuiteLike.runTest(AnyFunSuiteLike.scala:236)
org.scalatest.funsuite.AnyFunSuiteLike.runTest$(AnyFunSuiteLike.scala:218)
org.apache.spark.SparkFunSuite.org$scalatest$BeforeAndAfterEach$$super$runTest(SparkFunSuite.scala:62)
org.scalatest.BeforeAndAfterEach.runTest(BeforeAndAfterEach.scala:234)
org.scalatest.BeforeAndAfterEach.runTest$(BeforeAndAfterEach.scala:227)
org.apache.spark.SparkFunSuite.runTest(SparkFunSuite.scala:62)
org.scalatest.funsuite.AnyFunSuiteLike.$anonfun$runTests$1(AnyFunSuiteLike.scala:269)
org.scalatest.SuperEngine.$anonfun$runTestsInBranch$1(Engine.scala:413)
scala.collection.immutable.List.foreach(List.scala:333)
  at org.apache.spark.SparkContext$.$anonfun$assertNoOtherContextIsRunning$2(SparkContext.scala:2647)
  at scala.Option.foreach(Option.scala:437)
  at org.apache.spark.SparkContext$.assertNoOtherContextIsRunning(SparkContext.scala:2644)
  at org.apache.spark.SparkContext$.markPartiallyConstructed(SparkContext.scala:2734)
  at org.apache.spark.SparkContext.<init>(SparkContext.scala:95)
  at org.apache.spark.SparkContextSuite.$anonfun$new$159(SparkContextSuite.scala:1166)
  at org.scalatest.OutcomeOf.outcomeOf(OutcomeOf.scala:85)
  at org.scalatest.OutcomeOf.outcomeOf$(OutcomeOf.scala:83)
  at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
  at org.scalatest.Transformer.apply(Transformer.scala:22)

```

**After**

```
Run completed in 26 minutes, 38 seconds.
Total number of tests run: 2863
Suites: completed 276, aborted 0
Tests: succeeded 2863, failed 0, canceled 4, ignored 8, pending 0
All tests passed.
```

Closes #33963 from LuciferYang/SPARK-36636.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-09-12 09:57:06 -05:00
attilapiros ba81b92402 [SPARK-36719][CORE] Supporting Netty Logging at the network layer
### What changes were proposed in this pull request?

Supporting Netty level logging at the network layer.

To configure Netty-level logging, a log handler must be added to the channel pipeline.
In this PR I have introduced a new class `NettyLogger` which constructs a log handler depending on the log level (see the sketch after this list):
- in case of `log4j.logger.org.apache.spark.network.util.NettyLogger=DEBUG`: a custom log handler is created which does not dump the message contents. This way the log is a bit more compact. Moreover, when network-level encryption is switched on, this level might be sufficient.
- in case of `log4j.logger.org.apache.spark.network.util.NettyLogger=TRACE`: Netty's own log handler is used, which dumps the message contents.
- otherwise (when the logger is neither TRACE nor DEBUG), the pipeline does not contain a log handler (there is no runtime penalty for the default setting, but a long-running app/service must be restarted with the new log level for it to take effect).
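
A minimal sketch of that wiring (not the actual `NettyLogger` class; `traceEnabled`/`debugEnabled` stand in for the logger-level checks):

```
import io.netty.channel.Channel
import io.netty.handler.logging.{LogLevel, LoggingHandler}

// Hedged sketch: add a log handler to the pipeline only when the configured level
// asks for it, so the default configuration adds nothing to the pipeline.
def maybeAddLogHandler(ch: Channel, traceEnabled: Boolean, debugEnabled: Boolean): Unit = {
  if (traceEnabled) {
    // Netty's own handler at TRACE dumps the message contents (hex dump).
    ch.pipeline().addFirst("loggingHandler", new LoggingHandler(LogLevel.TRACE))
  } else if (debugEnabled) {
    // At DEBUG a more compact handler (e.g. one that skips the dump) would be used;
    // a plain LoggingHandler is shown here only as a placeholder.
    ch.pipeline().addFirst("loggingHandler", new LoggingHandler(LogLevel.DEBUG))
  }
  // Otherwise: no handler, no runtime penalty.
}
```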

### Why are the changes needed?

This level of logging proved to be sufficient while debugging an external-shuffle-related problem.
Compared with tcpdump, these log lines can be more easily correlated with Spark-internal calls.
Moreover, the log layout can be configured to contain the thread names, so that in case of a timeout a busy thread can be identified.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Manually.

#### DEBUG level

```
╭─attilazsoltpirosapiros-MBP16 ~/git/attilapiros/spark ‹SPARK-36719*›
╰─$ tail -1 ./conf/log4j.properties
log4j.logger.org.apache.spark.network.util.NettyLogger=DEBUG
╭─attilazsoltpirosapiros-MBP16 ~/git/attilapiros/spark ‹SPARK-36719*›
╰─$ ./bin/spark-submit --class org.apache.spark.examples.JavaWordCount --master local\[8\]  ./examples/target/original-spark-examples_2.12-3.3.0-SNAPSHOT.jar README.md 2> >(grep NettyLogger) 1> /dev/null
21/09/10 15:24:35 DEBUG NettyLogger: [id: 0xb9d94fcf] REGISTERED
21/09/10 15:24:35 DEBUG NettyLogger: [id: 0xb9d94fcf] CONNECT: /172.30.64.219:61014
21/09/10 15:24:35 DEBUG NettyLogger: [id: 0xb9d94fcf, L:/172.30.64.219:61015 - R:/172.30.64.219:61014] ACTIVE
21/09/10 15:24:35 DEBUG NettyLogger: [id: 0x28101520, L:/172.30.64.219:61014 - R:/172.30.64.219:61015] REGISTERED
21/09/10 15:24:35 DEBUG NettyLogger: [id: 0x28101520, L:/172.30.64.219:61014 - R:/172.30.64.219:61015] ACTIVE
21/09/10 15:24:35 DEBUG NettyLogger: [id: 0xb9d94fcf, L:/172.30.64.219:61015 - R:/172.30.64.219:61014] WRITE 66B
21/09/10 15:24:35 DEBUG NettyLogger: [id: 0xb9d94fcf, L:/172.30.64.219:61015 - R:/172.30.64.219:61014] FLUSH
21/09/10 15:24:35 DEBUG NettyLogger: [id: 0x28101520, L:/172.30.64.219:61014 - R:/172.30.64.219:61015] READ 66B
21/09/10 15:24:35 DEBUG NettyLogger: [id: 0x28101520, L:/172.30.64.219:61014 - R:/172.30.64.219:61015] WRITE: MessageWithHeader [headerLength: 74, bodyLength: 1552705]
21/09/10 15:24:35 DEBUG NettyLogger: [id: 0x28101520, L:/172.30.64.219:61014 - R:/172.30.64.219:61015] FLUSH
21/09/10 15:24:35 DEBUG NettyLogger: [id: 0xb9d94fcf, L:/172.30.64.219:61015 - R:/172.30.64.219:61014] READ 74B
21/09/10 15:24:35 DEBUG NettyLogger: [id: 0x28101520, L:/172.30.64.219:61014 - R:/172.30.64.219:61015] READ COMPLETE
21/09/10 15:24:35 DEBUG NettyLogger: [id: 0xb9d94fcf, L:/172.30.64.219:61015 - R:/172.30.64.219:61014] READ COMPLETE
21/09/10 15:24:35 DEBUG NettyLogger: [id: 0xb9d94fcf, L:/172.30.64.219:61015 - R:/172.30.64.219:61014] READ 2048B
21/09/10 15:24:35 DEBUG NettyLogger: [id: 0xb9d94fcf, L:/172.30.64.219:61015 - R:/172.30.64.219:61014] READ 32768B
21/09/10 15:24:35 DEBUG NettyLogger: [id: 0xb9d94fcf, L:/172.30.64.219:61015 - R:/172.30.64.219:61014] READ 65536B
21/09/10 15:24:35 DEBUG NettyLogger: [id: 0xb9d94fcf, L:/172.30.64.219:61015 - R:/172.30.64.219:61014] READ 65536B
21/09/10 15:24:35 DEBUG NettyLogger: [id: 0xb9d94fcf, L:/172.30.64.219:61015 - R:/172.30.64.219:61014] READ 65536B
21/09/10 15:24:35 DEBUG NettyLogger: [id: 0xb9d94fcf, L:/172.30.64.219:61015 - R:/172.30.64.219:61014] READ 65536B
21/09/10 15:24:35 DEBUG NettyLogger: [id: 0xb9d94fcf, L:/172.30.64.219:61015 - R:/172.30.64.219:61014] READ 65536B
21/09/10 15:24:35 DEBUG NettyLogger: [id: 0xb9d94fcf, L:/172.30.64.219:61015 - R:/172.30.64.219:61014] READ 65536B
21/09/10 15:24:35 DEBUG NettyLogger: [id: 0xb9d94fcf, L:/172.30.64.219:61015 - R:/172.30.64.219:61014] READ 65536B
21/09/10 15:24:35 DEBUG NettyLogger: [id: 0xb9d94fcf, L:/172.30.64.219:61015 - R:/172.30.64.219:61014] READ 65536B
21/09/10 15:24:35 DEBUG NettyLogger: [id: 0xb9d94fcf, L:/172.30.64.219:61015 - R:/172.30.64.219:61014] READ 65536B
21/09/10 15:24:35 DEBUG NettyLogger: [id: 0xb9d94fcf, L:/172.30.64.219:61015 - R:/172.30.64.219:61014] READ 65536B
21/09/10 15:24:35 DEBUG NettyLogger: [id: 0xb9d94fcf, L:/172.30.64.219:61015 - R:/172.30.64.219:61014] READ 65536B
21/09/10 15:24:35 DEBUG NettyLogger: [id: 0xb9d94fcf, L:/172.30.64.219:61015 - R:/172.30.64.219:61014] READ 65536B
21/09/10 15:24:35 DEBUG NettyLogger: [id: 0xb9d94fcf, L:/172.30.64.219:61015 - R:/172.30.64.219:61014] READ 65536B
21/09/10 15:24:35 DEBUG NettyLogger: [id: 0xb9d94fcf, L:/172.30.64.219:61015 - R:/172.30.64.219:61014] READ 65536B
21/09/10 15:24:35 DEBUG NettyLogger: [id: 0xb9d94fcf, L:/172.30.64.219:61015 - R:/172.30.64.219:61014] READ COMPLETE
21/09/10 15:24:35 DEBUG NettyLogger: [id: 0xb9d94fcf, L:/172.30.64.219:61015 - R:/172.30.64.219:61014] READ 65536B
21/09/10 15:24:35 DEBUG NettyLogger: [id: 0xb9d94fcf, L:/172.30.64.219:61015 - R:/172.30.64.219:61014] READ 65536B
21/09/10 15:24:35 DEBUG NettyLogger: [id: 0xb9d94fcf, L:/172.30.64.219:61015 - R:/172.30.64.219:61014] READ 65536B
21/09/10 15:24:35 DEBUG NettyLogger: [id: 0xb9d94fcf, L:/172.30.64.219:61015 - R:/172.30.64.219:61014] READ 65536B
21/09/10 15:24:35 DEBUG NettyLogger: [id: 0xb9d94fcf, L:/172.30.64.219:61015 - R:/172.30.64.219:61014] READ 65536B
21/09/10 15:24:35 DEBUG NettyLogger: [id: 0xb9d94fcf, L:/172.30.64.219:61015 - R:/172.30.64.219:61014] READ 65536B
21/09/10 15:24:35 DEBUG NettyLogger: [id: 0xb9d94fcf, L:/172.30.64.219:61015 - R:/172.30.64.219:61014] READ 65536B
21/09/10 15:24:35 DEBUG NettyLogger: [id: 0xb9d94fcf, L:/172.30.64.219:61015 - R:/172.30.64.219:61014] READ 65536B
21/09/10 15:24:35 DEBUG NettyLogger: [id: 0xb9d94fcf, L:/172.30.64.219:61015 - R:/172.30.64.219:61014] READ 65536B
21/09/10 15:24:35 DEBUG NettyLogger: [id: 0xb9d94fcf, L:/172.30.64.219:61015 - R:/172.30.64.219:61014] READ 10561B
21/09/10 15:24:35 DEBUG NettyLogger: [id: 0xb9d94fcf, L:/172.30.64.219:61015 - R:/172.30.64.219:61014] READ COMPLETE
21/09/10 15:24:40 DEBUG NettyLogger: [id: 0x28101520, L:/172.30.64.219:61014 ! R:/172.30.64.219:61015] INACTIVE
21/09/10 15:24:40 DEBUG NettyLogger: [id: 0xb9d94fcf, L:/172.30.64.219:61015 - R:/172.30.64.219:61014] READ COMPLETE
21/09/10 15:24:40 DEBUG NettyLogger: [id: 0xb9d94fcf, L:/172.30.64.219:61015 ! R:/172.30.64.219:61014] INACTIVE
21/09/10 15:24:40 DEBUG NettyLogger: [id: 0xb9d94fcf, L:/172.30.64.219:61015 ! R:/172.30.64.219:61014] UNREGISTERED
21/09/10 15:24:40 DEBUG NettyLogger: [id: 0x28101520, L:/172.30.64.219:61014 ! R:/172.30.64.219:61015] UNREGISTERED
```

#### TRACE level

```
╭─attilazsoltpirosapiros-MBP16 ~/git/attilapiros/spark ‹SPARK-36719*›
╰─$ tail -1 ./conf/log4j.properties
log4j.logger.org.apache.spark.network.util.NettyLogger=TRACE
╭─attilazsoltpirosapiros-MBP16 ~/git/attilapiros/spark ‹SPARK-36719*›
╰─$ ./bin/spark-submit --class org.apache.spark.examples.JavaWordCount --master local\[8\]  ./examples/target/original-spark-examples_2.12-3.3.0-SNAPSHOT.jar README.md  1> /dev/null 2>&1
...
21/09/10 15:29:14 TRACE NettyLogger: [id: 0xf1d25786] REGISTERED
21/09/10 15:29:14 TRACE NettyLogger: [id: 0xf1d25786] CONNECT: /172.30.64.219:61044
21/09/10 15:29:14 TRACE NettyLogger: [id: 0xf1d25786, L:/172.30.64.219:61045 - R:/172.30.64.219:61044] ACTIVE
21/09/10 15:29:14 INFO TransportClientFactory: Successfully created connection to /172.30.64.219:61044 after 37 ms (0 ms spent in bootstraps)
21/09/10 15:29:14 TRACE NettyLogger: [id: 0x362fc693, L:/172.30.64.219:61044 - R:/172.30.64.219:61045] REGISTERED
21/09/10 15:29:14 TRACE NettyLogger: [id: 0x362fc693, L:/172.30.64.219:61044 - R:/172.30.64.219:61045] ACTIVE
21/09/10 15:29:14 INFO Utils: Fetching spark://172.30.64.219:61044/jars/original-spark-examples_2.12-3.3.0-SNAPSHOT.jar to /private/var/folders/t_/fr_vqcyx23vftk81ftz1k5hw0000gn/T/spark-91e059f5-1e29-4727-8602-f81206bbe48b/userFiles-50b48490-8950-4c46-b3d3-61a2c85412a3/fetchFileTemp8803030587223485061.tmp
21/09/10 15:29:14 TRACE NettyLogger: [id: 0xf1d25786, L:/172.30.64.219:61045 - R:/172.30.64.219:61044] WRITE: 66B
         +-------------------------------------------------+
         |  0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f |
+--------+-------------------------------------------------+----------------+
|00000000| 00 00 00 00 00 00 00 42 06 00 00 00 35 2f 6a 61 |.......B....5/ja|
|00000010| 72 73 2f 6f 72 69 67 69 6e 61 6c 2d 73 70 61 72 |rs/original-spar|
|00000020| 6b 2d 65 78 61 6d 70 6c 65 73 5f 32 2e 31 32 2d |k-examples_2.12-|
|00000030| 33 2e 33 2e 30 2d 53 4e 41 50 53 48 4f 54 2e 6a |3.3.0-SNAPSHOT.j|
|00000040| 61 72                                           |ar              |
+--------+-------------------------------------------------+----------------+
21/09/10 15:29:14 TRACE NettyLogger: [id: 0xf1d25786, L:/172.30.64.219:61045 - R:/172.30.64.219:61044] FLUSH
21/09/10 15:29:14 TRACE NettyLogger: [id: 0x362fc693, L:/172.30.64.219:61044 - R:/172.30.64.219:61045] READ: 66B
         +-------------------------------------------------+
         |  0  1  2  3  4  5  6  7  8  9  a  b  c  d  e  f |
+--------+-------------------------------------------------+----------------+
|00000000| 00 00 00 00 00 00 00 42 06 00 00 00 35 2f 6a 61 |.......B....5/ja|
|00000010| 72 73 2f 6f 72 69 67 69 6e 61 6c 2d 73 70 61 72 |rs/original-spar|
|00000020| 6b 2d 65 78 61 6d 70 6c 65 73 5f 32 2e 31 32 2d |k-examples_2.12-|
|00000030| 33 2e 33 2e 30 2d 53 4e 41 50 53 48 4f 54 2e 6a |3.3.0-SNAPSHOT.j|
|00000040| 61 72                                           |ar              |
+--------+-------------------------------------------------+----------------+
21/09/10 15:29:14 TRACE NettyLogger: [id: 0x362fc693, L:/172.30.64.219:61044 - R:/172.30.64.219:61045] WRITE: MessageWithHeader [headerLength: 74, bodyLength: 1552705]
21/09/10 15:29:14 TRACE NettyLogger: [id: 0x362fc693, L:/172.30.64.219:61044 - R:/172.30.64.219:61045] FLUSH
21/09/10 15:29:14 TRACE NettyLogger: [id: 0xf1d25786, L:/172.30.64.219:61045 - R:/172.30.64.219:61044] READ: 74B
...
```

Closes #33962 from attilapiros/SPARK-36719.

Authored-by: attilapiros <piros.attila.zsolt@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-09-11 16:14:02 -07:00
dgd_contributor ebca01f03e [SPARK-35822][UI] Spark UI-Executor tab is empty in IE11
### What changes were proposed in this pull request?
Refactor some functions in utils.js to fix the empty UI-Executor tab in yarn mode in IE11.

### Why are the changes needed?
The Spark UI Executor tab is empty in IE11, so this PR fixes that.
![Executortab_IE](https://user-images.githubusercontent.com/84778052/132786964-b17b6d12-457f-4ba3-894f-3f2e1c285b1e.PNG)

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Existed UT Testcase

Closes #33937 from dgd-contributor/SPARK-35822-v2.

Authored-by: dgd_contributor <dgd_contributor@viettel.com.vn>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-09-11 15:58:31 -07:00
Kousuke Saruta c36d70836d [SPARK-36725][SQL][TESTS] Ensure HiveThriftServer2Suites to stop Thrift JDBC server on exit
### What changes were proposed in this pull request?

This PR aims to ensure that HiveThriftServer2Suites (e.g. `thriftserver.UISeleniumSuite`) stop the Thrift JDBC server on exit using a shutdown hook.
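
A minimal sketch of the pattern (not the suites' exact code; `stopThriftServer` is a hypothetical stand-in for whatever stops the started server):

```
// `stopThriftServer` is a hypothetical stand-in for the suite's cleanup logic.
def stopThriftServer(): Unit = { /* stop the started HiveThriftServer2 here */ }

// Register a JVM shutdown hook so the Thrift JDBC server is stopped even when the
// test JVM is terminated by a signal (e.g. Ctrl-C) and afterAll never runs.
sys.addShutdownHook {
  stopThriftServer()
}
```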

### Why are the changes needed?

Normally, HiveThriftServer2Suites stop the Thrift JDBC server via the `afterAll` method.
But if they are killed by a signal (e.g. Ctrl-C), the Thrift JDBC server will remain.
```
$ jps
2792969 SparkSubmit
```
### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Killed `thriftserver.UISeleniumSuite` with Ctrl-C and confirmed via jps that no Thrift JDBC server remains.

Closes #33967 from sarutak/stop-thrift-on-exit.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-09-11 15:54:35 -07:00
dgd-contributor 9af0132516 [SPARK-36685][ML][MLLIB] Fix wrong assert messages
### What changes were proposed in this pull request?
Fix wrong assert statements, a mistake made while coding.

### Why are the changes needed?
Wrong assert statements.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
 Existing tests

Closes #33953 from dgd-contributor/SPARK-36685.

Authored-by: dgd-contributor <dgd_contributor@viettel.com.vn>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2021-09-11 14:39:42 -07:00
Sean Owen e5283f5ed5 [SPARK-36704][CORE] Expand exception handling to more Java 9 cases where reflection is limited at runtime, when reflecting to manage DirectByteBuffer settings
### What changes were proposed in this pull request?

Improve exception handling in the Platform initialization, where it attempts to assess whether reflection can be used to modify DirectByteBuffer. This can apparently fail in more cases on Java 9+ than are currently handled, whereas Spark can continue without reflection if needed.

More detailed comments on the change inline.
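
A minimal sketch of the pattern (not the actual `Platform` code): attempt the reflective access and fall back instead of failing when the module system refuses it:

```
import scala.util.control.NonFatal

// Probe whether the private DirectByteBuffer(long, int) constructor can be made
// accessible; on Java 9+ setAccessible may throw InaccessibleObjectException,
// in which case we continue without the reflective fast path.
val directBufferCtor: Option[java.lang.reflect.Constructor[_]] =
  try {
    val cls = Class.forName("java.nio.DirectByteBuffer")
    val ctor = cls.getDeclaredConstructor(classOf[Long], classOf[Int])
    ctor.setAccessible(true)
    Some(ctor)
  } catch {
    case NonFatal(_) => None
  }
```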

### Why are the changes needed?

This exception seems to be possible and fails startup:

```
Caused by: java.lang.reflect.InaccessibleObjectException: Unable to make private java.nio.DirectByteBuffer(long,int) accessible: module java.base does not "opens java.nio" to unnamed module 71e9ddb4
        at java.base/java.lang.reflect.AccessibleObject.checkCanSetAccessible(AccessibleObject.java:357)
        at java.base/java.lang.reflect.AccessibleObject.checkCanSetAccessible(AccessibleObject.java:297)
        at java.base/java.lang.reflect.Constructor.checkCanSetAccessible(Constructor.java:188)
        at java.base/java.lang.reflect.Constructor.setAccessible(Constructor.java:181)
        at org.apache.spark.unsafe.Platform.<clinit>(Platform.java:56)
```

### Does this PR introduce _any_ user-facing change?

Should strictly allow Spark to continue in more cases.

### How was this patch tested?

Existing tests.

Closes #33947 from srowen/SPARK-36704.

Authored-by: Sean Owen <srowen@gmail.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2021-09-11 13:38:10 -05:00
Huaxin Gao 1f679ed8e9 [SPARK-36556][SQL] Add DSV2 filters
Co-Authored-By: DB Tsai <d_tsai@apple.com>
Co-Authored-By: Huaxin Gao <huaxin_gao@apple.com>

### What changes were proposed in this pull request?
Add DSV2 Filters and use these in V2 codepath.

### Why are the changes needed?
The motivation of adding DSV2 filters:
1. The values in V1 filters are Scala types. When translating a Catalyst `Expression` to V1 filters, we have to call `convertToScala` to convert from the Catalyst types used internally in rows to standard Scala types, and later convert the Scala types back to Catalyst types. This is very inefficient. In V2 filters, we use `Expression` for filter values, so the conversions from Catalyst types to Scala types and back are avoided.
2. Improve nested column filter support.
3. Make the filters work better with the rest of the DSV2 APIs.

### Does this PR introduce _any_ user-facing change?
Yes, the new V2 filters are added.

### How was this patch tested?
new test

Closes #33803 from huaxingao/filter.

Lead-authored-by: Huaxin Gao <huaxin_gao@apple.com>
Co-authored-by: DB Tsai <d_tsai@apple.com>
Signed-off-by: Liang-Chi Hsieh <viirya@gmail.com>
2021-09-11 10:12:21 -07:00