Commit graph

27883 commits

Jatin Puri 1fd54f4bf5 [SPARK-32662][ML] CountVectorizerModel: Remove requirement for minimum Vocab size
### What changes were proposed in this pull request?

The strict requirement for the vocabulary to remain non-empty has been removed in this pull request.

Link to the discussion: http://apache-spark-user-list.1001560.n3.nabble.com/Ability-to-have-CountVectorizerModel-vocab-as-empty-td38396.html

### Why are the changes needed?

This smooths handling of corner cases. Without this change, the user has to manipulate the data to work around what may be a perfectly valid use case.

Question: Should we log a warning when an empty vocabulary is found instead?
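
For illustration, a minimal sketch of the corner case this change allows (the data and session setup below are hypothetical, not taken from the PR):

```scala
import org.apache.spark.ml.feature.CountVectorizer
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[1]").appName("empty-vocab").getOrCreate()
import spark.implicits._

// No tokens at all, so the fitted vocabulary ends up empty.
val df = Seq((0, Array.empty[String]), (1, Array.empty[String])).toDF("id", "words")

val cvModel = new CountVectorizer()
  .setInputCol("words")
  .setOutputCol("features")
  .fit(df)                    // previously failed the non-empty vocabulary requirement

cvModel.transform(df).show()  // zero-length vectors instead of an exception
```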

### Does this PR introduce _any_ user-facing change?

Possibly a slight change: if someone relied on a try-catch to detect an empty vocabulary, that behavior will no longer hold.

### How was this patch tested?

1. Added a test case where `fit` generates an empty vocabulary
2. Added a test case for `transform` with an empty vocabulary

Request to review: srowen hhbyyh

Closes #29482 from purijatin/spark_32662.

Authored-by: Jatin Puri <purijatin@gmail.com>
Signed-off-by: Huaxin Gao <huaxing@us.ibm.com>
2020-08-21 16:14:29 -07:00
Brandon Jiang 1450b5e095 [MINOR][DOCS] Fix typos in docs, log messages and comments
### What changes were proposed in this pull request?
Fix typo for docs, log messages and comments

### Why are the changes needed?
typo fix to increase readability

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Manual testing has been performed to verify the updated text.

Closes #29443 from brandonJY/spell-fix-doc.

Authored-by: Brandon Jiang <Brandon.jiang.a@outlook.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-08-22 06:45:35 +09:00
Wenchen Fan 3dca81e4f5 [SPARK-32669][SQL][TEST] Expression unit tests should explore all cases that can lead to null result
### What changes were proposed in this pull request?

Add document to `ExpressionEvalHelper`, and ask people to explore all the cases that can lead to null results (including null in struct fields, array elements and map values).

This PR also fixes `ComplexTypeSuite.GetArrayStructFields` to explore all the null cases.

### Why are the changes needed?

It happened several times that we hit correctness bugs caused by wrong expression nullability. When writing unit tests, we usually don't test the nullability flag directly, and it's too late to add such tests for all expressions.

In https://github.com/apache/spark/pull/22375, we extended the expression test framework, which checks the nullability flag when the expected result/field/element is null.

This requires the test cases to explore all the cases that can lead to null results.
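
For example, a test exploring the null cases for a struct field extraction might look like the sketch below (the helper comes from `ExpressionEvalHelper`; the concrete expression and data here are only illustrative):

```scala
import org.apache.spark.sql.Row
import org.apache.spark.sql.catalyst.expressions.{GetStructField, Literal}
import org.apache.spark.sql.types.{IntegerType, StructType}

// Inside a suite mixing in ExpressionEvalHelper:
val structType = new StructType().add("a", IntegerType, nullable = true)

// Case 1: the whole input struct is null.
checkEvaluation(GetStructField(Literal.create(null, structType), 0), null)

// Case 2: the struct is non-null but the extracted field is null,
// which also exercises the expression's nullability flag check.
checkEvaluation(GetStructField(Literal.create(Row(null), structType), 0), null)
```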

### Does this PR introduce _any_ user-facing change?

no

### How was this patch tested?

I reverted 5d296ed39e locally, and `ComplexTypeSuite` can catch the bug.

Closes #29493 from cloud-fan/small.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-08-22 06:23:46 +09:00
Takeshi Yamamuro 6dd37cbaac [SPARK-32682][INFRA] Use workflow_dispatch to enable manual test triggers
### What changes were proposed in this pull request?

This PR proposes to add a `workflow_dispatch` entry in the GitHub Action script (`build_and_test.yml`). This update enables developers to run the Spark tests for a specific branch on their own local repository, so I think it might help to check if all the tests can pass before opening a new PR.

<img width="944" alt="Screen Shot 2020-08-21 at 16 28 41" src="https://user-images.githubusercontent.com/692303/90866249-96250c80-e3ce-11ea-8496-3dd6683e92ea.png">

### Why are the changes needed?

To reduce the pressure of GitHub Actions on the Spark repository.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manually checked.

Closes #29504 from maropu/DispatchTest.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-08-21 21:23:41 +09:00
Liang-Chi Hsieh e277ef1a83 [SPARK-32646][SQL] ORC predicate pushdown should work with case-insensitive analysis
### What changes were proposed in this pull request?

This PR proposes to fix ORC predicate pushdown under case-insensitive analysis case. The field names in pushed down predicates don't need to match in exact letter case with physical field names in ORC files, if we enable case-insensitive analysis.

### Why are the changes needed?

Currently ORC predicate pushdown doesn't work with case-insensitive analysis. A predicate "a < 0" cannot be pushed down to an ORC file with field name "A" under case-insensitive analysis.

But Parquet predicate pushdown works in this case. We should make ORC predicate pushdown work with case-insensitive analysis too.
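
A hedged end-to-end sketch of the behavior this PR enables (paths and data are illustrative; assumes a `spark` session as in spark-shell):

```scala
// Case-insensitive analysis is the default.
spark.conf.set("spark.sql.caseSensitive", "false")

// Write an ORC file whose physical field name is upper-case "A".
spark.range(10).toDF("A").write.mode("overwrite").orc("/tmp/orc_case_test")

// Filter with a lower-case column name; after this change the predicate
// should appear in PushedFilters for the ORC scan.
val df = spark.read.orc("/tmp/orc_case_test").filter("a < 5")
df.explain()
```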

### Does this PR introduce _any_ user-facing change?

Yes, after this PR, under case-insensitive analysis, ORC predicate pushdown will work.

### How was this patch tested?

Unit tests.

Closes #29457 from viirya/fix-orc-pushdown.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-08-21 07:57:24 +00:00
Chao Sun bf221debd0 [SPARK-32674][DOC] Add suggestion for parallel directory listing in tuning doc
### What changes were proposed in this pull request?

This adds some tuning guide for increasing parallelism of directory listing.

### Why are the changes needed?

Sometimes, when a job's input has a large number of directories, directory listing can become a bottleneck. There are a few parameters to tune this. This adds some info to the Spark tuning guide so the knowledge is better shared.
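
As a hedged illustration of the kind of knobs involved (the specific parameters and values below are assumptions for illustration, not the doc text added by this PR):

```scala
// For file-based SQL data sources: once the number of input paths exceeds the
// threshold, file listing is done with a distributed Spark job, bounded by the
// configured parallelism.
spark.conf.set("spark.sql.sources.parallelPartitionDiscovery.threshold", "32")
spark.conf.set("spark.sql.sources.parallelPartitionDiscovery.parallelism", "100")
```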

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

N/A

Closes #29498 from sunchao/SPARK-32674.

Authored-by: Chao Sun <sunchao@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-08-21 16:48:54 +09:00
angerszhu c75a82794f [SPARK-32667][SQL] Script transform 'default-serde' mode should pad null value to filling column
### What changes were proposed in this pull request?
In Hive's no-serde mode, when the script outputs fewer columns than the specified output schema, Hive pads the missing columns with null values; Spark should do this as well.
```
hive> SELECT TRANSFORM(a, b)
    >   ROW FORMAT DELIMITED
    >   FIELDS TERMINATED BY '|'
    >   LINES TERMINATED BY '\n'
    >   NULL DEFINED AS 'NULL'
    > USING 'cat' as (a string, b string, c string, d string)
    >   ROW FORMAT DELIMITED
    >   FIELDS TERMINATED BY '|'
    >   LINES TERMINATED BY '\n'
    >   NULL DEFINED AS 'NULL'
    > FROM (
    > select 1 as a, 2 as b
    > ) tmp ;
OK
1	2	NULL	NULL
Time taken: 24.626 seconds, Fetched: 1 row(s)
```

### Why are the changes needed?
Keep the same behavior as Hive.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added UT

Closes #29500 from AngersZhuuuu/SPARK-32667.

Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-08-21 07:37:11 +00:00
“attilapiros” 79b4dea1b0 [SPARK-32663][CORE] Avoid individual closing of pooled TransportClients (which must be closed through the pool)
### What changes were proposed in this pull request?

Removing the individual `close` method calls on the pooled `TransportClient` instances.
The pooled clients should be only closed via `TransportClientFactory#close()`.
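
A hedged sketch of the pooling contract this enforces (the factory setup is omitted; host and port are placeholders):

```scala
import org.apache.spark.network.client.{TransportClient, TransportClientFactory}

def useClient(clientFactory: TransportClientFactory, host: String, port: Int): Unit = {
  val client: TransportClient = clientFactory.createClient(host, port)
  // ... send RPCs / fetch chunks with `client` ...
  // Do NOT call client.close() here: the client is pooled and may be handed out again,
  // and reusing a closed client fails with java.nio.channels.ClosedChannelException.
}

// Close pooled clients only through the factory, typically at shutdown:
// clientFactory.close()
```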

### Why are the changes needed?

Reusing a closed `TransportClient` leads to the exception `java.nio.channels.ClosedChannelException`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

This is a trivial change which is not covered by a specific test.

Closes #29492 from attilapiros/SPARK-32663.

Authored-by: “attilapiros” <piros.attila.zsolt@gmail.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
2020-08-21 01:02:33 -05:00
Gengliang Wang de141a3271 [SPARK-32660][SQL][DOC] Show Avro related API in documentation
### What changes were proposed in this pull request?

Currently, the Avro related APIs are missing in the documentation https://spark.apache.org/docs/latest/api/scala/org/apache/spark/index.html . This PR is to:
1. Mark internal Avro related classes as private
2. Show Avro related API in Spark official API documentation

### Why are the changes needed?

Better documentation.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Build doc and preview:
![image](https://user-images.githubusercontent.com/1097932/90623042-d156ee00-e1ca-11ea-9edd-2c45b3001fd8.png)

![image](https://user-images.githubusercontent.com/1097932/90623047-d451de80-e1ca-11ea-94ba-02921b64d6f1.png)

![image](https://user-images.githubusercontent.com/1097932/90623058-d6b43880-e1ca-11ea-849a-b9ea9efe6527.png)

Closes #29476 from gengliangwang/avroAPIDoc.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
2020-08-21 13:12:43 +08:00
Wenchen Fan 8b119f1663 [SPARK-32640][SQL] Downgrade Janino to fix a correctness bug
### What changes were proposed in this pull request?

This PR reverts https://github.com/apache/spark/pull/27860 to downgrade Janino, as the new version has a bug.

### Why are the changes needed?

The symptom is about NaN comparison. For code below
```
if (double_value <= 0.0) {
  ...
} else {
  ...
}
```

If `double_value` is NaN, `NaN <= 0.0` is false and we should go to the else branch. However, current Spark goes to the if branch and causes correctness issues like SPARK-32640.

One way to fix it is:
```
boolean cond = double_value <= 0.0;
if (cond) {
  ...
} else {
  ...
}
```

I'm not familiar with Janino so I don't know what's going on there.

### Does this PR introduce _any_ user-facing change?

Yes, fix correctness bugs.

### How was this patch tested?

a new test

Closes #29495 from cloud-fan/revert.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-08-20 13:26:39 -07:00
Wenchen Fan d378dc5f6d [SPARK-28863][SQL][FOLLOWUP] Do not reuse the physical plan
### What changes were proposed in this pull request?

This is a followup of https://github.com/apache/spark/pull/29469

Instead of passing the physical plan to the fallbacked v1 source directly and skipping analysis, optimization, planning altogether, this PR proposes to pass the optimized plan.

### Why are the changes needed?

It's a bit risky to pass the physical plan directly. When the fallbacked v1 source applies more operations to the input DataFrame, it will re-apply the post-planning physical rules like `CollapseCodegenStages`, `InsertAdaptiveSparkPlan`, etc., which is very tricky.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

existing test suite with some new tests

Closes #29489 from cloud-fan/follow.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-08-20 15:23:25 +00:00
Takeshi Yamamuro d80d0ced9a [SPARK-32665][SQL][TEST] Deletes orphan directories under a warehouse dir in SQLQueryTestSuite
### What changes were proposed in this pull request?

If a previous `SQLQueryTestSuite` run is killed, the next run fails for the following reason:
```
[info] org.apache.spark.sql.SQLQueryTestSuite *** ABORTED *** (17 seconds, 483 milliseconds)
[info]   org.apache.spark.sql.AnalysisException: Can not create the managed table('`testdata`'). The associated location('file:/Users/maropu/Repositories/spark/spark-master/sql/core/spark-warehouse/org.apache.spark.sql.SQLQueryTestSuite/testdata') already exists.;
[info]   at org.apache.spark.sql.catalyst.catalog.SessionCatalog.validateTableLocation(SessionCatalog.scala:355)
[info]   at org.apache.spark.sql.execution.command.CreateDataSourceTableAsSelectCommand.run(createDataSourceTables.scala:170)
[info]   at org.apache.spark.sql.execution.command.DataWritingCommandExec.sideEffectResult$lzycompute(commands.scala:108)
```
This PR intends to add code that deletes orphan directories under a warehouse dir in `SQLQueryTestSuite` before creating test tables.
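
A hedged sketch of the cleanup idea (the warehouse path below is illustrative):

```scala
import java.io.File
import org.apache.commons.io.FileUtils

// Before creating the test tables, drop any table directories left behind by a killed run.
val warehouse = new File("sql/core/spark-warehouse/org.apache.spark.sql.SQLQueryTestSuite")
if (warehouse.isDirectory) {
  warehouse.listFiles().foreach(FileUtils.deleteQuietly)
}
```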

### Why are the changes needed?

To improve test convenience

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Manually checked

Closes #29488 from maropu/DeleteDirs.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-08-20 06:12:05 -07:00
angerszhu 6dae11d034 [SPARK-32607][SQL] Script Transformation ROW FORMAT DELIMITED TOK_TABLEROWFORMATLINES only support '\n'
### What changes were proposed in this pull request?
Script transform no-serde (`ROW FORMAT DELIMITED`) mode `LINES TERMINATED BY` only supports `\n`.

Tested in hive :
Hive 1.1
![image](https://user-images.githubusercontent.com/46485123/90309510-ce82a180-df1b-11ea-96ab-56e2b3229489.png)

Hive 2.3.7
![image](https://user-images.githubusercontent.com/46485123/90309504-c88cc080-df1b-11ea-853e-8f65e9ed2375.png)

### Why are the changes needed?
Strictly limit the usage to ensure data accuracy.

### Does this PR introduce _any_ user-facing change?
Yes. Using script transform no-serde (`ROW FORMAT DELIMITED`) mode with `LINES TERMINATED BY` set to anything other than `'\n'` will now throw an error.

### How was this patch tested?
Added UT

Closes #29438 from AngersZhuuuu/SPARK-32607.

Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-08-20 12:03:02 +00:00
yi.wu 44a288fc41 [SPARK-32653][CORE] Decommissioned host/executor should be considered as inactive in TaskSchedulerImpl
### What changes were proposed in this pull request?

Add a decommissioning status check when determining whether a host or executor is active. A decommissioned host or executor should be considered inactive.

### Why are the changes needed?

First of all, this PR is not a correctness bug fix, but it does give an improvement. The main problem we want to fix here is that a decommissioned host or executor should be considered inactive.

`TaskSetManager.computeValidLocalityLevels` depends on `TaskSchedulerImpl.isExecutorAlive/hasExecutorsAliveOnHost` to calculate the locality levels. Therefore, the `TaskSetManager` could also get corresponding locality levels of those decommissioned hosts or executors if they're not considered as inactive. However, on the other side,  `CoarseGrainedSchedulerBackend` won't construct the `WorkerOffer` for those decommissioned executors. That also means `TaskSetManager` might never have a chance to launch tasks at certain locality levels but only suffers the unnecessary delay because of delay scheduling. So, this PR helps to reduce this kind of unnecessary delay by making decommissioned host/executor inactive in `TaskSchedulerImpl`.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added unit tests

Closes #29468 from Ngone51/fix-decom-alive.

Lead-authored-by: yi.wu <yi.wu@databricks.com>
Co-authored-by: Devesh Agrawal <devesh.agrawal@gmail.com>
Co-authored-by: wuyi <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-08-20 12:00:32 +00:00
Jungtaek Lim (HeartSaVioR) e6795cd341 [SPARK-30462][SS] Streamline the logic on file stream source and sink metadata log to avoid memory issue
### What changes were proposed in this pull request?

Many operations on CompactibleFileStreamLog read a metadata log file and materialize all entries into memory. Due to the nature of the compact operation, CompactibleFileStreamLog may have a huge compact log file with a bunch of entries included, and for now they just grow monotonically, which means the amount of memory needed to materialize them also grows incrementally. This puts pressure on GC.

This patch proposes to streamline the logic on the file stream source and sink whenever possible to avoid the memory issue. To make this possible we have to break the existing behavior of excluding entries - now the `compactLogs` method is called with all entries, which forces us to materialize all entries into memory. This hopefully has no effect on end users, because only the file stream sink has a condition to exclude entries, and that condition has never been true. (DELETE_ACTION has never been set.)

Based on the observation, this patch also changes the existing UT a bit which simulates the situation where "A" file is added, and another batch marks the "A" file as deleted. This situation simply doesn't work with the change, but as I mentioned earlier it hasn't been used. (I'm not sure the UT is from the actual run. I guess not.)

### Why are the changes needed?

The memory issue (OOME) is reported by both JIRA issue and user mailing list.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

* Existing UTs
* Manual test done

The manual test leverages the simple apps which continuously writes the file stream sink metadata log.

bea7680e4c

The test is configured to have a batch metadata log file at 1.9M (10,000 entries) whereas other Spark configuration is set to the default. (compact interval = 10) The app runs as driver, and the heap memory on driver is set to 3g.

> before the patch

<img width="1094" alt="Screen Shot 2020-06-23 at 3 37 44 PM" src="https://user-images.githubusercontent.com/1317309/85375841-d94f3480-b571-11ea-817b-c6b48b34888a.png">

It only ran for 40 mins, with the latest compact batch file size as 1.3G. The process struggled with GC, and after some struggling, it threw OOME.

> after the patch

<img width="1094" alt="Screen Shot 2020-06-23 at 3 53 29 PM" src="https://user-images.githubusercontent.com/1317309/85375901-eff58b80-b571-11ea-837e-30d107f677f9.png">

It sustained a 2 hour run (manually stopped, as it was expected to keep running), with the latest compact batch file size at 2.2G. The actual memory usage didn't even go up to 1.2G and was cleaned up quickly without notable GC activity.

Closes #28904 from HeartSaVioR/SPARK-30462.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-08-20 02:26:38 -07:00
Xingbo Jiang f793977e9a [SPARK-32658][CORE] Fix PartitionWriterStream partition length overflow
### What changes were proposed in this pull request?

The `count` in `PartitionWriterStream` should be a long value instead of an int. The issue was introduced by apache/spark@abef84a. When the overflow happens, the shuffle index file records the wrong index of a reduceId, thus leading to a `FetchFailedException: Stream is corrupted` error.

Besides the fix, I also added some debug logs, so in the future it's easier to debug similar issues.
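
A minimal sketch of the underlying overflow (not the actual `PartitionWriterStream` code):

```scala
// Tracking bytes written with an Int wraps past ~2 GiB, which corrupts the
// per-partition lengths recorded in the shuffle index file.
val written: Int = Int.MaxValue          // ~2.1 GB already written to this partition
val wrapped = written + 1                // overflows to Int.MinValue (a negative length)
val correct: Long = written.toLong + 1L  // a Long counter keeps the true length
```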

### Why are the changes needed?

This is a regression and bug fix.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

A Spark user reported this issue when migrating their workload to 3.0. One of the jobs failed deterministically on Spark 3.0 without the patch, and succeeded after applying the fix.

Closes #29474 from jiangxb1987/fixPartitionWriteStream.

Authored-by: Xingbo Jiang <xingbo.jiang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-08-20 07:08:30 +00:00
ulysses 7048fff230 [SPARK-31999][SQL][FOLLOWUP] Adds negative test cases with typos
### What changes were proposed in this pull request?

Address the [#comment](https://github.com/apache/spark/pull/28840#discussion_r471172006).

### Why are the changes needed?

Make code robust.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

ut.

Closes #29453 from ulysses-you/SPARK-31999-FOLLOWUP.

Authored-by: ulysses <youxiduo@weidian.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-08-20 05:33:12 +00:00
Holden Karau 059fb6571e [SPARK-32657][K8S] Update the log strings we check for & imports in decommission K8s
### What changes were proposed in this pull request?

Update the log strings to match the new log messages.

### Why are the changes needed?

Tests are failing

### Does this PR introduce _any_ user-facing change?

No, test only change.

### How was this patch tested?
WIP: Make sure the DecommissionSuite passes in Jenkins.

Closes #29479 from holdenk/SPARK-32657-Decommissioning-tests-update-log-string.

Authored-by: Holden Karau <hkarau@apple.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-08-19 18:28:21 -07:00
Dongjoon Hyun 3722ed430d [SPARK-32655][K8S] Support appId/execId placeholder in K8s SPARK_EXECUTOR_DIRS
### What changes were proposed in this pull request?

This PR aims to support replacement of the `SPARK_APPLICATION_ID`/`SPARK_EXECUTOR_ID` placeholders in the `SPARK_EXECUTOR_DIRS` executor environment variable.

### Why are the changes needed?

This PR gives users additional control.

**HOW TO RUN**
```
bin/spark-submit --master k8s://https://kubernetes.docker.internal:6443 --deploy-mode cluster \
-c spark.kubernetes.container.image=spark:SPARK-32655 \
-c spark.kubernetes.driver.pod.name=pi \
-c spark.kubernetes.executor.podNamePrefix=pi \
-c spark.kubernetes.executor.volumes.nfs.data.mount.path=/efs \
-c spark.kubernetes.executor.volumes.nfs.data.mount.readOnly=false \
-c spark.kubernetes.executor.volumes.nfs.data.options.server=efs-server-ip \
-c spark.kubernetes.executor.volumes.nfs.data.options.path=/ \
-c spark.executorEnv.SPARK_EXECUTOR_DIRS=/efs/SPARK_APPLICATION_ID/SPARK_EXECUTOR_ID \
--class org.apache.spark.examples.SparkPi \
local:///opt/spark/examples/jars/spark-examples_2.12-3.1.0-SNAPSHOT.jar 20000
```

**EFS Layout**
```
/efs
├── spark-f45039b13b0b4fd4baf80fed561a2228
│   ├── 1
│   │   ├── blockmgr-bbe76578-8ff2-4c2d-ab4f-37671d886f56
│   │   │   ├── 0e
│   │   │   └── 11
│   │   └── spark-e41aeb41-00fc-49e1-a77d-093b6df5958a
│   │       ├── -18375678081597852666997_cache
│   │       └── -18375678081597852666997_lock
│   └── 2
│       ├── blockmgr-765bfb50-ab13-4b2b-9350-356fed0169e3
│       │   ├── 0e
│       │   └── 11
│       └── spark-737671fc-1697-4367-9daf-2b1575f92aba
│           ├── -18375678081597852666997_cache
│           └── -18375678081597852666997_lock
```

### Does this PR introduce _any_ user-facing change?

- Yes because this is a new feature.
- This will not affect the existing jobs because users don't use the string pattern `SPARK_APPLICATION_ID` or `SPARK_EXECUTOR_ID` inside `SPARK_EXECUTOR_DIRS` environment variable.

### How was this patch tested?

Pass the newly added test case.

Closes #29472 from dongjoon-hyun/SPARK-32655.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-08-19 12:11:34 -07:00
Burak Yavuz 278d0dd25b [SPARK-28863][SQL] Introduce AlreadyPlanned to prevent reanalysis of V1FallbackWriters
### What changes were proposed in this pull request?

This PR introduces a LogicalNode AlreadyPlanned, and related physical plan and preparation rule.

With the DataSourceV2 write operations, we have a way to fallback to the V1 writer APIs using InsertableRelation. The gross part is that we're in physical land, but the InsertableRelation takes a logical plan, so we have to pass the logical plans to these physical nodes, and then potentially go through re-planning. This re-planning can cause issues for an already optimized plan.

A useful primitive could be specifying that a plan is ready for execution through a logical node AlreadyPlanned. This would wrap a physical plan, and then we can go straight to execution.

### Why are the changes needed?

To avoid having a physical plan that is disconnected from the physical plan that is being executed in V1WriteFallback execution. When a physical plan node executes a logical plan, the inner query is not connected to the running physical plan. The physical plan that actually runs is not visible through the Spark UI and its metrics are not exposed. In some cases, the EXPLAIN plan doesn't show it.

### Does this PR introduce _any_ user-facing change?

Nope

### How was this patch tested?

V1FallbackWriterSuite tests that writes still work

Closes #29469 from brkyvz/alreadyAnalyzed2.

Lead-authored-by: Burak Yavuz <brkyvz@gmail.com>
Co-authored-by: Burak Yavuz <burak@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-08-19 16:25:35 +00:00
Terry Kim 3d1dce75d9 [SPARK-32621][SQL] 'path' option can cause issues while inferring schema in CSV/JSON datasources
### What changes were proposed in this pull request?

When CSV/JSON datasources infer schema (e.g, `def inferSchema(files: Seq[FileStatus])`, they use the `files` along with the original options. `files` in `inferSchema` could have been deduced from the "path" option if the option was present, so this can cause issues (e.g., reading more data, listing the path again) since the "path" option is **added** to the `files`.

### Why are the changes needed?

The current behavior can cause the following issue:
```scala
class TestFileFilter extends PathFilter {
  override def accept(path: Path): Boolean = path.getParent.getName != "p=2"
}

val path = "/tmp"
val df = spark.range(2)
df.write.json(path + "/p=1")
df.write.json(path + "/p=2")

val extraOptions = Map(
  "mapred.input.pathFilter.class" -> classOf[TestFileFilter].getName,
  "mapreduce.input.pathFilter.class" -> classOf[TestFileFilter].getName
)

// This works fine.
assert(spark.read.options(extraOptions).json(path).count == 2)

// The following with "path" option fails with the following:
// assertion failed: Conflicting directory structures detected. Suspicious paths
//	file:/tmp
//	file:/tmp/p=1
assert(spark.read.options(extraOptions).format("json").option("path", path).load.count() === 2)
```

### Does this PR introduce _any_ user-facing change?

Yes, the above failure doesn't happen and you get the consistent experience when you use `spark.read.csv(path)` or `spark.read.format("csv").option("path", path).load`.

### How was this patch tested?

Updated existing tests.

Closes #29437 from imback82/path_bug.

Authored-by: Terry Kim <yuminkim@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-08-19 16:23:22 +00:00
Yuming Wang 409fea30cc [SPARK-32624][SQL] Use getCanonicalName to fix byte[] compile issue
### What changes were proposed in this pull request?
```scala
scala> Array[Byte](1, 2).getClass.getName
res13: String = [B

scala> Array[Byte](1, 2).getClass.getCanonicalName
res14: String = byte[]
```

This PR replaces `getClass.getName` with `getClass.getCanonicalName` in `CodegenContext.addReferenceObj` to fix the `byte[]` compile issue:
```
...
/* 030 */       value_1 = org.apache.spark.sql.catalyst.util.TypeUtils.compareBinary(value_2, (([B) references[0] /* min */)) >= 0 && org.apache.spark.sql.catalyst.util.TypeUtils.compareBinary(value_2, (([B) references[1] /* max */)) <= 0;
/* 031 */     }
/* 032 */     return !isNull_1 && value_1;
/* 033 */   }
/* 034 */
/* 035 */
/* 036 */ }

20:49:54.886 WARN org.apache.spark.sql.catalyst.expressions.Predicate: Expr codegen error and falling back to interpreter mode
java.util.concurrent.ExecutionException: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 30, Column 81: failed to compile: org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 30, Column 81: Unexpected token "[" in primary
...
```

### Why are the changes needed?

Fix compile issue when compiling generated code.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Unit test.

Closes #29439 from wangyum/SPARK-32624.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Yuming Wang <wgyumg@gmail.com>
2020-08-19 05:20:26 -07:00
yi.wu a1a32d2eb5 [SPARK-32600][CORE] Unify task name in some logs between driver and executor
### What changes were proposed in this pull request?

This PR replaces some arbitrary task names in logs with the widely used task name (e.g. "task 0.0 in stage 1.0 (TID 1)") among driver and executor. This will change the task name in `TaskDescription` by appending TID.

### Why are the changes needed?

Some logs still use only the TID (a.k.a. `taskId`) as the task name, e.g.,

7f275ee597/core/src/main/scala/org/apache/spark/executor/Executor.scala (L786)

7f275ee597/core/src/main/scala/org/apache/spark/executor/Executor.scala (L632-L635)

And the task thread name also only has the `taskId`:

7f275ee597/core/src/main/scala/org/apache/spark/executor/Executor.scala (L325)

As mentioned in https://github.com/apache/spark/pull/1259, TID itself does not capture stage or retries, making it harder to correlate with the application. It's inconvenient when debugging applications.

Actually, task names like "task 0.0 in stage 1.0 (TID 1)" have already been used widely since https://github.com/apache/spark/pull/1259. We'd better follow the naming convention.

### Does this PR introduce _any_ user-facing change?

Yes. Users will see the more consistent task names in the log.

### How was this patch tested?

Manually checked.

Closes #29418 from Ngone51/unify-task-name.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-08-19 08:44:49 +00:00
angerszhu 03e2de99ab [SPARK-32608][SQL] Script Transform ROW FORMAT DELIMIT value should format value
### What changes were proposed in this pull request?
For SQL
```
SELECT TRANSFORM(a, b, c)
  ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  LINES TERMINATED BY '\n'
  NULL DEFINED AS 'null'
  USING 'cat' AS (a, b, c)
  ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  LINES TERMINATED BY '\n'
  NULL DEFINED AS 'NULL'
FROM testData
```
The correct values are:

TOK_TABLEROWFORMATFIELD should be `,` but is actually `','`

TOK_TABLEROWFORMATLINES should be `\n` but is actually `'\n'`

### Why are the changes needed?
Fix string value format

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Added UT

Closes #29428 from AngersZhuuuu/SPARK-32608.

Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-08-19 08:31:58 +00:00
yi.wu 3092527f75 [SPARK-32651][CORE] Decommission switch configuration should have the highest hierarchy
### What changes were proposed in this pull request?

Rename `spark.worker.decommission.enabled` to `spark.decommission.enabled` and move it from `org.apache.spark.internal.config.Worker` to `org.apache.spark.internal.config.package`.

### Why are the changes needed?

Decommissioning is already supported in Standalone and K8s, and may be supported in YARN (https://github.com/apache/spark/pull/27636) in the future. Therefore, the switch configuration should have the highest hierarchy rather than belong to Standalone's Worker. In other words, it should be independent of the cluster managers.

### Does this PR introduce _any_ user-facing change?

No, as the decommission feature hasn't been released.

### How was this patch tested?

Passes existing tests.

Closes #29466 from Ngone51/fix-decom-conf.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-08-19 06:53:06 +00:00
Samir Khan e15ae60a53 [SPARK-32550][SQL] Make SpecificInternalRow constructors faster by using while loops instead of maps
### What changes were proposed in this pull request?
Change maps in two constructors of SpecificInternalRow to while loops.
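
A minimal, self-contained sketch of the pattern (not the actual `SpecificInternalRow` code): an index-based while loop avoids the intermediate collection and per-element closure allocation that `map` incurs on a hot constructor path.

```scala
// Before (sketch): allocates a closure and builds the result via the collection API.
def squaresWithMap(xs: Array[Int]): Array[Int] = xs.map(x => x * x)

// After (sketch): fills a pre-sized array with a plain while loop.
def squaresWithWhile(xs: Array[Int]): Array[Int] = {
  val out = new Array[Int](xs.length)
  var i = 0
  while (i < xs.length) {
    out(i) = xs(i) * xs(i)
    i += 1
  }
  out
}
```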

### Why are the changes needed?
This was originally noticed with https://github.com/apache/spark/pull/29353 and https://github.com/apache/spark/pull/29354 and will have impacts on performance of reading ORC and Avro files. Ran AvroReadBenchmarks with the new cases of nested and array'd structs in https://github.com/apache/spark/pull/29352. Haven't run benchmarks for ORC but can do that if needed.

**Before:**
```
Nested Struct Scan:                       Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
Nested Struct                                     74674          75319         912          0.0      142429.1       1.0X

Array of Struct Scan:                     Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
Array of Structs                                  34193          34339         206          0.0       65217.9       1.0X
```
**After:**
```
Nested Struct Scan:                       Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
Nested Struct                                     48451          48619         237          0.0       92413.2       1.0X

Array of Struct Scan:                     Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
Array of Structs                                  18518          18683         234          0.0       35319.6       1.0X
```

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Ran AvroReadBenchmarks with the new cases of nested and array'd structs in https://github.com/apache/spark/pull/29352.

Closes #29366 from msamirkhan/spark-32550.

Lead-authored-by: Samir Khan <muhammad.samir.khan@gmail.com>
Co-authored-by: skhan04 <samirkhan@verizonmedia.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-08-19 14:57:34 +09:00
Sean Owen 891c5e661a [MINOR][DOCS] Add KMeansSummary and InheritableThread to documentation
### What changes were proposed in this pull request?

The class `KMeansSummary` in pyspark is not included in `clustering.py`'s `__all__` declaration. It isn't included in the docs as a result.

`InheritableThread` and `KMeansSummary` should be added to the corresponding RST files for documentation.

### Why are the changes needed?

It seems like an oversight to not include this as all similar "summary" classes are.
`InheritableThread` should also be documented.

### Does this PR introduce _any_ user-facing change?

I don't believe there are functional changes. It should make this public class appear in docs.

### How was this patch tested?

Existing tests / N/A.

Closes #29470 from srowen/KMeansSummary.

Lead-authored-by: Sean Owen <srowen@gmail.com>
Co-authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-08-19 14:30:07 +09:00
Wenchen Fan f33b64a656 [SPARK-32652][SQL] ObjectSerializerPruning fails for RowEncoder
### What changes were proposed in this pull request?

Update `ObjectSerializerPruning.alignNullTypeInIf`, to consider the isNull check generated in `RowEncoder`, which is `Invoke(inputObject, "isNullAt", BooleanType, Literal(index) :: Nil)`.

### Why are the changes needed?

Query fails if we don't fix this bug, due to type mismatch in `If`.

### Does this PR introduce _any_ user-facing change?

Yes, the failed query can run after this fix.

### How was this patch tested?

new tests

Closes #29467 from cloud-fan/bug.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-08-19 13:50:29 +09:00
Gengliang Wang 1b39215a65 [SPARK-32018][FOLLOWUP][DOC] Add migration guide for decimal value overflow in sum aggregation
### What changes were proposed in this pull request?

Add migration guide for decimal value overflow behavior in sum aggregation, introduced in https://github.com/apache/spark/pull/29026

### Why are the changes needed?

Add migration guide for the behavior changes from 3.0 to 3.1.
See also: https://github.com/apache/spark/pull/29450#issuecomment-675222779

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Build docs and preview:
![image](https://user-images.githubusercontent.com/1097932/90589256-8b7e3380-e192-11ea-8ff1-05a447c20722.png)

Closes #29458 from gengliangwang/migrationGuideDecimalOverflow.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
2020-08-19 11:37:53 +08:00
HyukjinKwon bfd8c34154 [SPARK-32645][INFRA] Upload unit-tests.log as an artifact
### What changes were proposed in this pull request?

This PR proposes to upload `target/unit-tests.log` into the artifact so it will be able to download here:
![Screen Shot 2020-08-18 at 2 23 18 PM](https://user-images.githubusercontent.com/6477701/90474095-789e3b80-e15f-11ea-87f8-e7da3df3c03e.png)

### Why are the changes needed?

Jenkins has this feature. It's best to have the same dev functionality here.
Also, note that this was pointed out https://github.com/apache/spark/pull/29225#discussion_r471485011.

### Does this PR introduce _any_ user-facing change?

No, dev-only

### How was this patch tested?

https://github.com/apache/spark/actions/runs/213000777 should demonstrate it

Closes #29454 from HyukjinKwon/SPARK-32645.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-08-19 12:28:36 +09:00
yi.wu 70964e741a [SPARK-21040][CORE][FOLLOW-UP] Only calculate executorKillTime when speculation is enabled
### What changes were proposed in this pull request?

Only calculate `executorKillTime` in `TaskSetManager.executorDecommission()` when speculation is enabled.

### Why are the changes needed?

Avoid unnecessary operations to save time/memory.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Passes existing tests.

Closes #29464 from Ngone51/followup-SPARK-21040.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-08-18 13:50:57 +00:00
HyukjinKwon babb654c81 [SPARK-32647][INFRA] Report SparkR test results with JUnit reporter
### What changes were proposed in this pull request?

This PR proposes to generate JUnit XML test report in SparkR tests that can be leveraged in both Jenkins and GitHub Actions.

**GitHub Actions**

![Screen Shot 2020-08-18 at 12 42 46 PM](https://user-images.githubusercontent.com/6477701/90467934-55b85b00-e150-11ea-863c-c8415e764ddb.png)

**Jenkins**

![Screen Shot 2020-08-18 at 2 03 42 PM](https://user-images.githubusercontent.com/6477701/90472509-a5505400-e15b-11ea-9165-777ec9b96eaa.png)

NOTE that while I am here, I am switching back the console reporter from "progress" to "summary". Currently non-ascii codes are broken in Jenkins console and switching it to "summary" can work around it.
"summary" is the default format used in testthat 1.x.

### Why are the changes needed?

To check the test failures more easily.

### Does this PR introduce _any_ user-facing change?

No, dev-only

### How was this patch tested?

It is tested in GitHub Actions at https://github.com/HyukjinKwon/spark/pull/23/checks?check_run_id=996586446
In case of Jenkins, https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/127525/testReport/

Closes #29456 from HyukjinKwon/sparkr-junit.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2020-08-18 19:35:15 +09:00
HyukjinKwon d0dfe4986b [MINOR][INFRA] Rename master.yml to build_and_test.yml
### What changes were proposed in this pull request?

This PR renames `master.yml` to `build_and_test.yml` to indicate this is the workflow that builds and runs the tests.

### Why are the changes needed?

Just for readability. `master.yml` looks like the name of the branch (to me).

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

GitHub Actions build in this PR will test it out.

Closes #29459 from HyukjinKwon/minor-rename.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
2020-08-18 18:18:47 +08:00
Luca Canali 21e0dd0461 [SPARK-32119][FOLLOWUP][DOC] Update monitoring doc following the improvement in SPARK-32119
### What changes were proposed in this pull request?
Update monitoring doc following the improvement/fix in SPARK-32119.

### Why are the changes needed?
SPARK-32119 removes the limitations listed in the monitoring doc "Distribution of the jar files containing the plugin code is currently not done by Spark."

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Not relevant

Closes #29463 from LucaCanali/followupSPARK32119.

Authored-by: Luca Canali <luca.canali@cern.ch>
Signed-off-by: Kousuke Saruta <sarutak@oss.nttdata.com>
2020-08-18 18:53:34 +09:00
Devesh Agrawal 1ac23dea52 [SPARK-32613][CORE] Fix regressions in DecommissionWorkerSuite
### What changes were proposed in this pull request?

The DecommissionWorkerSuite started becoming flaky and it revealed a real regression. Recently closed #29211 necessitates remembering the decommissioning shortly beyond the removal of the executor.

In addition to fixing this issue, ensure that DecommissionWorkerSuite continues to pass when executors haven't had a chance to exit eagerly. That is, the old behavior before #29211 still works.

Added some more tests to TaskSchedulerImpl to ensure that the decommissioning information is indeed purged after a timeout.

Hardened the test DecommissionWorkerSuite to make it wait for successful job completion.

### Why are the changes needed?

First, let me describe the intended behavior of decommissioning: If a fetch failure happens where the source executor was decommissioned, we want to treat that as an eager signal to clear all shuffle state associated with that executor. In addition if we know that the host was decommissioned, we want to forget about all map statuses from all other executors on that decommissioned host. This is what the test "decommission workers ensure that fetch failures lead to rerun" is trying to test. This invariant is important to ensure that decommissioning a host does not lead to multiple fetch failures that might fail the job. This fetch failure can happen before the executor is truly marked "lost" because of heartbeat delays.

- However, #29211 eagerly exits the executors when they are done decommissioning. This removal of the executor was racing with the fetch failure. By the time the fetch failure is triggered the executor is already removed and thus has forgotten its decommissioning information. (I tested this by delaying the decommissioning). The fix is to keep the decommissioning information around for some time after removal with some extra logic to finally purge it after a timeout.

- In addition the executor loss can also bump up `shuffleFileLostEpoch` (added in #28848). This happens because when the executor is lost, it forgets the shuffle state about just that executor and increments the `shuffleFileLostEpoch`. This incrementing precludes the clearing of state of the entire host when the fetch failure happens because the failed task is still reusing the old epoch. The fix here is also simple: Ignore the `shuffleFileLostEpoch` when the shuffle status is being cleared due to a fetch failure resulting from host decommission.

I am strategically making both of these fixes be very local to decommissioning to avoid other regressions. Especially the version stuff is tricky (it hasn't been fundamentally changed since it was first introduced in 2013).

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Manually ran DecommissionWorkerSuite several times using a script and ensured it all passed.

### (Internal) Configs added
I added two configs, one of which is sort of meant for testing only:
- `spark.test.executor.decommission.initial.sleep.millis`: Initial delay by the decommissioner shutdown thread. Default is same as before of 1 second. This is used for testing only. This one is kept "hidden" (ie not added as a constant to avoid config bloat)
- `spark.executor.decommission.removed.infoCacheTTL`: Number of seconds to keep the removed executors decom entries around. It defaults to 5 minutes. It should be around the average time it takes for all of the shuffle data to be fetched from the mapper to the reducer, but I think that can take a while since the reducers also do a multistep sort.
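
As a hedged sketch of the TTL idea behind `infoCacheTTL` (not the actual TaskSchedulerImpl code; Guava's cache is just one way to express an expiring map):

```scala
import java.util.concurrent.TimeUnit
import com.google.common.cache.CacheBuilder

case class DecommissionInfo(host: String, workerLost: Boolean)

// Keep decommission info around for a while after executor removal, so a late
// fetch failure can still see that the source executor/host was decommissioned.
val decomInfoCache = CacheBuilder.newBuilder()
  .expireAfterWrite(5, TimeUnit.MINUTES)   // spark.executor.decommission.removed.infoCacheTTL
  .build[String, DecommissionInfo]()

decomInfoCache.put("exec-1", DecommissionInfo("host-a", workerLost = true))
Option(decomInfoCache.getIfPresent("exec-1"))   // still visible for the TTL after removal
```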

Closes #29422 from agrawaldevesh/decom_fixes.

Authored-by: Devesh Agrawal <devesh.agrawal@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-08-18 06:47:31 +00:00
Liang-Chi Hsieh b33066f42b [SPARK-32622][SQL][TEST] Add case-sensitivity test for ORC predicate pushdown
### What changes were proposed in this pull request?

During working on SPARK-25557, we found that ORC predicate pushdown doesn't have case-sensitivity test. This PR proposes to add case-sensitivity test for ORC predicate pushdown.

### Why are the changes needed?

Increasing test coverage for ORC predicate pushdown.

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Pass Jenkins tests.

Closes #29427 from viirya/SPARK-25557-followup3.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-08-17 13:19:49 -07:00
HyukjinKwon 86852c57af [SPARK-32606][SPARK-32605][INFRA] Remove the forks of action-surefire-report and action-download-artifact in test_report.yml
### What changes were proposed in this pull request?

This PR proposes to remove the usage of my own forks and use the original plugins in GitHub Actions testing report.

SPARK-32357 introduced the GitHub Actions test reporting by leveraging two plugins:
 - [ScaCap/action-surefire-report](https://github.com/ScaCap/action-surefire-report)
 - [dawidd6/action-download-artifact](https://github.com/dawidd6/action-download-artifact)

In order to make it working, it had to fork two repositories with custom fixes:
  - HyukjinKwon/action-surefire-report@c96094c
  - f86c565d52

The two custom fixes are thankfully merged at https://github.com/ScaCap/action-surefire-report/pull/14 and https://github.com/dawidd6/action-download-artifact/pull/24, and they released new ones to use at [ScaCap/action-surefire-report/commits/v1](https://github.com/ScaCap/action-surefire-report/commits/v1) and [dawidd6/action-download-artifact/commits/v2](https://github.com/dawidd6/action-download-artifact/commits/v2)  - thanks jmisur and dawidd6 again.

### Why are the changes needed?

To avoid relying on forks and code duplications.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

Logically there is no diff. I tested it at https://github.com/HyukjinKwon/spark/runs/992824229 for doubly sure.

NOTE that this PR cannot be tested here within the workflow triggered by this PR without merging the changes in `test_report.yml` into the master.

Closes #29449 from HyukjinKwon/SPARK-32606-SPARK-32605.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-08-17 11:17:50 -07:00
xuewei.linxuewei 108b1dc723 [SPARK-32615][SQL] Fix AQE aggregateMetrics java.util.NoSuchElementException
### What changes were proposed in this pull request?
Found java.util.NoSuchElementException in the UT log of AdaptiveQueryExecSuite. During AQE, when a sub-plan changes, LiveExecutionData uses the new sub-plan's SQLMetrics to override the old ones. But in the final aggregateMetrics, since the plan was updated, the old metrics throw NoSuchElementException when they are matched against the new metricTypes. To sum up, we need to filter out those outdated metrics to avoid throwing java.util.NoSuchElementException, which causes the SparkUI SQL tab to render abnormally.

### Why are the changes needed?
SQL metrics are not correct for some AQE cases, and they break the SparkUI SQL tab in the case where NAAJ is rewritten to LocalRelation.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
* Added case in SQLAppStatusListenerSuite.
* Run AdaptiveQueryExecSuite with no "java.util.NoSuchElementException".
* Validation on Spark Web UI

Closes #29431 from leanken/leanken-SPARK-32615.

Authored-by: xuewei.linxuewei <xuewei.linxuewei@alibaba-inc.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-08-17 14:28:10 +00:00
yi.wu 9f2893cf2c [SPARK-32466][TEST][SQL] Add PlanStabilitySuite to detect SparkPlan regression
### What changes were proposed in this pull request?

This PR proposes to detect possible regression inside `SparkPlan`. To achieve this goal, this PR added a base test suite called  `PlanStabilitySuite`. The basic workflow of this test suite is similar to  `SQLQueryTestSuite`. It also uses `SPARK_GENERATE_GOLDEN_FILES` to decide whether it should regenerate the golden files or compare to the golden result for each input query. The difference is, `PlanStabilitySuite` uses the serialized explain result(.txt format) of the `SparkPlan` as the output of a query, instead of the data result.

And since `SparkPlan` is non-deterministic for various reasons, e.g.,  expressions ids changes, expression order changes, we'd reduce the plan to a simplified version that only contains node names and references. And we only identify those important nodes, e.g., `Exchange`, `SubqueryExec`, in the simplified plan.

And we'd reuse TPC-DS queries(v1.4, v2.7, modified) to test plans' stability. Currently, one TPC-DS query can only have one corresponding simplified golden plan.

This PR also did some refactoring, extracting `TPCDSBase` from `TPCDSQuerySuite` so that `PlanStabilitySuite` can use the TPC-DS queries as well.

### Why are the changes needed?

Nowadays, Spark is getting more and more complex. Any changes might cause regression unintentionally. Spark already has some benchmark to catch the performance regression. But, yet, it doesn't have a way to detect the regression inside `SparkPlan`. It would be good if we could detect the possible regression early during the compile phase before the runtime phase.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added `PlanStabilitySuite` and it's subclasses.

Closes #29270 from Ngone51/plan-stable.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-08-17 14:22:12 +00:00
Wenchen Fan b94c67b502 Revert "[SPARK-32511][SQL] Add dropFields method to Column class"
This reverts commit 0c850c71e7.
2020-08-17 13:18:46 +08:00
Cheng Su 8f0fef1843 [SPARK-32399][SQL] Full outer shuffled hash join
### What changes were proposed in this pull request?

Add support for full outer join inside shuffled hash join. Currently if the query is a full outer join, we only use sort merge join as the physical operator. However it can be CPU and IO intensive in case input table is large for sort merge join. Shuffled hash join on the other hand saves the sort CPU and IO compared to sort merge join, especially when table is large.

This PR implements the full outer join as follows:
* Process rows from stream side by looking up hash relation, and mark the matched rows from build side by:
  * for joining with unique key, a `BitSet` is used to record matched rows from build side (`key index` to represent each row)
  * for joining with non-unique key, a `HashSet[Long]` is  used to record matched rows from build side (`key index` + `value index` to represent each row).
`key index` is defined as the index into key addressing array `longArray` in `BytesToBytesMap`.
`value index` is defined as the iterator index of values for same key.

* Process rows from build side by iterating hash relation, and filter out rows from build side being looked up already (done in `ShuffledHashJoinExec.fullOuterJoin`)

For context, this PR was originally implemented as follows (up to commit e3322766d4):
1. Construct hash relation from build side, with extra boolean value at the end of row to track look up information (done in `ShuffledHashJoinExec.buildHashedRelation` and `UnsafeHashedRelation.apply`).
2. Process rows from stream side by looking up hash relation, and mark the matched rows from the build side as looked up (done in `ShuffledHashJoinExec.fullOuterJoin`).
3. Process rows from build side by iterating hash relation, and filter out rows from build side being looked up already (done in `ShuffledHashJoinExec.fullOuterJoin`).

See discussion of pros and cons between these two approaches [here](https://github.com/apache/spark/pull/29342#issuecomment-672275450), [here](https://github.com/apache/spark/pull/29342#issuecomment-672288194) and [here](https://github.com/apache/spark/pull/29342#issuecomment-672640531).

TODO: codegen for full outer shuffled hash join can be implemented in another followup PR.

### Why are the changes needed?

As implemented in this PR, full outer shuffled hash join has the overhead of iterating the build side twice (once for building the hash map, and again for outputting non-matching rows), and iterating the stream side once. However, full outer sort merge join needs to iterate both sides twice, and sorting the large table can be more CPU and IO intensive. So full outer shuffled hash join can be more efficient than sort merge join when the stream side is much larger than the build side.

For example query below, full outer SHJ saved 30% wall clock time compared to full outer SMJ.

```
def shuffleHashJoin(): Unit = {
    val N: Long = 4 << 22
    withSQLConf(
      SQLConf.SHUFFLE_PARTITIONS.key -> "2",
      SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "20000000") {
      codegenBenchmark("shuffle hash join", N) {
        val df1 = spark.range(N).selectExpr(s"cast(id as string) as k1")
        val df2 = spark.range(N / 10).selectExpr(s"cast(id * 10 as string) as k2")
        val df = df1.join(df2, col("k1") === col("k2"), "full_outer")
        df.noop()
    }
  }
}
```

```
Running benchmark: shuffle hash join
  Running case: shuffle hash join off
  Stopped after 2 iterations, 16602 ms
  Running case: shuffle hash join on
  Stopped after 5 iterations, 31911 ms

Java HotSpot(TM) 64-Bit Server VM 1.8.0_181-b13 on Mac OS X 10.15.4
Intel(R) Core(TM) i9-9980HK CPU  2.40GHz
shuffle hash join:                        Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
shuffle hash join off                              7900           8301         567          2.1         470.9       1.0X
shuffle hash join on                               6250           6382          95          2.7         372.5       1.3X
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added unit test in `JoinSuite.scala`, `AbstractBytesToBytesMapSuite.java` and `HashedRelationSuite.scala`.

Closes #29342 from c21/full-outer-shj.

Authored-by: Cheng Su <chengsu@fb.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-08-17 08:06:19 +09:00
Kousuke Saruta 9a79bbc8b6 [SPARK-32610][DOCS] Fix the link to metrics.dropwizard.io in monitoring.md to refer the proper version
### What changes were proposed in this pull request?

This PR fixes the link to metrics.dropwizard.io in monitoring.md to refer the proper version of the library.

### Why are the changes needed?

There are links to metrics.dropwizard.io in monitoring.md but the link targets refer to version 3.1.0, while we use 4.1.1.
Now that users can create their own metrics using the dropwizard library, it's better to fix the links to refer to the proper version.

### Does this PR introduce _any_ user-facing change?

Yes. The modified links refer to version 4.1.1.

### How was this patch tested?

Build the docs and visit all the modified links.

Closes #29426 from sarutak/fix-dropwizard-url.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-08-16 12:07:37 -05:00
Yuming Wang c280c7f529 [SPARK-32625][SQL] Log error message when falling back to interpreter mode
### What changes were proposed in this pull request?

This PR logs the error message when falling back to interpreter mode.

### Why are the changes needed?

Not all error messages are in `CodeGenerator`, such as:
```
21:48:44.612 WARN org.apache.spark.sql.catalyst.expressions.Predicate: Expr codegen error and falling back to interpreter mode
java.lang.IllegalArgumentException: Can not interpolate org.apache.spark.sql.types.Decimal into code block.
	at org.apache.spark.sql.catalyst.expressions.codegen.Block$BlockHelper$.$anonfun$code$1(javaCode.scala:240)
	at org.apache.spark.sql.catalyst.expressions.codegen.Block$BlockHelper$.$anonfun$code$1$adapted(javaCode.scala:236)
	at scala.collection.IndexedSeqOptimized.foreach(IndexedSeqOptimized.scala:36)
	at scala.collection.IndexedSeqOptimized.foreach$(IndexedSeqOptimized.scala:33)
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual test.

Closes #29440 from wangyum/SPARK-32625.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-08-15 12:31:32 -07:00
Kousuke Saruta 1a4c8f718f [SPARK-32119][CORE] ExecutorPlugin doesn't work with Standalone Cluster and Kubernetes with --jars
### What changes were proposed in this pull request?

This PR changes `Executor` to load jars and files added by `--jars` and `--files` on executor initialization.
To avoid downloading those jars/files twice, they are associated with `startTime` as their upload timestamp.

### Why are the changes needed?

`ExecutorPlugin` doesn't work with Standalone Cluster and Kubernetes when a jar that contains the plugins, and files used by the plugins, are added with the `--jars` and `--files` options of `spark-submit`.

This is because jars and files added by `--jars` and `--files` are not loaded on executor initialization.
I confirmed it works with YARN because there the jars/files are distributed via the distributed cache.
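
As an illustration (a hedged sketch with made-up names, not code from this PR), the kind of setup that hits this issue is an executor plugin class shipped via `--jars` that reads a file shipped via `--files` during initialization; registration through a `SparkPlugin` implementation is omitted here:

```
import java.nio.file.{Files, Paths}

import org.apache.spark.SparkFiles
import org.apache.spark.api.plugin.{ExecutorPlugin, PluginContext}

// Hypothetical plugin: the class lives in a jar added with --jars, and it reads a
// config file added with --files. Before this fix, on Standalone and Kubernetes the
// jar/file were not yet available when executors initialized their plugins.
class MyExecutorPlugin extends ExecutorPlugin {
  override def init(ctx: PluginContext, extraConf: java.util.Map[String, String]): Unit = {
    val confPath = SparkFiles.get("my-plugin.conf")  // resolved on the executor
    val contents = new String(Files.readAllBytes(Paths.get(confPath)), "UTF-8")
    println(s"Loaded plugin config (${contents.length} chars)")
  }

  override def shutdown(): Unit = ()
}
```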

### Does this PR introduce _any_ user-facing change?

Yes. Jars/files added by `--jars` and `--files` are now downloaded by each executor on initialization.

### How was this patch tested?

Added a new test case.

Closes #28939 from sarutak/fix-plugin-issue.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
2020-08-14 17:10:22 -05:00
yi.wu c6be2074cc [SPARK-32616][SQL] Window operators should be added determinedly
### What changes were proposed in this pull request?

Use the `LinkedHashMap` instead of `immutable.Map` to hold the `Window` expressions in `ExtractWindowExpressions.addWindow`.

### Why are the changes needed?

This is a bug fix for https://github.com/apache/spark/pull/29270. In that PR, the plan generated on Jenkins (especially for the queries q47, q49, and q57) never matched the golden plan generated on my laptop.

It happens because `ExtractWindowExpressions.addWindow` now uses an `immutable.Map` to hold the `Window` expressions, keyed by `(spec.partitionSpec, spec.orderSpec, WindowFunctionType.functionType(expr))`, and converts the map to a `Seq` at the end. The `Seq` is then used to add Window operators on top of the child plan. However, for the same query, the order of the Window expressions inside the `Seq` can be nondeterministic when expression IDs change (which can affect the key). As a result, the same query can produce different plans because of the nondeterministic order of Window operators.

Therefore, we use `LinkedHashMap`, which records the insertion order of entries, to make the order in which Window operators are added deterministic.
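
To illustrate the difference with a standalone sketch (not the planner code itself): a hash-based immutable `Map` gives no iteration-order guarantee once it grows beyond a few entries, while `mutable.LinkedHashMap` always iterates in insertion order:

```
import scala.collection.mutable

object WindowOrderSketch extends App {
  val entries = Seq("w1" -> 1, "w2" -> 2, "w3" -> 3, "w4" -> 4, "w5" -> 5)

  val hashed = Map(entries: _*)                    // iteration order depends on key hashes
  val linked = mutable.LinkedHashMap(entries: _*)  // iteration order is insertion order

  // The hashed map may list keys in any order, and that order can change whenever the
  // keys change (as when expression IDs differ); the linked map always lists w1..w5.
  println(hashed.keys.mkString(", "))
  println(linked.keys.mkString(", "))
}
```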

### Does this PR introduce _any_ user-facing change?

Maybe yes: users now always see the same plan for the same query with multiple Window operators.

### How was this patch tested?

It's really hard to make a reproducible demo. I tested manually with https://github.com/apache/spark/pull/29270 and it looks good.

Closes #29432 from Ngone51/fix-addWindow.

Authored-by: yi.wu <yi.wu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-08-14 13:29:48 +00:00
alexander-daskalov 10edeafc69 [MINOR][SQL] Fixed approx_count_distinct rsd param description
### What changes were proposed in this pull request?

In the docs for `approx_count_distinct`, I changed the description of the `rsd` parameter from **_maximum estimation error allowed_** to **_maximum relative standard deviation allowed_**.

### Why are the changes needed?

"Maximum estimation error allowed" can be misleading: the parameter sets the target relative standard deviation, which affects the estimation error, but on any given run the actual estimation error can still exceed the `rsd` value.
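
For reference, a small usage sketch of the parameter in question (illustrative only, not part of this PR):

```
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.approx_count_distinct

object RsdExample extends App {
  val spark = SparkSession.builder().master("local[*]").appName("rsd-example").getOrCreate()
  import spark.implicits._

  val df = (1 to 100000).toDF("id")

  // The second argument is rsd: the target relative standard deviation of the estimate,
  // not a hard upper bound on the error of any particular run.
  df.select(approx_count_distinct($"id", 0.05)).show()

  spark.stop()
}
```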

### Does this PR introduce _any_ user-facing change?

This PR should make it easier for users reading the docs to understand that the `rsd` parameter of `approx_count_distinct` doesn't cap the estimation error, but just sets the target deviation instead.

### How was this patch tested?

No tests, as no code changes were made.

Closes #29424 from Comonut/fix-approx_count_distinct-rsd-param-description.

Authored-by: alexander-daskalov <alexander.daskalov@adevinta.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2020-08-14 22:10:41 +09:00
Huaxin Gao 14003d4c30 [SPARK-32590][SQL] Remove fullOutput from RowDataSourceScanExec
### What changes were proposed in this pull request?
Remove `fullOutput` from `RowDataSourceScanExec`

### Why are the changes needed?
`RowDataSourceScanExec` requires the full output instead of the scan output after column pruning. However, in the v2 code path we don't have the full output anymore, so we just pass the pruned output. `RowDataSourceScanExec.fullOutput` is therefore meaningless, and we should remove it.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
existing tests

Closes #29415 from huaxingao/rm_full_output.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-08-14 08:08:16 +00:00
ulysses 339eec5f32 [SPARK-20680][SQL][FOLLOW-UP] Add HiveVoidType in HiveClientImpl
### What changes were proposed in this pull request?

Follow-up to the discussion in [this comment](https://github.com/apache/spark/pull/29244#issuecomment-671746329).

Add a `HiveVoidType` class in `HiveClientImpl` so that we can replace `NullType` with `HiveVoidType` before calling the Hive client.

### Why are the changes needed?

Better compatibility with Hive.

More details in [#29244](https://github.com/apache/spark/pull/29244).

### Does this PR introduce _any_ user-facing change?

Yes, users can create a view with a null-typed column in Hive.
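
A hedged example of the user-facing behavior (assuming a Hive-enabled session; this is not taken from the PR's tests):

```
import org.apache.spark.sql.SparkSession

object NullTypedViewExample extends App {
  val spark = SparkSession.builder()
    .master("local[*]")
    .appName("null-typed-view")
    .enableHiveSupport()  // requires the spark-hive module on the classpath
    .getOrCreate()

  // The view column `c` has no concrete type; with this change the NULL type is mapped
  // to Hive's void type instead of failing when the view definition is stored in Hive.
  spark.sql("CREATE OR REPLACE VIEW v_null AS SELECT NULL AS c")
  spark.sql("SELECT * FROM v_null").show()

  spark.stop()
}
```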

### How was this patch tested?

New test.

Closes #29423 from ulysses-you/add-HiveVoidType.

Authored-by: ulysses <youxiduo@weidian.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2020-08-14 06:59:15 +00:00
Hyukjin Kwon 5debde9401 [SPARK-32357][INFRA] Publish failed and succeeded test reports in GitHub Actions
### What changes were proposed in this pull request?

This PR proposes to report the failed and succeeded tests in GitHub Actions in order to improve the development velocity by leveraging [ScaCap/action-surefire-report](https://github.com/ScaCap/action-surefire-report). See the example below:

![Screen Shot 2020-08-13 at 8 17 52 PM](https://user-images.githubusercontent.com/6477701/90128649-28f7f280-dda2-11ea-9211-e98e34332f6b.png)

Note that we cannot just use [ScaCap/action-surefire-report](https://github.com/ScaCap/action-surefire-report) in Apache Spark because PRs come from forked repositories, where GitHub secrets are unavailable for security reasons. This plugin, like all similar plugins, requires a GitHub token with write access in order to post test results, but such a token is unavailable in PRs.

To work around this limitation, I took this approach:

1. In workflow A, run the tests and upload the JUnit XML test results. GitHub Actions allows workflows to upload and download such files as artifacts.
2. GitHub introduced the new event type [`workflow_run`](https://github.blog/2020-08-03-github-actions-improvements-for-fork-and-pull-request-workflows/) 10 days ago. By leveraging this, the completion of workflow A triggers another workflow B.
3. Workflow B lives in the main repo instead of the fork, so it has the write access the plugin needs. It downloads the artifacts uploaded by workflow A (from the forked repository).
4. Workflow B generates the test reports from the JUnit XML files.
5. Workflow B looks up the PR and posts the test reports.

The `workflow_run` event is a very new feature, and not many GitHub Actions plugins seem to support it yet. In order to make this work with [ScaCap/action-surefire-report](https://github.com/ScaCap/action-surefire-report), I had to fork two GitHub Actions plugins:
 - [ScaCap/action-surefire-report](https://github.com/ScaCap/action-surefire-report) to have this custom fix: c96094cc35
    It adds a `commit` argument to specify the commit to post the test reports against. With `workflow_run`, workflow B can access the commit from workflow A.

 - [dawidd6/action-download-artifact](https://github.com/dawidd6/action-download-artifact) to have this custom fix: 750b71af35
    It adds support for downloading all artifacts from workflow A in workflow B. By default, it only supports specifying the name of a single artifact.

    Note that I was not able to use the official [actions/download-artifact](https://github.com/actions/download-artifact) because:
      - It does not support downloading artifacts across different workflows; see also https://github.com/actions/download-artifact/issues/3. Once this issue is resolved, we can switch back to [actions/download-artifact](https://github.com/actions/download-artifact).

I plan to open pull requests against both repositories so we don't have to rely on the forks.

### Why are the changes needed?

Currently, it's difficult to check the failed tests: you have to scroll through long GitHub Actions logs.

### Does this PR introduce _any_ user-facing change?

No, dev-only.

### How was this patch tested?

Manually tested at: https://github.com/HyukjinKwon/spark/pull/17, https://github.com/HyukjinKwon/spark/pull/18, https://github.com/HyukjinKwon/spark/pull/19, https://github.com/HyukjinKwon/spark/pull/20, and master branch of my forked repository.

Closes #29333 from HyukjinKwon/SPARK-32357-fix.

Lead-authored-by: Hyukjin Kwon <gurwls223@apache.org>
Co-authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dongjoon@apache.org>
2020-08-13 20:50:47 -07:00
yangjie01 6ae2cb2db3 [SPARK-32526][SQL] Fix some test cases of sql/catalyst module in scala 2.13
### What changes were proposed in this pull request?
The purpose of this PR is to partially resolve [SPARK-32526](https://issues.apache.org/jira/browse/SPARK-32526). A total of 88 failed and 2 aborted test cases were fixed; the related suites are as follows:

- `DataSourceV2AnalysisBaseSuite` related test cases (71 FAILED -> Pass)
- `TreeNodeSuite` (1 FAILED -> Pass)
- `MetadataSuite` (1 FAILED -> Pass)
- `InferFiltersFromConstraintsSuite` (3 FAILED -> Pass)
- `StringExpressionsSuite` (1 FAILED -> Pass)
- `JacksonParserSuite` (1 FAILED -> Pass)
- `HigherOrderFunctionsSuite` (1 FAILED -> Pass)
- `ExpressionParserSuite` (1 FAILED -> Pass)
- `CollectionExpressionsSuite` (6 FAILED -> Pass)
- `SchemaUtilsSuite` (2 FAILED -> Pass)
- `ExpressionSetSuite` (ABORTED -> Pass)
- `ArrayDataIndexedSeqSuite` (ABORTED -> Pass)

The main changes of this PR are as follows:

- `Optimizer` and `Analyzer` are changed so that they compile: `ArrayBuffer` is not a `Seq` in Scala 2.13, so the `toSeq` method is called manually to stay compatible with Scala 2.12 (see the sketch after this list).

- The `m.mapValues().view.force` pattern returns a `Map` in Scala 2.12 but an `IndexedSeq` in Scala 2.13, so the `toMap` method is called manually to stay compatible with Scala 2.12. `TreeNode` is changed to pass the `DataSourceV2AnalysisBaseSuite`-related test cases and the failed `TreeNodeSuite` case.

- Call `toMap` in the `case map` branch of the `Metadata#hash` method, because `map.mapValues` returns a `Map` in Scala 2.12 but a `MapView` in Scala 2.13.

- The `impl` concat method of `ExpressionSet` in the Scala 2.13 version now follows the Scala 2.12 `ExpressionSet`, so that the `++` method conforms to `ExpressionSet` semantics.

- `GenericArrayData` does not accept `ArrayBuffer` input, so call `toSeq` when constructing a `GenericArrayData` from an `ArrayBuffer`, for Scala version compatibility.

- Call `toSeq` in the `RandomDataGenerator#randomRow` method to ensure the contents of `fields` is a `Seq`, not an `ArrayBuffer`.

- Call `toSeq` so that `JacksonParser#parse` still returns a `Seq`, because the check method of `JacksonParserSuite#"skipping rows using pushdown filters"` depends on the `Seq` type.
- Call `toSeq` in `AstBuilder#visitFunctionCall`; otherwise `ctx.argument.asScala.map(expression)` is a `Buffer` in Scala 2.13.

- Add a `LongType` match to `ArraySetLike.nullValueHolder`.

- Add a `sorted` call to ensure the `duplicateColumns` string in the `SchemaUtils.checkColumnNameDuplication` error message has a deterministic order.
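
A small standalone sketch of the two most common cross-version pitfalls listed above (illustrative, not code from this PR):

```
import scala.collection.mutable.ArrayBuffer

object CrossVersionSketch extends App {
  // 1. In Scala 2.13, mutable.ArrayBuffer no longer conforms to scala.Seq (now an alias
  //    for the immutable Seq), so APIs typed as Seq[T] need an explicit .toSeq.
  val buf = ArrayBuffer(1, 2, 3)
  val asSeq: Seq[Int] = buf.toSeq  // compiles and behaves the same on 2.12 and 2.13

  // 2. Map#mapValues is strict in 2.12 but returns a lazy MapView in 2.13; forcing it
  //    with .toMap yields the same strict Map on both versions.
  val doubled: Map[String, Int] = Map("a" -> 1, "b" -> 2).mapValues(_ * 2).toMap

  println(asSeq)
  println(doubled)
}
```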

### Why are the changes needed?
We need to support a Scala 2.13 build.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?

- Scala 2.12: Pass the Jenkins or GitHub Action

- Scala 2.13: Do the following:

```
dev/change-scala-version.sh 2.13
mvn clean install -DskipTests  -pl sql/catalyst -Pscala-2.13 -am
mvn test -pl sql/catalyst -Pscala-2.13
```

**Before**
```
Tests: succeeded 3853, failed 103, canceled 0, ignored 6, pending 0
*** 3 SUITES ABORTED ***
*** 103 TESTS FAILED ***
```

**After**

```
Tests: succeeded 4035, failed 17, canceled 0, ignored 6, pending 0
*** 1 SUITE ABORTED ***
*** 15 TESTS FAILED ***
```

Closes #29370 from LuciferYang/fix-DataSourceV2AnalysisBaseSuite.

Authored-by: yangjie01 <yangjie01@baidu.com>
Signed-off-by: Sean Owen <srowen@gmail.com>
2020-08-13 11:46:30 -05:00