Commit graph

25188 commits

Dongjoon Hyun 34915b22ab [SPARK-29104][CORE][TESTS] Fix PipedRDDSuite to use eventually to check thread termination
### What changes were proposed in this pull request?

`PipedRDD` invokes `stdinWriterThread.interrupt()` at task completion, and `obj.wait` receives an `InterruptedException`. However, thread termination can be delayed because the thread resumes from `obj.wait()` with that exception. To prevent test flakiness, we need to use `eventually`. This PR also fixes a typo in a code comment and a variable name.
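
A minimal sketch of the `eventually` pattern, assuming ScalaTest's `Eventually` and `SpanSugar` helpers; the thread-scanning check below is illustrative, not the suite's actual assertion:
```scala
import org.scalatest.concurrent.Eventually._
import org.scalatest.time.SpanSugar._

def assertStdinWriterTerminated(): Unit = {
  // poll instead of asserting once, so a delayed interrupt does not fail the test
  eventually(timeout(10.seconds), interval(100.millis)) {
    val writers = Thread.getAllStackTraces.keySet.toArray(Array.empty[Thread])
      .filter(_.getName.startsWith("stdin writer for"))
    assert(writers.isEmpty)
  }
}
```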

### Why are the changes needed?

```
- stdin writer thread should be exited when task is finished *** FAILED ***
  Some(Thread[stdin writer for List(cat),5,]) was not empty (PipedRDDSuite.scala:107)
```

- https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-master-test-maven-hadoop-2.7/6867/testReport/junit/org.apache.spark.rdd/PipedRDDSuite/stdin_writer_thread_should_be_exited_when_task_is_finished/

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Manual.

We can reproduce the same failure as Jenkins if we catch `InterruptedException` and sleep longer than the `eventually` timeout inside the test code. The following is an example that reproduces it.
```scala
val nums = sc.makeRDD(Array(1, 2, 3, 4), 1).map { x =>
  try {
    obj.synchronized {
      obj.wait() // make the thread wait here.
    }
  } catch {
    case ie: InterruptedException =>
      Thread.sleep(15000)
      throw ie
  }
  x
}
```

Closes #25808 from dongjoon-hyun/SPARK-29104.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-09-17 20:21:25 +09:00
WeichenXu 104b9b6f8c [SPARK-28483][FOLLOW-UP] Fix flaky test in BarrierTaskContextSuite
### What changes were proposed in this pull request?

I fixed the flaky test "barrier task killed":

* Split the interrupt/no-interrupt tests into separate SparkContexts, to prevent them from influencing each other.
* Only check the exception on partition 0; partition 1 hangs in a sleep and may throw a different exception.

### Why are the changes needed?
Make test robust.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
N/A

Closes #25799 from WeichenXu123/oss_fix_barrier_test.

Authored-by: WeichenXu <weichen.xu@databricks.com>
Signed-off-by: WeichenXu <weichen.xu@databricks.com>
2019-09-17 19:08:09 +08:00
iRakson 79b10a1aab [SPARK-28929][CORE] Spark Logging level should be INFO instead of DEBUG in Executor Plugin API
### What changes were proposed in this pull request?

Log levels in Executor.scala are changed from DEBUG to INFO.

### Why are the changes needed?

Logging level DEBUG is too low here. These messages are simple acknowledgements of successful loading and initialization of plugins, so it is better to keep them at INFO level.
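
A hedged before/after sketch of the kind of change described (the log messages are illustrative, not the exact strings in Executor.scala); the point is simply the switch from debug to info:
```scala
import org.slf4j.LoggerFactory

val log = LoggerFactory.getLogger("Executor")
val pluginName = "MyExecutorPlugin"

// before: log.debug(s"Successfully loaded plugin $pluginName")
log.info(s"Successfully loaded plugin $pluginName")
log.info(s"Initialized executor plugin $pluginName")
```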

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Manually tested.

Closes #25634 from iRakson/ExecutorPlugin.

Authored-by: iRakson <raksonrakesh@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-09-17 00:53:12 -07:00
Maxim Gekk db996ccad9 [SPARK-29074][SQL] Optimize date_format for foldable fmt
### What changes were proposed in this pull request?

In this PR, I propose to create an instance of `TimestampFormatter` only once at initialization, and reuse it inside `nullSafeEval()` and `doGenCode()` when the `fmt` parameter is foldable.
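
A hedged, self-contained sketch of the caching pattern (using plain `java.time` instead of Spark's internal `TimestampFormatter`; class and field names are illustrative):
```scala
import java.time.{Instant, ZoneId}
import java.time.format.DateTimeFormatter

final class DateFormatEval(constFormat: Option[String], zoneId: ZoneId) {
  // built once when the format argument is foldable (a constant)
  private val cachedFormatter: Option[DateTimeFormatter] =
    constFormat.map(f => DateTimeFormatter.ofPattern(f).withZone(zoneId))

  def format(epochMillis: Long, format: String): String = {
    // reuse the cached formatter; otherwise fall back to per-row construction
    val formatter = cachedFormatter.getOrElse(
      DateTimeFormatter.ofPattern(format).withZone(zoneId))
    formatter.format(Instant.ofEpochMilli(epochMillis))
  }
}
```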

### Why are the changes needed?

The changes improve performance of the `date_format()` function.

Before:
```
format date:                             Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
format date wholestage off                    7180 / 7181          1.4         718.0       1.0X
format date wholestage on                     7051 / 7194          1.4         705.1       1.0X
```

After:
```
format date:                             Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
format date wholestage off                    4787 / 4839          2.1         478.7       1.0X
format date wholestage on                     4736 / 4802          2.1         473.6       1.0X
```

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?

By existing test suites `DateExpressionsSuite` and `DateFunctionsSuite`.

Closes #25782 from MaxGekk/date_format-foldable.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-09-17 16:00:10 +09:00
Jungtaek Lim (HeartSaVioR) c8628354b7 [SPARK-28996][SQL][TESTS] Add tests regarding username of HiveClient
### What changes were proposed in this pull request?

This patch proposes to add new tests for the username of HiveClient to prevent changing the semantics unintentionally. The owner of a Hive table has been changed back and forth (principal -> username -> principal), and it looks like the change was not intentional. (Please refer to [SPARK-28996](https://issues.apache.org/jira/browse/SPARK-28996) for more details.) This patch intends to prevent this.

This patch also renames the previous HiveClientSuite(s) to HivePartitionFilteringSuite(s), as noted in a TODO comment, since the previous tests were too narrow, testing only partition filtering.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Newly added UTs.

Closes #25696 from HeartSaVioR/SPARK-28996.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-09-17 14:04:23 +08:00
zhengruifeng 4d27a25908 [SPARK-22797][ML][PYTHON] Bucketizer support multi-column
### What changes were proposed in this pull request?
Add multi-column support to Bucketizer on the Python side.
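
For reference, a hedged sketch of the Scala-side multi-column API that the Python side now mirrors (column names and split points are illustrative):
```scala
import org.apache.spark.ml.feature.Bucketizer

val bucketizer = new Bucketizer()
  .setInputCols(Array("f1", "f2"))
  .setOutputCols(Array("f1_bucket", "f2_bucket"))
  .setSplitsArray(Array(
    Array(Double.NegativeInfinity, 0.0, 10.0, Double.PositiveInfinity),
    Array(Double.NegativeInfinity, 1.0, Double.PositiveInfinity)))
```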

### Why are the changes needed?
Bucketizer should support multiple columns, like the Scala side.

### Does this PR introduce any user-facing change?
Yes, this PR adds a new Python API.

### How was this patch tested?
Added test suites.

Closes #25801 from zhengruifeng/20542_py.

Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
2019-09-17 11:52:20 +08:00
Liang-Chi Hsieh dffd92e977 [SPARK-29100][SQL] Fix compilation error in codegen with switch from InSet expression
### What changes were proposed in this pull request?

When InSet generates Java switch-based code and the input set is empty, we don't generate a switch condition, but a simple expression that corresponds to the default case of the original switch.
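
A hedged, simplified sketch of the idea (illustrative only, not the actual codegen in `InSet`): with an empty set there are no `case` labels to emit, so only the default-branch code is produced instead of a switch with no cases.
```scala
def genSwitchOrDefault(values: Set[Int], inputVar: String, resultVar: String): String = {
  if (values.isEmpty) {
    // an empty IN-list can never match
    s"boolean $resultVar = false;"
  } else {
    val cases = values.map(v => s"case $v:").mkString(" ")
    s"""
       |boolean $resultVar = false;
       |switch ($inputVar) {
       |  $cases $resultVar = true; break;
       |  default: $resultVar = false;
       |}
     """.stripMargin
  }
}
```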

### Why are the changes needed?

SPARK-26205 added an optimization to InSet that generates a Java switch condition for certain cases. When the given set is empty, codegen can cause a compilation error:

```
[info] - SPARK-29100: InSet with empty input set *** FAILED *** (58 milliseconds)
[info]   Code generation of input[0, int, true] INSET () failed:
[info]   org.codehaus.janino.InternalCompilerException: failed to compile: org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass" in "generated.java": Compiling "apply(java.lang.Object _i)"; apply(java.lang.Object _i): Operand stack inconsistent at offset 45: Previous size 0, now 1
[info]   org.codehaus.janino.InternalCompilerException: failed to compile: org.codehaus.janino.InternalCompilerException: Compiling "GeneratedClass" in "generated.java": Compiling "apply(java.lang.Object _i)"; apply(java.lang.Object _i): Operand stack inconsistent at offset 45: Previous size 0, now 1
[info]         at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressions$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1308)
[info]         at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1386)
[info]         at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$$anon$1.load(CodeGenerator.scala:1383)
```

### Does this PR introduce any user-facing change?

Yes. Previously, when users had an InSet against an empty set, the generated code caused a compilation error. This patch fixes it.

### How was this patch tested?

Unit test added.

Closes #25806 from viirya/SPARK-29100.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-09-17 11:06:10 +08:00
Takeshi Yamamuro 95073fb62b [SPARK-29008][SQL] Define an individual method for each common subexpression in HashAggregateExec
### What changes were proposed in this pull request?

This PR proposes to define an individual method for each common subexpression in HashAggregateExec. In the current master, the common subexpression elimination code in HashAggregateExec is expanded into a single method; 4664a082c2/sql/core/src/main/scala/org/apache/spark/sql/execution/aggregate/HashAggregateExec.scala (L397)

The method size can be too big for JIT compilation, so I believe splitting it is beneficial for performance. For example, in a query `SELECT SUM(a + b), AVG(a + b + c) FROM VALUES (1, 1, 1) t(a, b, c)`,

the current master generates:
```
/* 098 */   private void agg_doConsume_0(InternalRow localtablescan_row_0, int agg_expr_0_0, int agg_expr_1_0, int agg_expr_2_0) throws java.io.IOException {
/* 099 */     // do aggregate
/* 100 */     // common sub-expressions
/* 101 */     int agg_value_6 = -1;
/* 102 */
/* 103 */     agg_value_6 = agg_expr_0_0 + agg_expr_1_0;
/* 104 */
/* 105 */     int agg_value_5 = -1;
/* 106 */
/* 107 */     agg_value_5 = agg_value_6 + agg_expr_2_0;
/* 108 */     boolean agg_isNull_4 = false;
/* 109 */     long agg_value_4 = -1L;
/* 110 */     if (!false) {
/* 111 */       agg_value_4 = (long) agg_value_5;
/* 112 */     }
/* 113 */     int agg_value_10 = -1;
/* 114 */
/* 115 */     agg_value_10 = agg_expr_0_0 + agg_expr_1_0;
/* 116 */     // evaluate aggregate functions and update aggregation buffers
/* 117 */     agg_doAggregate_sum_0(agg_value_10);
/* 118 */     agg_doAggregate_avg_0(agg_value_4, agg_isNull_4);
/* 119 */
/* 120 */   }
```

On the other hand, this PR generates:
```
/* 121 */   private void agg_doConsume_0(InternalRow localtablescan_row_0, int agg_expr_0_0, int agg_expr_1_0, int agg_expr_2_0) throws java.io.IOException {
/* 122 */     // do aggregate
/* 123 */     // common sub-expressions
/* 124 */     long agg_subExprValue_0 = agg_subExpr_0(agg_expr_2_0, agg_expr_0_0, agg_expr_1_0);
/* 125 */     int agg_subExprValue_1 = agg_subExpr_1(agg_expr_0_0, agg_expr_1_0);
/* 126 */     // evaluate aggregate functions and update aggregation buffers
/* 127 */     agg_doAggregate_sum_0(agg_subExprValue_1);
/* 128 */     agg_doAggregate_avg_0(agg_subExprValue_0);
/* 129 */
/* 130 */   }
```

I ran some micro benchmarks for this PR:
```
(base) maropu~:$system_profiler SPHardwareDataType
Hardware:
    Hardware Overview:
      Processor Name: Intel Core i5
      Processor Speed: 2 GHz
      Number of Processors: 1
      Total Number of Cores: 2
      L2 Cache (per Core): 256 KB
      L3 Cache: 4 MB
      Memory: 8 GB

(base) maropu~:$java -version
java version "1.8.0_181"
Java(TM) SE Runtime Environment (build 1.8.0_181-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.181-b13, mixed mode)

(base) maropu~:$ /bin/spark-shell --master=local[1] --conf spark.driver.memory=8g --conf spark.sql.shuffle.partitions=1 -v

val numCols = 40
val colExprs = "id AS key" +: (0 until numCols).map { i => s"id AS _c$i" }
spark.range(3000000).selectExpr(colExprs: _*).createOrReplaceTempView("t")

val aggExprs = (2 until numCols).map { i =>
  (0 until i).map(d => s"_c$d")
    .mkString("AVG(", " + ", ")")
}

// Drops the time of a first run then pick that of a second run
timer { sql(s"SELECT ${aggExprs.mkString(", ")} FROM t").write.format("noop").save() }

// the master
maxCodeGen: 12957
Elapsed time: 36.309858661s

// this pr
maxCodeGen=4184
Elapsed time: 2.399490285s
```

### Why are the changes needed?

To avoid the too-long-function issue in JVMs.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Added tests in `WholeStageCodegenSuite`

Closes #25710 from maropu/SplitSubexpr.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2019-09-17 11:09:55 +09:00
Jungtaek Lim (HeartSaVioR) 88c8d5eed2 [SPARK-23539][SS][FOLLOWUP][TESTS] Add UT to ensure existing query doesn't break with default conf of includeHeaders
### What changes were proposed in this pull request?

This patch adds a new UT to ensure that an existing query (from before Spark 3.0.0) with a checkpoint doesn't break with the default configuration of "includeHeaders" introduced via SPARK-23539.

This patch also modifies the existing test that checks column types to check the headers column as well.

### Why are the changes needed?

The patch adds missing tests that guarantee backward compatibility of the SPARK-23539 change.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

UT passed.

Closes #25792 from HeartSaVioR/SPARK-23539-FOLLOWUP.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-09-16 15:22:04 -05:00
hongdd 5881871ca5 [SPARK-26929][SQL] fix table owner use user instead of principal when create table through spark-sql or beeline

## What changes were proposed in this pull request?

Fix the table owner to use the user name instead of the principal when a table is created through spark-sql or beeline.
`private val userName = conf.getUser` returns the UGI's user name, which is the principal. This PR copies that code into HiveClientImpl and uses `ugi.getShortUserName()` instead of `ugi.getUserName()`, so the owner is displayed correctly.
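
A hedged sketch of the substitution described (simplified; the actual change is inside HiveClientImpl):
```scala
import org.apache.hadoop.security.UserGroupInformation

val ugi = UserGroupInformation.getCurrentUser
val owner = ugi.getShortUserName // e.g. "alice" rather than "alice@EXAMPLE.COM"
```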

## How was this patch tested?

1. Create a table in a Kerberos cluster.
2. Use `desc formatted tbName` to check the owner.

Closes #23952 from hddong/SPARK-26929-fix-table-owner.

Lead-authored-by: hongdd <jn_hdd@163.com>
Co-authored-by: hongdongdong <hongdongdong@cmss.chinamobile.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
2019-09-16 11:07:50 -07:00
mcheah 67751e2694 [SPARK-29072][CORE] Put back usage of TimeTrackingOutputStream for UnsafeShuffleWriter and SortShuffleWriter
### What changes were proposed in this pull request?
The previous refactoring of the shuffle writers to use the shuffle writer plugin resulted in shuffle write metric updates, particularly write times, being lost in certain situations. This patch restores the lost metric updates.
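
A hedged, self-contained sketch of the technique (not Spark's actual TimeTrackingOutputStream, which reports into the task's shuffle write metrics): wrap an OutputStream and accumulate the nanoseconds spent in each write call.
```scala
import java.io.OutputStream

class TimedOutputStream(out: OutputStream, recordNanos: Long => Unit) extends OutputStream {
  override def write(b: Int): Unit = timed(out.write(b))
  override def write(b: Array[Byte], off: Int, len: Int): Unit = timed(out.write(b, off, len))
  override def flush(): Unit = timed(out.flush())
  override def close(): Unit = out.close()

  private def timed(op: => Unit): Unit = {
    val start = System.nanoTime()
    op
    recordNanos(System.nanoTime() - start)
  }
}
```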

### Why are the changes needed?
This fixes a regression. I'm pretty sure that without this, the Spark UI will lose shuffle write time information.

### Does this PR introduce any user-facing change?
No change from Spark 2.4. Without this, there would be a user-facing bug in Spark 3.0.

### How was this patch tested?
Existing unit tests.

Closes #25780 from mccheah/fix-write-metrics.

Authored-by: mcheah <mcheah@palantir.com>
Signed-off-by: Imran Rashid <irashid@cloudera.com>
2019-09-16 09:08:25 -05:00
Takeshi Yamamuro 6297287dfa [SPARK-29061][SQL] Prints bytecode statistics in debugCodegen
### What changes were proposed in this pull request?

This PR proposes to print bytecode statistics (max class bytecode size, max method bytecode size, max constant pool size, and # of inner classes) for generated classes in the debug output of `debugCodegen`. Since these metrics are critical for codegen framework development, I think it's worth printing them there. This PR enables `debugCodegen` to print these metrics as follows:
```
scala> sql("SELECT sum(v) FROM VALUES(1) t(v)").debugCodegen
Found 2 WholeStageCodegen subtrees.
== Subtree 1 / 2 (maxClassCodeSize:2693; maxMethodCodeSize:124; maxConstantPoolSize:130(0.20% used); numInnerClasses:0) ==
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
*(1) HashAggregate(keys=[], functions=[partial_sum(cast(v#0 as bigint))], output=[sum#5L])
+- *(1) LocalTableScan [v#0]

Generated code:
/* 001 */ public Object generate(Object[] references) {
/* 002 */   return new GeneratedIteratorForCodegenStage1(references);
/* 003 */ }
...
```

### Why are the changes needed?

For efficient development.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Manually tested

Closes #25766 from maropu/PrintBytecodeStats.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-09-16 21:48:07 +08:00
Dongjoon Hyun 471a3eff51 [SPARK-28932][BUILD][FOLLOWUP] Switch to scala-library compile dependency for JDK11
### What changes were proposed in this pull request?

This is a follow-up of https://github.com/apache/spark/pull/25638 to switch `scala-library` from `test` dependency to `compile` dependency in `network-common` module.

### Why are the changes needed?

Previously, we added `scala-library` as a test dependency to resolve the following error, but that was insufficient. This PR switches it to a compile dependency.
```
$ java -version
openjdk version "11.0.3" 2019-04-16
OpenJDK Runtime Environment AdoptOpenJDK (build 11.0.3+7)
OpenJDK 64-Bit Server VM AdoptOpenJDK (build 11.0.3+7, mixed mode)

$ mvn clean install -pl common/network-common -DskipTests
...
[INFO] --- scala-maven-plugin:4.2.0:doc-jar (attach-scaladocs)  spark-network-common_2.12 ---
error: fatal error: object scala in compiler mirror not found.
one error found
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
```

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Manually, run the following on JDK11.
```
$ mvn clean install -pl common/network-common -DskipTests
```

Closes #25800 from dongjoon-hyun/SPARK-28932-2.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-09-16 00:13:07 -07:00
Wenchen Fan 1b99d0cca4 [SPARK-29069][SQL] ResolveInsertInto should not do table lookup
### What changes were proposed in this pull request?

It's clearer to only do table lookups in the `ResolveTables` rule (for v2 tables) and the `ResolveRelations` rule (for v1 tables). `ResolveInsertInto` should only resolve `InsertIntoStatement`s whose relations are already resolved.

### Why are the changes needed?

to make the code simpler

### Does this PR introduce any user-facing change?

no

### How was this patch tested?

existing tests

Closes #25774 from cloud-fan/simplify.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-09-16 09:46:34 +09:00
HyukjinKwon 7d4eb38bbc [SPARK-29052][DOCS][ML][PYTHON][CORE][R][SQL][SS] Create a Migration Guide tap in Spark documentation
### What changes were proposed in this pull request?

Currently, there is no migration section for PySpark, Spark Core and Structured Streaming.
It is difficult for users to know what to do when they upgrade.

This PR proposes to create a "Migration Guide" tab in the Spark documentation.

![Screen Shot 2019-09-11 at 7 02 05 PM](https://user-images.githubusercontent.com/6477701/64688126-ad712f80-d4c6-11e9-8672-9a2c56c05bf8.png)

![Screen Shot 2019-09-11 at 7 27 15 PM](https://user-images.githubusercontent.com/6477701/64689915-389ff480-d4ca-11e9-8c54-7f46095d0d23.png)

This page will contain migration guides for Spark SQL, PySpark, SparkR, MLlib, Structured Streaming and Core. Basically it is a refactoring.

Some new information has been added, on which I will leave inline comments for easier review.

1. **MLlib**
  Merge [ml-guide.html#migration-guide](https://spark.apache.org/docs/latest/ml-guide.html#migration-guide) and [ml-migration-guides.html](https://spark.apache.org/docs/latest/ml-migration-guides.html)

    ```
    'docs/ml-guide.md'
            ↓ Merge new/old migration guides
    'docs/ml-migration-guide.md'
    ```

2. **PySpark**
  Extract PySpark specific items from https://spark.apache.org/docs/latest/sql-migration-guide-upgrade.html

    ```
    'docs/sql-migration-guide-upgrade.md'
           ↓ Extract PySpark specific items
    'docs/pyspark-migration-guide.md'
    ```

3. **SparkR**
  Move [sparkr.html#migration-guide](https://spark.apache.org/docs/latest/sparkr.html#migration-guide) into a separate file, and extract from [sql-migration-guide-upgrade.html](https://spark.apache.org/docs/latest/sql-migration-guide-upgrade.html)

    ```
    'docs/sparkr.md'                     'docs/sql-migration-guide-upgrade.md'
     Move migration guide section ↘     ↙ Extract SparkR specific items
                     docs/sparkr-migration-guide.md
    ```

4. **Core**
  Newly created at `'docs/core-migration-guide.md'`. I skimmed resolved JIRAs at 3.0.0 and found some items to note.

5. **Structured Streaming**
  Newly created at `'docs/ss-migration-guide.md'`. I skimmed resolved JIRAs at 3.0.0 and found some items to note.

6. **SQL**
  Merged [sql-migration-guide-upgrade.html](https://spark.apache.org/docs/latest/sql-migration-guide-upgrade.html) and [sql-migration-guide-hive-compatibility.html](https://spark.apache.org/docs/latest/sql-migration-guide-hive-compatibility.html)
    ```
    'docs/sql-migration-guide-hive-compatibility.md'     'docs/sql-migration-guide-upgrade.md'
     Move Hive compatibility section ↘                   ↙ Left over after filtering PySpark and SparkR items
                                  'docs/sql-migration-guide.md'
    ```

### Why are the changes needed?

To help users in production migrate effectively to higher versions, and detect behaviour changes or breaking changes before upgrading and/or migrating.

### Does this PR introduce any user-facing change?
Yes, this changes Spark's documentation at https://spark.apache.org/docs/latest/index.html.

### How was this patch tested?

Manually build the doc. This can be verified as below:

```bash
cd docs
SKIP_API=1 jekyll build
open _site/index.html
```

Closes #25757 from HyukjinKwon/migration-doc.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-09-15 11:17:30 -07:00
changchun.wang b91648cfd0 [SPARK-28856][FOLLOW-UP][SQL][TEST] Add the namespaces keyword to TableIdentifierParserSuite
### What changes were proposed in this pull request?

This PR add the `namespaces` keyword to `TableIdentifierParserSuite`.

### Why are the changes needed?
Improve the test.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
N/A

Closes #25758 from highmoutain/3.0bugfix.

Authored-by: changchun.wang <251922566@qq.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-09-15 11:11:38 -07:00
Jungtaek Lim (HeartSaVioR) 61e5aebce3 [SPARK-29046][SQL] Fix NPE in SQLConf.get when active SparkContext is stopping
### What changes were proposed in this pull request?

This patch fixes an NPE in SQLConf.get, which is only possible when SparkContext._dagScheduler is null because the SparkContext is stopping. The logic doesn't seem to consider that the active SparkContext could be in the process of stopping.

Note that it can't be encountered easily, as SparkContext.stop() blocks the main thread, but there are many cases in which SQLConf.get is accessed concurrently while SparkContext.stop() is executing: users run other threads, or a listener accesses SQLConf.get after dagScheduler has been set to null (this is the case I encountered).

### Why are the changes needed?

The bug causes an NPE.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

The previous patch (#25753) was tested with a new UT; due to interference with other tests in concurrent test runs, that test is excluded from this patch.

Closes #25790 from HeartSaVioR/SPARK-29046-v2.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-09-15 11:04:56 -07:00
Maxim Gekk 1b7afc0c98 [SPARK-28471][SQL][DOC][FOLLOWUP] Fix year patterns in the comments of date-time expressions
### What changes were proposed in this pull request?

In this PR, I propose to fix the comments of date-time expressions, replacing the `yyyy` pattern with `uuuu` where that is what the implementation actually assumes.
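
A hedged illustration of why the pattern letter matters (plain `java.time`, not Spark code): `uuuu` is the proleptic year, while `yyyy` is year-of-era, and they diverge for years before the common era.
```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter

val bce = LocalDate.of(-10, 1, 1)
DateTimeFormatter.ofPattern("uuuu-MM-dd").format(bce)   // "-0010-01-01"
DateTimeFormatter.ofPattern("yyyy-MM-dd G").format(bce) // "0011-01-01 BC"
```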

### Why are the changes needed?

To make the comments consistent with the implementations.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?

By running Scala Style checker.

Closes #25796 from MaxGekk/year-pattern-uuuu-followup.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-09-15 11:02:15 -07:00
Dongjoon Hyun 729b3180bc [SPARK-29087][CORE][STREAMING] Use DelegatingServletContextHandler to avoid CCE
### What changes were proposed in this pull request?

[SPARK-27122](https://github.com/apache/spark/pull/24088) fixed a `ClassCastException` in the `yarn` module by introducing `DelegatingServletContextHandler`. Initially, this was discovered with JDK9+, but the class path issue affected JDK8 environments, too. After [SPARK-28709](https://github.com/apache/spark/pull/25439), I also hit a similar issue in the `streaming` module.

This PR aims to fix the `streaming` module by adding `getContextPath` to `DelegatingServletContextHandler` and using it.
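
A hedged, simplified sketch of the delegation idea (not Spark's actual DelegatingServletContextHandler): callers go through a thin wrapper that exposes the methods they need, so they never cast a handler to a concrete, possibly shaded, Jetty ServletContextHandler class.
```scala
class DelegatingHandler(underlying: AnyRef) {
  // reflective forwarding keeps the wrapper independent of which Jetty class
  // (org.eclipse.jetty vs. the shaded org.sparkproject.jetty) is on the classpath
  def getContextPath: String =
    underlying.getClass.getMethod("getContextPath").invoke(underlying).asInstanceOf[String]
}
```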

### Why are the changes needed?

Currently, when we test `streaming` module independently, it fails like the following.
```
$ build/mvn test -pl streaming
...
UISeleniumSuite:
- attaching and detaching a Streaming tab *** FAILED ***
  java.lang.ClassCastException: org.sparkproject.jetty.servlet.ServletContextHandler cannot be cast to org.eclipse.jetty.servlet.ServletContextHandler
...
Tests: succeeded 337, failed 1, canceled 0, ignored 1, pending 0
*** 1 TEST FAILED ***
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[INFO] ------------------------------------------------------------------------
```

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Pass Jenkins with the modified tests, and do the following manually.
Since this is only observable when you run the `streaming` module tests alone (instead of running everything), you need to install the changed `core` module and use it.

```
$ java -version
openjdk version "1.8.0_222"
OpenJDK Runtime Environment (AdoptOpenJDK)(build 1.8.0_222-b10)
OpenJDK 64-Bit Server VM (AdoptOpenJDK)(build 25.222-b10, mixed mode)
$ build/mvn install -DskipTests
$ build/mvn test -pl streaming
```

Closes #25791 from dongjoon-hyun/SPARK-29087.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-09-15 10:15:49 -07:00
David Lewis 8c0e961f6c [SPARK-29081][CORE] Replace calls to SerializationUtils.clone on properties with a faster implementation
Replace uses of `SerializationUtils.clone` with a new `Utils.cloneProperties` method.
Add a benchmark and results showing a dramatic speed-up for effectively equivalent functionality.

### What changes were proposed in this pull request?
While I am not sure that SerializationUtils.clone is a performance issue in production, I am sure that it is overkill for the task it is doing (providing a distinct copy of a `Properties` object).
This PR provides a benchmark showing the dramatic improvement over the clone operation and replaces uses of `SerializationUtils.clone` on `Properties` with the more specialized `Utils.cloneProperties`.
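
A hedged sketch of the specialized copy (assumed shape; the real method is `Utils.cloneProperties` in Spark): copy entries directly instead of serializing and deserializing the whole `Properties` object.
```scala
import java.util.Properties
import scala.collection.JavaConverters._

def cloneProperties(props: Properties): Properties = {
  val copy = new Properties()
  // String keys/values are immutable, so a shallow copy of the entries suffices
  for ((k, v) <- props.asScala) copy.put(k, v)
  copy
}
```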

### Does this PR introduce any user-facing change?
Strings are immutable, so there is no reason to serialize and deserialize them; it just creates extra garbage.
The only functionality that would change is the unsupported insertion of non-String objects into the Spark local properties.

### How was this patch tested?

1. Pass the Jenkins with the existing tests.
2. Since this is a performance improvement PR, manually run the benchmark.

Closes #25787 from databricks-david-lewis/SPARK-29081.

Authored-by: David Lewis <david.lewis@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-09-15 00:28:32 -07:00
Andy Zhang 956f6e988c [SPARK-29080][CORE][SPARKR] Support R file extension case-insensitively
### What changes were proposed in this pull request?

Make the R file extension check case-insensitive for spark-submit.
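
A hedged sketch of the relaxed check (simplified; the real logic lives in spark-submit's argument handling): compare the extension case-insensitively so both `script.R` and `script.r` are recognized as R scripts.
```scala
import java.util.Locale

def isRScript(primaryResource: String): Boolean =
  primaryResource.toLowerCase(Locale.ROOT).endsWith(".r")
```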

### Why are the changes needed?

spark-submit does not accept `.r` files as R scripts. Some codebases have R files with lowercase file extensions, and it is inconvenient that spark-submit cannot run them. The error is not very clear (https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmitArguments.scala#L232).

```
$ ./bin/spark-submit examples/src/main/r/dataframe.r
Exception in thread "main" org.apache.spark.SparkException: Cannot load main class from JAR file:/Users/dongjoon/APACHE/spark-release/spark-2.4.4-bin-hadoop2.7/examples/src/main/r/dataframe.r
```

### Does this PR introduce any user-facing change?

Yes. spark-submit can now be used to run R scripts with `.r` file extension.

### How was this patch tested?

Manual.

```
$ mv examples/src/main/r/dataframe.R examples/src/main/r/dataframe.r
$ ./bin/spark-submit examples/src/main/r/dataframe.r
```

Closes #25778 from Loquats/r-case.

Authored-by: Andy Zhang <yue.zhang@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-09-15 00:17:11 -07:00
Pablo Langa d334fee502 [SPARK-28373][DOCS][WEBUI] JDBC/ODBC Server Tab
### What changes were proposed in this pull request?
New documentation to explain the JDBC/ODBC server tab in detail. New images are included for better explanation.

![image](https://user-images.githubusercontent.com/12819544/64735402-c4287e00-d4e8-11e9-9366-c8ac0fbfc058.png)
![image](https://user-images.githubusercontent.com/12819544/64735429-cee31300-d4e8-11e9-83f1-0b662037e194.png)

### Does this PR introduce any user-facing change?
Only documentation

### How was this patch tested?
I have generated it using "jekyll build" to ensure that it's ok

Closes #25718 from planga82/SPARK-28373_JDBCServerPage.

Lead-authored-by: Pablo Langa <soypab@gmail.com>
Co-authored-by: Unknown <soypab@gmail.com>
Co-authored-by: Pablo <soypab@gmail.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
2019-09-14 10:18:52 -07:00
Dongjoon Hyun 13b77e52d2 Revert "[SPARK-29046][SQL] Fix NPE in SQLConf.get when active SparkContext is stopping"
This reverts commit 850833fa17.
2019-09-14 00:09:45 -07:00
Shanyu Zhao 174d9104cf [SPARK-29003][CORE] Add start method to ApplicationHistoryProvider to avoid deadlock on startup
### What changes were proposed in this pull request?

During Spark History Server startup, two things happen simultaneously that both call into `java.nio.file.FileSystems.getDefault()`, and we sometimes hit [JDK-8194653](https://bugs.openjdk.java.net/browse/JDK-8194653):
1) start jetty server
2) start ApplicationHistoryProvider (which reads files from HDFS)

We should do these two things sequentially instead of in parallel.
We introduce a start() method in ApplicationHistoryProvider (and its subclass FsHistoryProvider), and do the initialization inside start() instead of the constructor.
In HistoryServer, we explicitly call provider.start() after we call bind(), which starts the Jetty server.
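
A hedged sketch of the sequencing (method names come from the description above; the types are simplified stand-ins, not Spark's actual classes): bind the web server first, then start the history provider, so the two call sites of FileSystems.getDefault() no longer race at startup.
```scala
trait WebServer { def bind(): Unit }
trait HistoryProvider { def start(): Unit } // initializes and reads event logs, e.g. from HDFS

def startHistoryServer(server: WebServer, provider: HistoryProvider): Unit = {
  server.bind()    // 1. start the Jetty server
  provider.start() // 2. only then initialize the provider
}
```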

### Why are the changes needed?
It is a bug: occasionally, starting the Spark History Server results in a process hang due to a deadlock among threads.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
I stress-tested this PR with a bash script that stopped and started the Spark History Server more than 1000 times, and it worked fine. Previously I could only do the stop/start loop fewer than 10 times before hitting the deadlock issue.

Closes #25705 from shanyu/shanyu-29003.

Authored-by: Shanyu Zhao <shzhao@microsoft.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-09-13 21:09:17 -07:00
Dongjoon Hyun 703fb2b054 [SPARK-29079][INFRA] Enable GitHub Action on PR
### What changes were proposed in this pull request?

This PR enables GitHub Action on PRs.

### Why are the changes needed?

So far, we detect JDK11 compilation errors only after merging.
This PR aims to prevent JDK11 compilation errors at the PR stage.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Manual. See the GitHub Action on this PR.

Closes #25786 from dongjoon-hyun/SPARK-29079.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: DB Tsai <d_tsai@apple.com>
2019-09-13 21:50:06 +00:00
Dongjoon Hyun 8174238d55 [SPARK-29075][BUILD] Add enforcer rule to ban duplicated pom dependency
### What changes were proposed in this pull request?

This PR aims to add a new enforcer rule to ban duplicated pom dependency during build stage.

### Why are the changes needed?

This will help us by preventing extra effort like the following.
```
e63098b287 [SPARK-29007][MLLIB][FOLLOWUP] Remove duplicated dependency
39e044e3d8 [MINOR][BUILD] Remove duplicate test-jar:test spark-sql dependency from Hive module
d8fefab4d8 [HOTFIX][BUILD][TEST-MAVEN] Remove duplicate dependency
e9445b187e [SPARK-6866][Build] Remove duplicated dependency in launcher/pom.xml
```

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Manually.

If we have something like e63098b287, the build phase will fail for the PR like the following.
```
[WARNING] Rule 0: org.apache.maven.plugins.enforcer.BanDuplicatePomDependencyVersions failed with message:
Found 1 duplicate dependency declaration in this project:
 - dependencies.dependency[org.apache.spark:spark-streaming_${scala.binary.version}:test-jar] ( 2 times )
...
[INFO] ------------------------------------------------------------------------
[INFO] BUILD FAILURE
[ERROR] Failed to execute goal org.apache.maven.plugins:maven-enforcer-plugin:3.0.0-M2:enforce (enforce-no-duplicate-dependencies) on project spark-mllib_2.12: Some Enforcer rules have failed. Look above for specific messages explaining why the rule failed. -> [Help 1]
```

Closes #25784 from dongjoon-hyun/SPARK-29075.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-09-13 14:35:02 -07:00
Dongjoon Hyun 8f057a9612 [SPARK-29032][CORE] Add PrometheusServlet to monitor Master/Worker/Driver
### What changes were proposed in this pull request?

This PR aims to simplify `Prometheus` support by adding `PrometheusServlet`. The main use cases are `K8s` and `Spark Standalone` cluster environments.

### Why are the changes needed?

Prometheus.io is a CNCF project used widely with K8s.
- https://github.com/prometheus/prometheus

For `Master/Worker/Driver`, the combination of `Spark JMX Sink` and `Prometheus JMX Converter` is used in many cases. One way to achieve that is with the following.

**JMX Sink (conf/metrics.properties)**
```
*.sink.jmx.class=org.apache.spark.metrics.sink.JmxSink
```

**JMX Converter(conf/spark-env.sh)**
- https://repo1.maven.org/maven2/io/prometheus/jmx/jmx_prometheus_javaagent/0.12.0/jmx_prometheus_javaagent-0.12.0.jar
```
export SPARK_DAEMON_JAVA_OPTS=
"-javaagent:${PWD}/jmx_prometheus_javaagent-${JMX_PROMETHEUS_VERSION}.jar=
${PORT_AGENT}:jmx_prometheus.yaml"
```

This agent approach additionally requires `PORT_AGENT`. Instead, this PR natively supports exporting in `Prometheus` format, reusing the REST API port for a better UX.

### Does this PR introduce any user-facing change?

Yes. New web interfaces are added along with the existing JSON API.

|        | JSON End Point              | Prometheus End Point               |
| ------ | --------------------------- | ---------------------------------- |
| Master | /metrics/master/json/       | /metrics/master/prometheus/        |
| Master | /metrics/applications/json/ | /metrics/applications/prometheus/  |
| Worker | /metrics/json/              | /metrics/prometheus/               |
| Driver | /metrics/json/              | /metrics/prometheus/               |

### How was this patch tested?

Manually connect the new end-points with `curl`.

**Setup (Master/Worker/Driver)**
Add the followings at `conf/metrics.properties` (`conf/metrics.properties.template` has these examples)
```
*.sink.prometheusServlet.class=org.apache.spark.metrics.sink.PrometheusServlet
*.sink.prometheusServlet.path=/metrics/prometheus
master.sink.prometheusServlet.path=/metrics/master/prometheus
applications.sink.prometheusServlet.path=/metrics/applications/prometheus
```
```
$ sbin/start-master.sh
$ sbin/start-slave.sh spark://`hostname`:7077
$ bin/spark-shell --master spark://`hostname`:7077
```

```
$ curl -s http://localhost:8080/metrics/master/json/ | jq
{
  "version": "3.1.3",
  "gauges": {
    "master.aliveWorkers": {
      "value": 1
    },
    "master.apps": {
      "value": 1
    },
    "master.waitingApps": {
      "value": 0
    },
    "master.workers": {
      "value": 1
    }
  },
...
$ curl -s http://localhost:8080/metrics/master/prometheus/ | grep master
metrics_master_aliveWorkers_Value 1
metrics_master_apps_Value 1
metrics_master_waitingApps_Value 0
metrics_master_workers_Value 1
```

```
$ curl -s http://localhost:8080/metrics/applications/json/ | jq
{
  "version": "3.1.3",
  "gauges": {
    "application.Spark shell.1568261490667.cores": {
      "value": 16
    },
    "application.Spark shell.1568261490667.runtime_ms": {
      "value": 108966
    },
    "application.Spark shell.1568261490667.status": {
      "value": "RUNNING"
    }
  },
...
$ curl -s http://localhost:8080/metrics/applications/prometheus/ | grep application
metrics_application_Spark_shell_1568261490667_cores_Value 16
metrics_application_Spark_shell_1568261490667_runtime_ms_Value 143174
```

```
$ curl -s http://localhost:8081/metrics/json/ | jq
{
  "version": "3.1.3",
  "gauges": {
    "worker.coresFree": {
      "value": 0
    },
    "worker.coresUsed": {
      "value": 16
    },
    "worker.executors": {
      "value": 1
    },
    "worker.memFree_MB": {
      "value": 30720
    },
    "worker.memUsed_MB": {
      "value": 1024
    }
  },
...
$ curl -s http://localhost:8081/metrics/prometheus/ | grep worker
metrics_worker_coresFree_Value 0
metrics_worker_coresUsed_Value 16
metrics_worker_executors_Value 1
metrics_worker_memFree_MB_Value 30720
metrics_worker_memUsed_MB_Value 1024
```

```
$ curl -s http://localhost:4040/metrics/json/ | jq
{
  "version": "3.1.3",
  "gauges": {
    "app-20190911211130-0000.driver.BlockManager.disk.diskSpaceUsed_MB": {
      "value": 0
    },
    "app-20190911211130-0000.driver.BlockManager.memory.maxMem_MB": {
      "value": 732
    },
    "app-20190911211130-0000.driver.BlockManager.memory.maxOffHeapMem_MB": {
      "value": 0
    },
    "app-20190911211130-0000.driver.BlockManager.memory.maxOnHeapMem_MB": {
      "value": 732
    },
...
$ curl -s http://localhost:4040/metrics/prometheus/ | head -n5
metrics_app_20190911211130_0000_driver_BlockManager_disk_diskSpaceUsed_MB_Value 0
metrics_app_20190911211130_0000_driver_BlockManager_memory_maxMem_MB_Value 732
metrics_app_20190911211130_0000_driver_BlockManager_memory_maxOffHeapMem_MB_Value 0
metrics_app_20190911211130_0000_driver_BlockManager_memory_maxOnHeapMem_MB_Value 732
metrics_app_20190911211130_0000_driver_BlockManager_memory_memUsed_MB_Value 0
```

Closes #25769 from dongjoon-hyun/SPARK-29032-2.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: DB Tsai <d_tsai@apple.com>
2019-09-13 21:31:21 +00:00
Dongjoon Hyun bbfaadb280 [SPARK-29064][CORE] Add PrometheusResource to export Executor metrics
### What changes were proposed in this pull request?

At Apache Spark 3.0.0, [SPARK-23429](https://github.com/apache/spark/pull/21221) added the ability to collect executor metrics via heartbeats and to expose it as a REST API. This PR aims to extend it to support `Prometheus` format additionally.

### Why are the changes needed?

Prometheus.io is a CNCF project used widely with K8s.
- https://github.com/prometheus/prometheus

### Does this PR introduce any user-facing change?

Yes. New web interfaces are added along with the existing JSON API.

|        | JSON End Point                       | Prometheus End Point           |
| ------ | ------------------------------------ | ------------------------------ |
| Driver | /api/v1/applications/{id}/executors/ | /metrics/executors/prometheus/ |

### How was this patch tested?

Manually connect to the new end-points with `curl` and compare with JSON.

**SETUP**
```
$ sbin/start-master.sh
$ sbin/start-slave.sh spark://`hostname`:7077
$ bin/spark-shell --master spark://`hostname`:7077 --conf spark.ui.prometheus.enabled=true
```

**JSON (existing after SPARK-23429)**
```
$ curl -s http://localhost:4040/api/v1/applications/app-20190911204823-0000/executors
[ {
  "id" : "driver",
  "hostPort" : "localhost:52615",
  "isActive" : true,
  "rddBlocks" : 0,
  "memoryUsed" : 0,
  "diskUsed" : 0,
  "totalCores" : 0,
  "maxTasks" : 0,
  "activeTasks" : 0,
  "failedTasks" : 0,
  "completedTasks" : 0,
  "totalTasks" : 0,
  "totalDuration" : 0,
  "totalGCTime" : 0,
  "totalInputBytes" : 0,
  "totalShuffleRead" : 0,
  "totalShuffleWrite" : 0,
  "isBlacklisted" : false,
  "maxMemory" : 384093388,
  "addTime" : "2019-09-12T03:48:23.875GMT",
  "executorLogs" : { },
  "memoryMetrics" : {
    "usedOnHeapStorageMemory" : 0,
    "usedOffHeapStorageMemory" : 0,
    "totalOnHeapStorageMemory" : 384093388,
    "totalOffHeapStorageMemory" : 0
  },
  "blacklistedInStages" : [ ],
  "peakMemoryMetrics" : {
    "JVMHeapMemory" : 229995952,
    "JVMOffHeapMemory" : 145872280,
    "OnHeapExecutionMemory" : 0,
    "OffHeapExecutionMemory" : 0,
    "OnHeapStorageMemory" : 0,
    "OffHeapStorageMemory" : 0,
    "OnHeapUnifiedMemory" : 0,
    "OffHeapUnifiedMemory" : 0,
    "DirectPoolMemory" : 75891,
    "MappedPoolMemory" : 0,
    "ProcessTreeJVMVMemory" : 0,
    "ProcessTreeJVMRSSMemory" : 0,
    "ProcessTreePythonVMemory" : 0,
    "ProcessTreePythonRSSMemory" : 0,
    "ProcessTreeOtherVMemory" : 0,
    "ProcessTreeOtherRSSMemory" : 0,
    "MinorGCCount" : 8,
    "MinorGCTime" : 82,
    "MajorGCCount" : 3,
    "MajorGCTime" : 128
  },
  "attributes" : { },
  "resources" : { }
}, {
  "id" : "0",
  "hostPort" : "127.0.0.1:52619",
  "isActive" : true,
  "rddBlocks" : 0,
  "memoryUsed" : 0,
  "diskUsed" : 0,
  "totalCores" : 16,
  "maxTasks" : 16,
  "activeTasks" : 0,
  "failedTasks" : 0,
  "completedTasks" : 0,
  "totalTasks" : 0,
  "totalDuration" : 0,
  "totalGCTime" : 0,
  "totalInputBytes" : 0,
  "totalShuffleRead" : 0,
  "totalShuffleWrite" : 0,
  "isBlacklisted" : false,
  "maxMemory" : 384093388,
  "addTime" : "2019-09-12T03:48:25.907GMT",
  "executorLogs" : {
    "stdout" : "http://127.0.0.1:8081/logPage/?appId=app-20190911204823-0000&executorId=0&logType=stdout",
    "stderr" : "http://127.0.0.1:8081/logPage/?appId=app-20190911204823-0000&executorId=0&logType=stderr"
  },
  "memoryMetrics" : {
    "usedOnHeapStorageMemory" : 0,
    "usedOffHeapStorageMemory" : 0,
    "totalOnHeapStorageMemory" : 384093388,
    "totalOffHeapStorageMemory" : 0
  },
  "blacklistedInStages" : [ ],
  "attributes" : { },
  "resources" : { }
} ]
```

**Prometheus**
```
$ curl -s http://localhost:4040/metrics/executors/prometheus
metrics_app_20190911204823_0000_driver_executor_rddBlocks_Count 0
metrics_app_20190911204823_0000_driver_executor_memoryUsed_Count 0
metrics_app_20190911204823_0000_driver_executor_diskUsed_Count 0
metrics_app_20190911204823_0000_driver_executor_totalCores_Count 0
metrics_app_20190911204823_0000_driver_executor_maxTasks_Count 0
metrics_app_20190911204823_0000_driver_executor_activeTasks_Count 0
metrics_app_20190911204823_0000_driver_executor_failedTasks_Count 0
metrics_app_20190911204823_0000_driver_executor_completedTasks_Count 0
metrics_app_20190911204823_0000_driver_executor_totalTasks_Count 0
metrics_app_20190911204823_0000_driver_executor_totalDuration_Value 0
metrics_app_20190911204823_0000_driver_executor_totalGCTime_Value 0
metrics_app_20190911204823_0000_driver_executor_totalInputBytes_Count 0
metrics_app_20190911204823_0000_driver_executor_totalShuffleRead_Count 0
metrics_app_20190911204823_0000_driver_executor_totalShuffleWrite_Count 0
metrics_app_20190911204823_0000_driver_executor_maxMemory_Count 384093388
metrics_app_20190911204823_0000_driver_executor_usedOnHeapStorageMemory_Count 0
metrics_app_20190911204823_0000_driver_executor_usedOffHeapStorageMemory_Count 0
metrics_app_20190911204823_0000_driver_executor_totalOnHeapStorageMemory_Count 384093388
metrics_app_20190911204823_0000_driver_executor_totalOffHeapStorageMemory_Count 0
metrics_app_20190911204823_0000_driver_executor_JVMHeapMemory_Count 230406336
metrics_app_20190911204823_0000_driver_executor_JVMOffHeapMemory_Count 146132592
metrics_app_20190911204823_0000_driver_executor_OnHeapExecutionMemory_Count 0
metrics_app_20190911204823_0000_driver_executor_OffHeapExecutionMemory_Count 0
metrics_app_20190911204823_0000_driver_executor_OnHeapStorageMemory_Count 0
metrics_app_20190911204823_0000_driver_executor_OffHeapStorageMemory_Count 0
metrics_app_20190911204823_0000_driver_executor_OnHeapUnifiedMemory_Count 0
metrics_app_20190911204823_0000_driver_executor_OffHeapUnifiedMemory_Count 0
metrics_app_20190911204823_0000_driver_executor_DirectPoolMemory_Count 97049
metrics_app_20190911204823_0000_driver_executor_MappedPoolMemory_Count 0
metrics_app_20190911204823_0000_driver_executor_ProcessTreeJVMVMemory_Count 0
metrics_app_20190911204823_0000_driver_executor_ProcessTreeJVMRSSMemory_Count 0
metrics_app_20190911204823_0000_driver_executor_ProcessTreePythonVMemory_Count 0
metrics_app_20190911204823_0000_driver_executor_ProcessTreePythonRSSMemory_Count 0
metrics_app_20190911204823_0000_driver_executor_ProcessTreeOtherVMemory_Count 0
metrics_app_20190911204823_0000_driver_executor_ProcessTreeOtherRSSMemory_Count 0
metrics_app_20190911204823_0000_driver_executor_MinorGCCount_Count 8
metrics_app_20190911204823_0000_driver_executor_MinorGCTime_Count 82
metrics_app_20190911204823_0000_driver_executor_MajorGCCount_Count 3
metrics_app_20190911204823_0000_driver_executor_MajorGCTime_Count 128
metrics_app_20190911204823_0000_0_executor_rddBlocks_Count 0
metrics_app_20190911204823_0000_0_executor_memoryUsed_Count 0
metrics_app_20190911204823_0000_0_executor_diskUsed_Count 0
metrics_app_20190911204823_0000_0_executor_totalCores_Count 16
metrics_app_20190911204823_0000_0_executor_maxTasks_Count 16
metrics_app_20190911204823_0000_0_executor_activeTasks_Count 0
metrics_app_20190911204823_0000_0_executor_failedTasks_Count 0
metrics_app_20190911204823_0000_0_executor_completedTasks_Count 0
metrics_app_20190911204823_0000_0_executor_totalTasks_Count 0
metrics_app_20190911204823_0000_0_executor_totalDuration_Value 0
metrics_app_20190911204823_0000_0_executor_totalGCTime_Value 0
metrics_app_20190911204823_0000_0_executor_totalInputBytes_Count 0
metrics_app_20190911204823_0000_0_executor_totalShuffleRead_Count 0
metrics_app_20190911204823_0000_0_executor_totalShuffleWrite_Count 0
metrics_app_20190911204823_0000_0_executor_maxMemory_Count 384093388
metrics_app_20190911204823_0000_0_executor_usedOnHeapStorageMemory_Count 0
metrics_app_20190911204823_0000_0_executor_usedOffHeapStorageMemory_Count 0
metrics_app_20190911204823_0000_0_executor_totalOnHeapStorageMemory_Count 384093388
metrics_app_20190911204823_0000_0_executor_totalOffHeapStorageMemory_Count 0
```

Closes #25770 from dongjoon-hyun/SPARK-29064.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: DB Tsai <d_tsai@apple.com>
2019-09-13 21:28:27 +00:00
Liang-Chi Hsieh c610de6952 [SPARK-29042][CORE] Sampling-based RDD with unordered input should be INDETERMINATE
### What changes were proposed in this pull request?

We have already found and fixed correctness issues when RDD output is INDETERMINATE. One missing part is sampling-based RDDs. This kind of RDD is order-sensitive to its input, so a sampling-based RDD with unordered input should be INDETERMINATE.

### Why are the changes needed?

A sampling-based RDD with unordered input is just like a MapPartitionsRDD with the isOrderSensitive parameter set to true: the RDD output can be different after a rerun.

It is a problem in ML applications.

In ML, sample is used to prepare training data, and the ML algorithm fits a model based on the sampled data. If rerun sample tasks produce different output during model fitting, the ML results will be unreliable and buggy.

Each sample is random output, but once you have sampled, the output should be determinate.
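
A hedged, standalone sketch of the rule described (not Spark's internal DeterministicLevel code, which is private to the project): if a sample's input is only unordered-deterministic, the sample's own output must be treated as indeterminate.
```scala
object Level extends Enumeration {
  val Determinate, Unordered, Indeterminate = Value
}

def sampledOutputLevel(parent: Level.Value): Level.Value =
  if (parent == Level.Determinate) Level.Determinate else Level.Indeterminate
```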

### Does this PR introduce any user-facing change?

Previously, a sampling-based RDD could produce different output after a rerun.
After this patch, a sampling-based RDD is INDETERMINATE. For an INDETERMINATE map stage, the Spark scheduler currently retries all the tasks of the failed stage.

### How was this patch tested?

Added test.

Closes #25751 from viirya/sample-order-sensitive.

Authored-by: Liang-Chi Hsieh <liangchi@uber.com>
Signed-off-by: Liang-Chi Hsieh <liangchi@uber.com>
2019-09-13 14:07:00 -07:00
Dongjoon Hyun e63098b287 [SPARK-29007][MLLIB][FOLLOWUP] Remove duplicated dependency
### What changes were proposed in this pull request?

This removes the duplicated dependency which is added by [SPARK-29007](b62ef8f793/mllib/pom.xml (L58-L64)).

### Why are the changes needed?

Maven complains about this kind of duplication. We had better be safe for future Maven versions.

```
$ cd mllib
$ mvn clean package -DskipTests
[INFO] Scanning for projects...
[WARNING]
[WARNING] Some problems were encountered while building the effective model for org.apache.spark:spark-mllib_2.12:jar:3.0.0-SNAPSHOT
[WARNING] 'dependencies.dependency.(groupId:artifactId:type:classifier)' must be unique: org.apache.spark:spark-streaming_${scala.binary.version}:test-jar -> duplicate declaration of version ${project.version}  line 119, column 17
[WARNING]
[WARNING] It is highly recommended to fix these problems because they threaten the stability of your build.
[WARNING]
[WARNING] For this reason, future Maven versions might no longer support building such malformed projects.
[WARNING]
...
```

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Manual check since this is a warning.
```
$ cd mllib
$ mvn clean package -DskipTests
```

Closes #25783 from dongjoon-hyun/SPARK-29007.

Authored-by: Dongjoon Hyun <dhyun@apple.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-09-13 11:54:46 -07:00
Lee Dongjin 1675d5114e [SPARK-23539][SS] Add support for Kafka headers in Structured Streaming
## What changes were proposed in this pull request?

This update adds support for Kafka Headers functionality in Structured Streaming.
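
A hedged sketch of consuming the new headers support from a streaming query (the option and column names follow SPARK-23539; `spark`, the server, and the topic are illustrative):
```scala
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host1:9092")
  .option("subscribe", "topic1")
  .option("includeHeaders", "true")
  .load()
  .selectExpr("CAST(key AS STRING)", "CAST(value AS STRING)", "headers")
```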

## How was this patch tested?

With following unit tests:

- KafkaRelationSuite: "default starting and ending offsets with headers" (new)
- KafkaSinkSuite: "batch - write to kafka" (updated)

Closes #22282 from dongjinleekr/feature/SPARK-23539.

Lead-authored-by: Lee Dongjin <dongjin@apache.org>
Co-authored-by: Jungtaek Lim <kabhwan@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-09-13 12:31:28 -05:00
Huaxin Gao 77e9b58d4f [SPARK-28969][PYTHON][ML] OneVsRestParams parity between scala and python
### What changes were proposed in this pull request?
Following the Scala ```OneVsRestParams``` implementation, move ```setClassifier``` from ```OneVsRestParams``` to ```OneVsRest``` on the PySpark side.

### Why are the changes needed?
1. Maintain the parity between scala and python code.
2. ```Classifier``` can only be set in the estimator.

### Does this PR introduce any user-facing change?
Yes.
Previous behavior: ```OneVsRestModel``` has method ```setClassifier```
Current behavior:  ```setClassifier``` is removed from ```OneVsRestModel```. ```classifier``` can only be set in ```OneVsRest```.

### How was this patch tested?
Use existing tests

Closes #25715 from huaxingao/spark-28969.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-09-13 12:29:19 -05:00
Juliusz Sompolski fcf9b41b49 [SPARK-29056] ThriftServerSessionPage displays 1970/01/01 finish and close time when unset
### What changes were proposed in this pull request?

ThriftServerSessionPage displays timestamp 0 (1970/01/01) instead of nothing if query finish time and close time are not set.

![image](https://user-images.githubusercontent.com/25019163/64711118-6d578000-d4b9-11e9-9b11-2e3616319a98.png)

Change it to display nothing, like ThriftServerPage.
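
A hedged sketch of the display fix (helper and parameter names are illustrative, not the actual page code): an unset time (0) renders as an empty cell instead of being formatted as 1970/01/01.
```scala
import java.text.DateFormat
import java.util.Date

def formatTimeCell(timeMillis: Long, format: DateFormat): String =
  if (timeMillis > 0) format.format(new Date(timeMillis)) else ""
```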

### Why are the changes needed?

Obvious bug.

### Does this PR introduce any user-facing change?

Finish time and Close time will be displayed correctly on ThriftServerSessionPage in JDBC/ODBC Spark UI.

### How was this patch tested?

Manual test.

Closes #25762 from juliuszsompolski/SPARK-29056.

Authored-by: Juliusz Sompolski <julek@databricks.com>
Signed-off-by: Yuming Wang <wgyumg@gmail.com>
2019-09-13 09:13:57 -07:00
aman_omer d59980783e [SPARK-28795][DOC][SQL] Document CREATE VIEW statement in SQL Reference
### What changes were proposed in this pull request?
Added documentation for the CREATE VIEW command.

### Why are the changes needed?
As a reference for the syntax and examples of the CREATE VIEW command.

### How was this patch tested?
Documentation update. Verified manually.

Closes #25543 from amanomer/spark-28795.

Lead-authored-by: aman_omer <amanomer1996@gmail.com>
Co-authored-by: Xiao Li <gatorsmile@gmail.com>
Co-authored-by: Aman Omer <amanomer1996@gmail.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
2019-09-12 23:44:23 -07:00
sandeep katta b83304fb01 [SPARK-28796][DOC] Document DROP DATABASE statement in SQL Reference
### What changes were proposed in this pull request?
Document DROP DATABASE statement in SQL Reference

### Why are the changes needed?
Currently Spark does not have a complete SQL guide, so it is better to document all the SQL commands; this JIRA is a sub-part of that task.

### Does this PR introduce any user-facing change?
Yes. Before, there was no documentation about the DROP DATABASE syntax.

After Fix
![image](https://user-images.githubusercontent.com/35216143/64787097-977a7200-d58d-11e9-911c-d2ff6f3ccff5.png)
![image](https://user-images.githubusercontent.com/35216143/64787122-a6612480-d58d-11e9-978c-9455baff007f.png)

### How was this patch tested?
Tested with a jekyll build.

Closes #25554 from sandeep-katta/dropDbDoc.

Authored-by: sandeep katta <sandeep.katta2007@gmail.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
2019-09-12 23:10:50 -07:00
Kevin Yu ee63031270 [SPARK-28828][DOC] Document REFRESH TABLE command
### What changes were proposed in this pull request?
Document REFRESH TABLE statement in the SQL Reference Guide.

### Why are the changes needed?
Currently there is no documentation in Spark SQL describing how to use this command; this PR addresses that.

### Does this PR introduce any user-facing change?
Yes.
#### Before:
There is no documentation for this.

#### After:
<img width="826" alt="Screen Shot 2019-09-12 at 11 39 21 AM" src="https://user-images.githubusercontent.com/7550280/64811385-01752600-d552-11e9-876d-91ebb005b851.png">

### How was this patch tested?
Using jekyll build --serve

Closes #25549 from kevinyu98/spark-28828-refreshTable.

Authored-by: Kevin Yu <qyu@us.ibm.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
2019-09-12 23:00:42 -07:00
WeichenXu 5631a96367 [SPARK-29048] Improve performance on Column.isInCollection() with a large size collection
### What changes were proposed in this pull request?
`Column.isInCollection()` with a large collection generates an expression with a large number of child expressions, which makes the analyzer and optimizer take a long time to run.
In this PR, the `isInCollection()` function directly generates an `InSet` expression, avoiding the creation of too many child expressions.

### Why are the changes needed?
`Column.isInCollection()` with a large collection sometimes becomes a bottleneck when running SQL.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Manually benchmark it in spark-shell:
```scala
import org.apache.spark.sql.functions.col

def testExplainTime(collectionSize: Int) = {
  val df = spark.range(10).withColumn("id2", col("id") + 1)
  val list = Range(0, collectionSize).toList
  val startTime = System.currentTimeMillis()
  df.where(col("id").isInCollection(list)).where(col("id2").isInCollection(list)).explain()
  val elapsedTime = System.currentTimeMillis() - startTime
  println(s"cost time: ${elapsedTime}ms")
}
```
Then, testing with collection sizes 5, 10, 100, 1000, and 10000, the results are:

collection size | explain time (before) | explain time (after)
------ | ------ | ------
5 | 26ms | 29ms
10 | 30ms | 48ms
100 | 104ms | 50ms
1000 | 1202ms | 58ms
10000 | 10012ms | 523ms

Closes #25754 from WeichenXu123/improve_in_collection.

Lead-authored-by: WeichenXu <weichen.xu@databricks.com>
Co-authored-by: Xiao Li <gatorsmile@gmail.com>
Signed-off-by: Xiao Li <gatorsmile@gmail.com>
2019-09-12 17:23:08 -07:00
maryannxue c56a012bc8 [SPARK-29060][SQL] Add tree traversal helper for adaptive spark plans
### What changes were proposed in this pull request?
This PR adds a utility class `AdaptiveSparkPlanHelper` which provides methods for tree traversal of an `AdaptiveSparkPlanExec` plan. Unlike their counterparts in `TreeNode` or `QueryPlan`, these methods traverse down into the leaf nodes of adaptive plans, i.e., `AdaptiveSparkPlanExec` and `QueryStageExec`.
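
A hedged sketch of how such a traversal might look in a test; the method signature of `collect` is assumed to mirror its `TreeNode` counterpart, and the class name here is illustrative:

```scala
import org.apache.spark.sql.execution.SparkPlan
import org.apache.spark.sql.execution.adaptive.AdaptiveSparkPlanHelper
import org.apache.spark.sql.execution.exchange.ShuffleExchangeExec

// Mixing the helper into a suite-like class gives traversal methods that also
// descend into AdaptiveSparkPlanExec and QueryStageExec leaf nodes, where a
// plain TreeNode.collect would stop.
class AdaptiveTraversalSketch extends AdaptiveSparkPlanHelper {
  def shuffleExchanges(plan: SparkPlan): Seq[ShuffleExchangeExec] =
    collect(plan) { case s: ShuffleExchangeExec => s }
}
```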

### Why are the changes needed?
This utility class can greatly simplify tree traversal code for adaptive spark plans.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Refined `AdaptiveQueryExecSuite` with the help of the new utility methods.

Closes #25764 from maryannxue/aqe-utils.

Authored-by: maryannxue <maryannxue@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-09-12 21:49:21 +08:00
Maxim Gekk 8e9fafbb21 [SPARK-29065][SQL][TEST] Extend EXTRACT benchmark
### What changes were proposed in this pull request?

In the PR, I propose to extend `ExtractBenchmark` and add new ones for:
- `EXTRACT` and `DATE` as input column
- the `DATE_PART` function and `DATE`/`TIMESTAMP` input column

### Why are the changes needed?

The `EXTRACT` expression is rebased on the `DATE_PART` expression by the PR https://github.com/apache/spark/pull/25410, where some of the sub-expressions take a `DATE` column as input (`Millennium`, `Year`, etc.) while others require a `TIMESTAMP` column (`Hour`, `Minute`). Separate benchmarks for `DATE` should exclude the overhead of implicit `DATE` <-> `TIMESTAMP` conversions.
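
For reference, a small hedged example of the two equivalent forms the benchmark exercises, runnable on a build that includes the `DATE_PART` expression (master at the time of this PR); the literal values are illustrative:

```scala
import org.apache.spark.sql.SparkSession

object ExtractVsDatePartSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("SPARK-29065-sketch").getOrCreate()

    // YEAR can be computed directly from a DATE value ...
    spark.sql("SELECT EXTRACT(YEAR FROM DATE '2019-09-12')").show()
    spark.sql("SELECT DATE_PART('YEAR', DATE '2019-09-12')").show()

    // ... while HOUR needs a TIMESTAMP, so a DATE input would be implicitly cast.
    spark.sql("SELECT EXTRACT(HOUR FROM TIMESTAMP '2019-09-12 10:11:12')").show()
    spark.sql("SELECT DATE_PART('HOUR', TIMESTAMP '2019-09-12 10:11:12')").show()

    spark.stop()
  }
}
```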

### Does this PR introduce any user-facing change?

No, it doesn't.

### How was this patch tested?
- Regenerated results of `ExtractBenchmark`

Closes #25772 from MaxGekk/date_part-benchmark.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Takeshi Yamamuro <yamamuro@apache.org>
2019-09-12 21:32:35 +09:00
Wenchen Fan 053dd858d3 [SPARK-28998][SQL] reorganize the packages of DS v2 interfaces/classes
### What changes were proposed in this pull request?

reorganize the packages of DS v2 interfaces/classes:
1. `org.apache.spark.sql.connector.catalog`: put `TableCatalog`, `Table` and other related interfaces/classes
2. `org.apache.spark.sql.connector.expression`: put `Expression`, `Transform` and other related interfaces/classes
3. `org.apache.spark.sql.connector.read`: put `ScanBuilder`, `Scan` and other related interfaces/classes
4. `org.apache.spark.sql.connector.write`: put `WriteBuilder`, `BatchWrite` and other related interfaces/classes

### Why are the changes needed?

Data Source V2 has evolved a lot. It's a bit weird that `Expression` is in `org.apache.spark.sql.catalog.v2` while `Table` is in `org.apache.spark.sql.sources.v2`.
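
For users of the DS v2 APIs, the visible effect is on import paths. A hedged sketch of the post-reorg imports, using the class names from the list above (the check itself is illustrative):

```scala
// After this reorg the DS v2 interfaces live under org.apache.spark.sql.connector.*;
// previously Table and friends sat under the catalog.v2 / sources.v2 packages.
import org.apache.spark.sql.connector.catalog.{Table, TableCatalog}
import org.apache.spark.sql.connector.read.{Scan, ScanBuilder}
import org.apache.spark.sql.connector.write.{BatchWrite, WriteBuilder}

object ConnectorPackageSketch {
  def main(args: Array[String]): Unit = {
    // Simply print the relocated class names to confirm where they now live.
    Seq(classOf[TableCatalog], classOf[Table],
        classOf[ScanBuilder], classOf[Scan],
        classOf[WriteBuilder], classOf[BatchWrite])
      .foreach(c => println(c.getName))
  }
}
```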

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

existing tests

Closes #25700 from cloud-fan/package.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-09-12 19:59:34 +08:00
sandeep katta 7e6142591f [SPARK-28840][SQL] conf.getClassLoader in SparkSQLCLIDriver should be avoided as it returns the UDFClassLoader which is created by Hive
### What changes were proposed in this pull request?

Spark loads the jars into a custom class loader which is returned by `getSubmitClassLoader`.
 [Spark code](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L337)

**In Hive version 1.2.1.spark2**

`HiveConf.getClassLoader` returns the same class loader that was set by Spark.

**In Hive 2.3.5**
`HiveConf.getClassLoader` returns the UDFClassLoader which is created by Hive. Because of this, Spark cannot find the jars, as the class loader has changed.
[Hive code](https://github.com/apache/hive/blob/rel/release-2.3.5/ql/src/java/org/apache/hadoop/hive/ql/session/SessionState.java#L395)

### Why are the changes needed?
Before creating the `CliSessionState` object, save the current class loader in a reference.
After `SessionState.start()`, reset the class loader back to the one saved earlier.
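
A minimal sketch of that save-and-restore idea, assuming the class loader to preserve is the one on the current thread; the real patch works inside SparkSQLCLIDriver, and the object and method names here are illustrative:

```scala
import org.apache.hadoop.hive.ql.session.SessionState

object ClassLoaderPreservingStart {
  def start(state: SessionState): Unit = {
    // Remember the class loader Spark installed (it knows about --jars).
    val sparkLoader = Thread.currentThread().getContextClassLoader
    SessionState.start(state)
    // Hive 2.3.x may have swapped in its UDFClassLoader; restore Spark's loader.
    Thread.currentThread().setContextClassLoader(sparkLoader)
  }
}
```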

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Added a test case and also manually tested.

**Before Fix**
![b4Fix](https://user-images.githubusercontent.com/35216143/63442838-6789f400-c451-11e9-9529-ccf4ea9621b9.png)

**After Fix**
![afterFix](https://user-images.githubusercontent.com/35216143/63442860-707ac580-c451-11e9-8012-2b70934d55f3.png)

Closes #25542 from sandeep-katta/jarIssue.

Lead-authored-by: sandeep katta <sandeep.katta2007@gmail.com>
Co-authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Yuming Wang <wgyumg@gmail.com>
2019-09-12 03:47:30 -07:00
LantaoJin 6768431c97 [SPARK-29045][SQL][TESTS] Drop table to avoid test failure in SQLMetricsSuite
### What changes were proposed in this pull request?

In the method `SQLMetricsTestUtils.testMetricsDynamicPartition()`, there is a CREATE TABLE statement without a `withTable` block. This causes a test failure if the same table name is used in other unit tests.
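
The `withTable` pattern the fix relies on looks roughly like this; a standalone sketch, assuming an explicit SparkSession parameter, whereas the real helper lives in SQLTestUtils:

```scala
import org.apache.spark.sql.SparkSession

object WithTableSketch {
  // Run `f` and drop the given tables afterwards, even if `f` throws, so a
  // leftover table cannot break an unrelated test that reuses the name.
  def withTable(spark: SparkSession)(tableNames: String*)(f: => Unit): Unit = {
    try f finally {
      tableNames.foreach(name => spark.sql(s"DROP TABLE IF EXISTS $name"))
    }
  }
}
```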

### Why are the changes needed?
To avoid "table already exists" in tests.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Existing UTs.

Closes #25752 from LantaoJin/SPARK-29045.

Authored-by: LantaoJin <jinlantao@gmail.com>
Signed-off-by: Yuming Wang <wgyumg@gmail.com>
2019-09-11 23:05:03 -07:00
Jungtaek Lim (HeartSaVioR) 850833fa17 [SPARK-29046][SQL] Fix NPE in SQLConf.get when active SparkContext is stopping
### What changes were proposed in this pull request?

This patch fixes a bug causing an NPE in SQLConf.get, which is only possible when SparkContext._dagScheduler is null because the SparkContext is being stopped. The logic doesn't seem to consider that the active SparkContext could be in the process of stopping.

Note that it can't be encountered easily, as `SparkContext.stop()` blocks the main thread, but there are many cases in which SQLConf.get is accessed concurrently while SparkContext.stop() is executing - users run other threads, or a listener accesses SQLConf.get after dagScheduler has been set to null (the latter is the case I encountered).
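
A hedged sketch of that concurrent access pattern (illustrative only: whether the NPE actually fires depends on timing, and `SQLConf` is an internal API normally touched from inside Spark code such as listeners):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.internal.SQLConf

object SqlConfDuringStopSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("SPARK-29046-sketch").getOrCreate()

    // A listener-like thread keeps consulting SQLConf.get while ...
    val reader = new Thread(() => {
      (1 to 10000).foreach(_ => SQLConf.get.getConf(SQLConf.SHUFFLE_PARTITIONS))
    })
    reader.start()

    // ... the main thread stops the SparkContext; before the fix this race
    // could surface as an NPE from SQLConf.get.
    spark.stop()
    reader.join()
  }
}
```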

### Why are the changes needed?

The bug causes an NPE.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Added a new UT to verify the NPE doesn't occur. Without the patch, the test fails by throwing an NPE.

Closes #25753 from HeartSaVioR/SPARK-29046.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-09-12 11:16:33 +09:00
dengziming 8f632d7045 [MINOR][DOCS] Fix few typos in the java docs
JIRA: https://issues.apache.org/jira/browse/SPARK-29050
'a hdfs' changed into 'an hdfs'
'an unique' changed into 'a unique'
'an url' changed into 'a url'
'a error' changed into 'an error'

Closes #25756 from dengziming/feature_fix_typos.

Authored-by: dengziming <dengziming@growingio.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-09-12 09:30:03 +09:00
Wenchen Fan eec728a0d4 [SPARK-29057][SQL] remove InsertIntoTable
### What changes were proposed in this pull request?

Remove `InsertIntoTable` and replace its usage with `InsertIntoStatement`.

### Why are the changes needed?

`InsertIntoTable` and `InsertIntoStatement` are almost identical (except for some naming). It doesn't make sense to keep two nearly identical plans. After the removal of `InsertIntoTable`, the analysis process becomes:
1. parser creates `InsertIntoStatement`
2. v2 rule `ResolveInsertInto` converts `InsertIntoStatement` to v2 commands.
3. v1 rules like `DataSourceAnalysis` and `HiveAnalysis` convert `InsertIntoStatement` to v1 commands.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

existing tests

Closes #25763 from cloud-fan/remove.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-09-12 09:24:36 +09:00
HyukjinKwon 7ce0f2b499 [SPARK-29041][PYTHON] Allows createDataFrame to accept bytes as binary type
### What changes were proposed in this pull request?

This PR proposes to allow `bytes` as an acceptable type for binary type for `createDataFrame`.

### Why are the changes needed?

`bytes` is the standard type for binary data in Python. This should be respected on the PySpark side.

### Does this PR introduce any user-facing change?

Yes. _When the specified type is binary_, we will allow `bytes` as a binary type. Previously, this was not allowed in either Python 2 or Python 3, as shown below:

```python
spark.createDataFrame([[b"abcd"]], "col binary")
```

in Python 3

```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../spark/python/pyspark/sql/session.py", line 787, in createDataFrame
    rdd, schema = self._createFromLocal(map(prepare, data), schema)
  File "/.../spark/python/pyspark/sql/session.py", line 442, in _createFromLocal
    data = list(data)
  File "/.../spark/python/pyspark/sql/session.py", line 769, in prepare
    verify_func(obj)
  File "/.../forked/spark/python/pyspark/sql/types.py", line 1403, in verify
    verify_value(obj)
  File "/.../spark/python/pyspark/sql/types.py", line 1384, in verify_struct
    verifier(v)
  File "/.../spark/python/pyspark/sql/types.py", line 1403, in verify
    verify_value(obj)
  File "/.../spark/python/pyspark/sql/types.py", line 1397, in verify_default
    verify_acceptable_types(obj)
  File "/.../spark/python/pyspark/sql/types.py", line 1282, in verify_acceptable_types
    % (dataType, obj, type(obj))))
TypeError: field col: BinaryType can not accept object b'abcd' in type <class 'bytes'>
```

in Python 2:

```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../spark/python/pyspark/sql/session.py", line 787, in createDataFrame
    rdd, schema = self._createFromLocal(map(prepare, data), schema)
  File "/.../spark/python/pyspark/sql/session.py", line 442, in _createFromLocal
    data = list(data)
  File "/.../spark/python/pyspark/sql/session.py", line 769, in prepare
    verify_func(obj)
  File "/.../spark/python/pyspark/sql/types.py", line 1403, in verify
    verify_value(obj)
  File "/.../spark/python/pyspark/sql/types.py", line 1384, in verify_struct
    verifier(v)
  File "/.../spark/python/pyspark/sql/types.py", line 1403, in verify
    verify_value(obj)
  File "/.../spark/python/pyspark/sql/types.py", line 1397, in verify_default
    verify_acceptable_types(obj)
  File "/.../spark/python/pyspark/sql/types.py", line 1282, in verify_acceptable_types
    % (dataType, obj, type(obj))))
TypeError: field col: BinaryType can not accept object 'abcd' in type <type 'str'>
```

So, it won't break anything.

### How was this patch tested?

Unit tests were added, and it was also manually tested as below.

```bash
./run-tests --python-executables=python2,python3 --testnames "pyspark.sql.tests.test_serde"
```

Closes #25749 from HyukjinKwon/SPARK-29041.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-09-12 08:52:25 +09:00
Jungtaek Lim (HeartSaVioR) c18f849d76 [SPARK-24663][STREAMING][TESTS] StreamingContextSuite: Wait until slow receiver has been initialized, but with hard timeout
### What changes were proposed in this pull request?

This patch fixes the flaky test failure in StreamingContextSuite "stop slow receiver gracefully" by adding a flag that indicates whether initialization of the slow receiver has completed, and waiting for that flag to become true. Since the receiver is submitted via a job and initialized on an executor, 500ms might not be enough to cover all cases.
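
A hedged sketch of the flag-and-wait idea; the names and the timeout are illustrative, and the real flag is flipped from the receiver's onStart path rather than from a shared object like this:

```scala
import java.util.concurrent.atomic.AtomicBoolean

import org.scalatest.concurrent.Eventually._
import org.scalatest.time.SpanSugar._

object SlowReceiverGate {
  // Set to true once the slow receiver has finished its (slow) initialization.
  val initialized = new AtomicBoolean(false)

  // The test calls this instead of sleeping a fixed 500ms before stopping the
  // StreamingContext, so slow executors are covered by a hard upper bound.
  def awaitReceiverInitialized(): Unit = {
    eventually(timeout(60.seconds)) {
      assert(initialized.get(), "slow receiver has not been initialized yet")
    }
  }
}
```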

### Why are the changes needed?

We have received several reports of failures for this test. Please refer to [SPARK-24663](https://issues.apache.org/jira/browse/SPARK-24663)

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Modified UT. I artificially delayed handling of job submission by adding the code below in `DAGScheduler.submitJob`:

```scala
if (rdd != null && rdd.name != null && rdd.name.startsWith("Receiver")) {
  println(s"Receiver Job! rdd name: ${rdd.name}")
  Thread.sleep(1000)
}
```

and the test "stop slow receiver gracefully" failed on current master and passed on the patch.

Closes #25725 from HeartSaVioR/SPARK-24663.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
2019-09-11 13:31:43 -07:00
Jungtaek Lim (HeartSaVioR) b62ef8f793 [SPARK-29007][STREAMING][MLLIB][TESTS] Enforce not leaking SparkContext in tests which creates new StreamingContext with new SparkContext
### What changes were proposed in this pull request?

This patch enforces that tests do not leak the SparkContext that is newly created when initializing a StreamingContext. Leaking a SparkContext in a test would cause most of the following tests to fail as well, so this patch applies defensive programming, trying its best to ensure the SparkContext is cleaned up.
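
A hedged sketch of the defensive pattern described above: whatever happens in the test body, both the StreamingContext and the SparkContext it created are stopped so they cannot leak into later tests. The helper and object names are illustrative, not the suite's actual code:

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingTestCleanupSketch {
  // Run the test body and always tear down both the StreamingContext and the
  // SparkContext it created, so neither leaks into the tests that follow.
  def withStreamingContext(conf: SparkConf)(body: StreamingContext => Unit): Unit = {
    val ssc = new StreamingContext(conf, Seconds(1))
    try {
      body(ssc)
    } finally {
      ssc.stop(stopSparkContext = true)
    }
  }
}
```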

### Why are the changes needed?

We have seen cases in CI builds where a SparkContext is leaked and other tests are affected by it. Ideally, we should isolate the environment between tests where possible.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Modified UTs.

Closes #25709 from HeartSaVioR/SPARK-29007.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
2019-09-11 10:29:13 -07:00
Jungtaek Lim (HeartSaVioR) 2736efa32d [SPARK-26989][CORE][TEST] DAGSchedulerSuite: ensure listeners are fully processed before checking recorded values
### What changes were proposed in this pull request?

This patch ensures that recorded values in a listener are accessed only after the listeners have fully processed all events. To ensure this, the patch adds a new class that hides these values and exposes them through methods which enforce the above condition. Without this guard, two threads run concurrently - 1) the listener-processing thread and 2) the test main thread - and a race condition can occur.

That is why we also see something very odd: an error message saying the condition is met even though the test failed:
```
- Barrier task failures from the same stage attempt don't trigger multiple stage retries *** FAILED ***
  ArrayBuffer(0) did not equal List(0) (DAGSchedulerSuite.scala:2656)
```
which means the verification failed, yet the condition was met just before the error message was constructed.

The guard is properly placed in many spots, but missing in some places. This patch enforces that it can't be missed.
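
A sketch of what such a guard amounts to; not the suite's actual code, and it assumes it lives in the org.apache.spark package, since listenerBus is private[spark]:

```scala
package org.apache.spark

object ListenerGuardSketch {
  // Drain every pending listener event before reading anything the listener
  // recorded, so the test thread no longer races the listener-processing thread.
  def afterListenersProcessed(sc: SparkContext, timeoutMillis: Long = 10000L)(check: => Unit): Unit = {
    sc.listenerBus.waitUntilEmpty(timeoutMillis)
    check
  }
}
```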

### Why are the changes needed?

The UT fails intermittently, and this patch addresses the flakiness.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Modified UT.

Also made the flaky tests fail artificially by applying a 50ms sleep in each onXXX method.

![Screen Shot 2019-09-07 at 7 44 15 AM](https://user-images.githubusercontent.com/1317309/64465178-1747ad00-d146-11e9-92f6-f4ed4a1f4b08.png)

I found 3 tests failing. (They are marked as X. Ignore the ones marked with !, as they failed while waiting for the listener within the given timeout and these tests don't deal with the recorded values - they use a different timeout of 1000ms instead of 10000ms for this listener, so they are affected only as a side effect.)

When I applied the same change as in this patch, all tests marked as X passed.

Closes #25706 from HeartSaVioR/SPARK-26989.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
2019-09-11 10:24:57 -07:00
Liu,Linhong f263909ee1 [SPARK-23243][CORE][FOLLOWUP] Remove todo added by SPARK-23207
### What changes were proposed in this pull request?
PR #22112 fixed the TODO added by PR #20393 (SPARK-23207). We can remove it now.

### Why are the changes needed?
To avoid confusing developers.

### Does this PR introduce any user-facing change?
no

### How was this patch tested?
no need to test

Closes #25755 from LinhongLiu/remove-todo.

Authored-by: Liu,Linhong <liulinhong@baidu.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-09-11 21:51:48 +08:00