Commit graph

25454 commits

Author SHA1 Message Date
Maxim Gekk ece4213176 [SPARK-21914][FOLLOWUP][TEST-HADOOP3.2][TEST-JAVA11] Clone SparkSession per each function example
### What changes were proposed in this pull request?
In the PR, I propose to clone Spark session per-each expression example. Examples can modify SQL settings, and can influence on each other if they run in the same Spark session in parallel.

### Why are the changes needed?
This should fix test failures like [this](https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-maven-hadoop-3.2-jdk-11/478/testReport/junit/org.apache.spark.sql/SQLQuerySuite/check_outputs_of_expression_examples/) checking of the `Like` example:
```
org.apache.spark.sql.AnalysisException: the pattern '\%SystemDrive\%\Users%' is invalid, the escape character is not allowed to precede 'U';
      at org.apache.spark.sql.catalyst.util.StringUtils$.fail$1(StringUtils.scala:48)
      at org.apache.spark.sql.catalyst.util.StringUtils$.escapeLikeRegex(StringUtils.scala:57)
      at org.apache.spark.sql.catalyst.expressions.Like.escape(regexpExpressions.scala:108)
```

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
By running `check outputs of expression examples` in `org.apache.spark.sql.SQLQuerySuite`

Closes #25956 from MaxGekk/fix-expr-examples-checks.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-09-29 02:57:55 +09:00
Jungtaek Lim (HeartSaVioR) d72f39897b
[SPARK-27254][SS] Cleanup complete but invalid output files in ManifestFileCommitProtocol if job is aborted
## What changes were proposed in this pull request?

SPARK-27210 enables ManifestFileCommitProtocol to clean up incomplete output files in task level if task is aborted.

This patch extends the area of cleaning up, proposes ManifestFileCommitProtocol to clean up complete but invalid output files in job level if job aborts. Please note that this works as 'best-effort', not kind of guarantee, as we have in HadoopMapReduceCommitProtocol.

## How was this patch tested?

Added UT.

Closes #24186 from HeartSaVioR/SPARK-27254.

Lead-authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com>
Co-authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan.opensource@gmail.com>
Signed-off-by: Shixiong Zhu <zsxwing@gmail.com>
2019-09-27 12:35:26 -07:00
Jeff Evans 233c214a75 [SPARK-29070][CORE] Make SparkLauncher log full spark-submit command line
Log the full spark-submit command in SparkSubmit#launchApplication

Adding .python-version (pyenv file) to RAT exclusion list

### What changes were proposed in this pull request?

Original motivation [here](http://apache-spark-user-list.1001560.n3.nabble.com/Is-it-possible-to-obtain-the-full-command-to-be-invoked-by-SparkLauncher-td35144.html), expanded in the [Jira](https://issues.apache.org/jira/browse/SPARK-29070)..  In essence, we want to be able to log the full `spark-submit` command being constructed by `SparkLauncher`

### Why are the changes needed?

Currently, it is not possible to directly obtain this information from the `SparkLauncher` instance, which makes debugging and customer support more difficult.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

`core` `sbt` tests were executed.  The `SparkLauncherSuite` (where I added assertions to an existing test) was also checked.  Within that, `testSparkLauncherGetError` is failing, but that appears not to have been caused by this change (failing for me even on the parent commit of c18f849d76).

Closes #25777 from jeff303/SPARK-29070.

Authored-by: Jeff Evans <jeffrey.wayne.evans@gmail.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
2019-09-27 11:32:22 -07:00
Juliusz Sompolski 420abb457d [SPARK-29263][SCHEDULER] Update availableSlots in resourceOffers() before checking available slots for barrier taskSet
### What changes were proposed in this pull request?

availableSlots are computed before the for loop looping over all TaskSets in resourceOffers. But the number of slots changes in every iteration, as in every iteration these slots are taken. The number of available slots checked by a barrier task set has therefore to be recomputed in every iteration from availableCpus.

### Why are the changes needed?

Bugfix.
This could make resourceOffer attempt to start a barrier task set, even though it has not enough slots available. That would then be caught by the `require` in line 519, which will throw an exception, which will get caught and ignored by Dispatcher's MessageLoop, so nothing terrible would happen, but the exception would prevent resourceOffers from considering further TaskSets.
Note that launching the barrier TaskSet can still fail if other requirements are not satisfied, and still can be rolled-back by throwing exception in this `require`. Handling it more gracefully remains a TODO in SPARK-24818, but this fix at least should resolve the situation when it's unable to launch because of insufficient slots.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Added UT

Closes #23375

Closes #25946 from juliuszsompolski/SPARK-29263.

Authored-by: Juliusz Sompolski <julek@databricks.com>
Signed-off-by: Xingbo Jiang <xingbo.jiang@databricks.com>
2019-09-27 11:18:32 -07:00
HyukjinKwon fda0e6e48d [SPARK-29240][PYTHON] Pass Py4J column instance to support PySpark column in element_at function
### What changes were proposed in this pull request?

This PR makes `element_at` in PySpark able to take PySpark `Column` instances.

### Why are the changes needed?

To match with Scala side. Seems it was intended but not working correctly as a bug.

### Does this PR introduce any user-facing change?

Yes. See below:

```python
from pyspark.sql import functions as F
x = spark.createDataFrame([([1,2,3],1),([4,5,6],2),([7,8,9],3)],['list','num'])
x.withColumn('aa',F.element_at('list',x.num.cast('int'))).show()
```

Before:

```
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/.../spark/python/pyspark/sql/functions.py", line 2059, in element_at
    return Column(sc._jvm.functions.element_at(_to_java_column(col), extraction))
  File "/.../spark/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py", line 1277, in __call__
  File "/.../spark/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py", line 1241, in _build_args
  File "/.../spark/python/lib/py4j-0.10.8.1-src.zip/py4j/java_gateway.py", line 1228, in _get_args
  File "/.../forked/spark/python/lib/py4j-0.10.8.1-src.zip/py4j/java_collections.py", line 500, in convert
  File "/.../spark/python/pyspark/sql/column.py", line 344, in __iter__
    raise TypeError("Column is not iterable")
TypeError: Column is not iterable
```

After:

```
+---------+---+---+
|     list|num| aa|
+---------+---+---+
|[1, 2, 3]|  1|  1|
|[4, 5, 6]|  2|  5|
|[7, 8, 9]|  3|  9|
+---------+---+---+
```

### How was this patch tested?

Manually tested against literal, Python native types, and PySpark column.

Closes #25950 from HyukjinKwon/SPARK-29240.

Authored-by: HyukjinKwon <gurwls223@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-09-27 11:04:55 -07:00
Maxim Gekk 4bffcf5a34 [SPARK-29275][SQL][DOC] Describe special date/timestamp values in the SQL migration guide
### What changes were proposed in this pull request?

Updated the SQL migration guide regarding to recently supported special date and timestamp values, see https://github.com/apache/spark/pull/25716 and https://github.com/apache/spark/pull/25708.

Closes #25834

### Why are the changes needed?
To let users know about new feature in Spark 3.0.

### Does this PR introduce any user-facing change?
No

Closes #25948 from MaxGekk/special-values-migration-guide.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-09-27 10:36:20 -07:00
angerszhu cc852d4eec [SPARK-29015][SQL][TEST-HADOOP3.2] Reset class loader after initializing SessionState for built-in Hive 2.3
### What changes were proposed in this pull request?

Hive 2.3 will set a new UDFClassLoader to hiveConf.classLoader when initializing SessionState since HIVE-11878,  and
1. ADDJarCommand will add jars to clientLoader.classLoader.
2. --jar passed jar will be added to clientLoader.classLoader
3.  jar passed by hive conf  `hive.aux.jars`  [SPARK-28954](https://github.com/apache/spark/pull/25653) [SPARK-28840](https://github.com/apache/spark/pull/25542) will be added to clientLoader.classLoader too

For these  reason we cannot load the jars added by ADDJarCommand because of class loader got changed. We reset it to clientLoader.ClassLoader here.

### Why are the changes needed?
support for jdk11

### Does this PR introduce any user-facing change?
NO

### How was this patch tested?
UT
```
export JAVA_HOME=/usr/lib/jdk-11.0.3
export PATH=$JAVA_HOME/bin:$PATH

build/sbt -Phive-thriftserver -Phadoop-3.2

hive/test-only *HiveSparkSubmitSuite -- -z "SPARK-8368: includes jars passed in through --jars"
hive-thriftserver/test-only *HiveThriftBinaryServerSuite -- -z "test add jar"
```

Closes #25775 from AngersZhuuuu/SPARK-29015-STS-JDK11.

Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-09-27 10:23:56 -05:00
Maxim Gekk 4dd0066d40 [SPARK-21914][SQL][TESTS] Check results of expression examples
### What changes were proposed in this pull request?

New test compares outputs of expression examples in comments with results of `hiveResultString()`. Also I fixed existing examples where actual and expected outputs are different.

### Why are the changes needed?
This prevents mistakes in expression examples, and fixes existing mistakes in comments.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Add new test to `SQLQuerySuite`.

Closes #25942 from MaxGekk/run-expr-examples.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-09-27 21:30:37 +09:00
Wang Shuo bd28e8e179 [SPARK-29213][SQL] Generate extra IsNotNull predicate in FilterExec
### What changes were proposed in this pull request?
Currently the behavior of getting output and generating null checks in `FilterExec` is different. Thus some nullable attribute could be treated as not nullable by mistake.

In `FilterExec.ouput`, an attribute is marked as nullable or not by finding its `exprId` in notNullAttributes:
```
a.nullable && notNullAttributes.contains(a.exprId)
```
But in `FilterExec.doConsume`,  a `nullCheck` is generated or not for a predicate is decided by whether there is semantic equal not null predicate:
```
      val nullChecks = c.references.map { r =>
        val idx = notNullPreds.indexWhere { n => n.asInstanceOf[IsNotNull].child.semanticEquals(r)}
        if (idx != -1 && !generatedIsNotNullChecks(idx)) {
          generatedIsNotNullChecks(idx) = true
          // Use the child's output. The nullability is what the child produced.
          genPredicate(notNullPreds(idx), input, child.output)
        } else {
          ""
        }
      }.mkString("\n").trim
```
NPE will happen when run the SQL below:
```
sql("create table table1(x string)")
sql("create table table2(x bigint)")
sql("create table table3(x string)")
sql("insert into table2 select null as x")
sql(
  """
    |select t1.x
    |from (
    |    select x from table1) t1
    |left join (
    |    select x from (
    |        select x from table2
    |        union all
    |        select substr(x,5) x from table3
    |    ) a
    |    where length(x)>0
    |) t3
    |on t1.x=t3.x
  """.stripMargin).collect()
```
NPE Exception:
```
java.lang.NullPointerException
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIteratorForCodegenStage2.processNext(generated.java:40)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anon$1.hasNext(WholeStageCodegenExec.scala:726)
    at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:458)
    at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:135)
    at org.apache.spark.shuffle.ShuffleWriteProcessor.write(ShuffleWriteProcessor.scala:59)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:94)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:52)
    at org.apache.spark.scheduler.Task.run(Task.scala:127)
    at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:449)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1377)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:452)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
```
the generated code:
```
== Subtree 4 / 5 ==
*(2) Project [cast(x#7L as string) AS x#9]
+- *(2) Filter ((length(cast(x#7L as string)) > 0) AND isnotnull(cast(x#7L as string)))
   +- Scan hive default.table2 [x#7L], HiveTableRelation `default`.`table2`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [x#7L]

Generated code:
/* 001 */ public Object generate(Object[] references) {
/* 002 */   return new GeneratedIteratorForCodegenStage2(references);
/* 003 */ }
/* 004 */
/* 005 */ // codegenStageId=2
/* 006 */ final class GeneratedIteratorForCodegenStage2 extends org.apache.spark.sql.execution.BufferedRowIterator {
/* 007 */   private Object[] references;
/* 008 */   private scala.collection.Iterator[] inputs;
/* 009 */   private scala.collection.Iterator inputadapter_input_0;
/* 010 */   private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter[] filter_mutableStateArray_0 = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter[2];
/* 011 */
/* 012 */   public GeneratedIteratorForCodegenStage2(Object[] references) {
/* 013 */     this.references = references;
/* 014 */   }
/* 015 */
/* 016 */   public void init(int index, scala.collection.Iterator[] inputs) {
/* 017 */     partitionIndex = index;
/* 018 */     this.inputs = inputs;
/* 019 */     inputadapter_input_0 = inputs[0];
/* 020 */     filter_mutableStateArray_0[0] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(1, 0);
/* 021 */     filter_mutableStateArray_0[1] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(1, 32);
/* 022 */
/* 023 */   }
/* 024 */
/* 025 */   protected void processNext() throws java.io.IOException {
/* 026 */     while ( inputadapter_input_0.hasNext()) {
/* 027 */       InternalRow inputadapter_row_0 = (InternalRow) inputadapter_input_0.next();
/* 028 */
/* 029 */       do {
/* 030 */         boolean inputadapter_isNull_0 = inputadapter_row_0.isNullAt(0);
/* 031 */         long inputadapter_value_0 = inputadapter_isNull_0 ?
/* 032 */         -1L : (inputadapter_row_0.getLong(0));
/* 033 */
/* 034 */         boolean filter_isNull_2 = inputadapter_isNull_0;
/* 035 */         UTF8String filter_value_2 = null;
/* 036 */         if (!inputadapter_isNull_0) {
/* 037 */           filter_value_2 = UTF8String.fromString(String.valueOf(inputadapter_value_0));
/* 038 */         }
/* 039 */         int filter_value_1 = -1;
/* 040 */         filter_value_1 = (filter_value_2).numChars();
/* 041 */
/* 042 */         boolean filter_value_0 = false;
/* 043 */         filter_value_0 = filter_value_1 > 0;
/* 044 */         if (!filter_value_0) continue;
/* 045 */
/* 046 */         boolean filter_isNull_6 = inputadapter_isNull_0;
/* 047 */         UTF8String filter_value_6 = null;
/* 048 */         if (!inputadapter_isNull_0) {
/* 049 */           filter_value_6 = UTF8String.fromString(String.valueOf(inputadapter_value_0));
/* 050 */         }
/* 051 */         if (!(!filter_isNull_6)) continue;
/* 052 */
/* 053 */         ((org.apache.spark.sql.execution.metric.SQLMetric) references[0] /* numOutputRows */).add(1);
/* 054 */
/* 055 */         boolean project_isNull_0 = false;
/* 056 */         UTF8String project_value_0 = null;
/* 057 */         if (!false) {
/* 058 */           project_value_0 = UTF8String.fromString(String.valueOf(inputadapter_value_0));
/* 059 */         }
/* 060 */         filter_mutableStateArray_0[1].reset();
/* 061 */
/* 062 */         filter_mutableStateArray_0[1].zeroOutNullBytes();
/* 063 */
/* 064 */         if (project_isNull_0) {
/* 065 */           filter_mutableStateArray_0[1].setNullAt(0);
/* 066 */         } else {
/* 067 */           filter_mutableStateArray_0[1].write(0, project_value_0);
/* 068 */         }
/* 069 */         append((filter_mutableStateArray_0[1].getRow()));
/* 070 */
/* 071 */       } while(false);
/* 072 */       if (shouldStop()) return;
/* 073 */     }
/* 074 */   }
/* 075 */
/* 076 */ }

```

This PR proposes to use semantic comparison both in `FilterExec.output` and `FilterExec.doConsume` for nullable attribute.

With this PR, the generated code snippet is below:
```
== Subtree 2 / 5 ==
*(3) Project [substring(x#8, 5, 2147483647) AS x#5]
+- *(3) Filter ((length(substring(x#8, 5, 2147483647)) > 0) AND isnotnull(substring(x#8, 5, 2147483647)))
   +- Scan hive default.table3 [x#8], HiveTableRelation `default`.`table3`, org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, [x#8]

Generated code:
/* 001 */ public Object generate(Object[] references) {
/* 002 */   return new GeneratedIteratorForCodegenStage3(references);
/* 003 */ }
/* 004 */
/* 005 */ // codegenStageId=3
/* 006 */ final class GeneratedIteratorForCodegenStage3 extends org.apache.spark.sql.execution.BufferedRowIterator {
/* 007 */   private Object[] references;
/* 008 */   private scala.collection.Iterator[] inputs;
/* 009 */   private scala.collection.Iterator inputadapter_input_0;
/* 010 */   private org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter[] filter_mutableStateArray_0 = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter[2];
/* 011 */
/* 012 */   public GeneratedIteratorForCodegenStage3(Object[] references) {
/* 013 */     this.references = references;
/* 014 */   }
/* 015 */
/* 016 */   public void init(int index, scala.collection.Iterator[] inputs) {
/* 017 */     partitionIndex = index;
/* 018 */     this.inputs = inputs;
/* 019 */     inputadapter_input_0 = inputs[0];
/* 020 */     filter_mutableStateArray_0[0] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(1, 32);
/* 021 */     filter_mutableStateArray_0[1] = new org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter(1, 32);
/* 022 */
/* 023 */   }
/* 024 */
/* 025 */   protected void processNext() throws java.io.IOException {
/* 026 */     while ( inputadapter_input_0.hasNext()) {
/* 027 */       InternalRow inputadapter_row_0 = (InternalRow) inputadapter_input_0.next();
/* 028 */
/* 029 */       do {
/* 030 */         boolean inputadapter_isNull_0 = inputadapter_row_0.isNullAt(0);
/* 031 */         UTF8String inputadapter_value_0 = inputadapter_isNull_0 ?
/* 032 */         null : (inputadapter_row_0.getUTF8String(0));
/* 033 */
/* 034 */         boolean filter_isNull_0 = true;
/* 035 */         boolean filter_value_0 = false;
/* 036 */         boolean filter_isNull_2 = true;
/* 037 */         UTF8String filter_value_2 = null;
/* 038 */
/* 039 */         if (!inputadapter_isNull_0) {
/* 040 */           filter_isNull_2 = false; // resultCode could change nullability.
/* 041 */           filter_value_2 = inputadapter_value_0.substringSQL(5, 2147483647);
/* 042 */
/* 043 */         }
/* 044 */         boolean filter_isNull_1 = filter_isNull_2;
/* 045 */         int filter_value_1 = -1;
/* 046 */
/* 047 */         if (!filter_isNull_2) {
/* 048 */           filter_value_1 = (filter_value_2).numChars();
/* 049 */         }
/* 050 */         if (!filter_isNull_1) {
/* 051 */           filter_isNull_0 = false; // resultCode could change nullability.
/* 052 */           filter_value_0 = filter_value_1 > 0;
/* 053 */
/* 054 */         }
/* 055 */         if (filter_isNull_0 || !filter_value_0) continue;
/* 056 */         boolean filter_isNull_8 = true;
/* 057 */         UTF8String filter_value_8 = null;
/* 058 */
/* 059 */         if (!inputadapter_isNull_0) {
/* 060 */           filter_isNull_8 = false; // resultCode could change nullability.
/* 061 */           filter_value_8 = inputadapter_value_0.substringSQL(5, 2147483647);
/* 062 */
/* 063 */         }
/* 064 */         if (!(!filter_isNull_8)) continue;
/* 065 */
/* 066 */         ((org.apache.spark.sql.execution.metric.SQLMetric) references[0] /* numOutputRows */).add(1);
/* 067 */
/* 068 */         boolean project_isNull_0 = true;
/* 069 */         UTF8String project_value_0 = null;
/* 070 */
/* 071 */         if (!inputadapter_isNull_0) {
/* 072 */           project_isNull_0 = false; // resultCode could change nullability.
/* 073 */           project_value_0 = inputadapter_value_0.substringSQL(5, 2147483647);
/* 074 */
/* 075 */         }
/* 076 */         filter_mutableStateArray_0[1].reset();
/* 077 */
/* 078 */         filter_mutableStateArray_0[1].zeroOutNullBytes();
/* 079 */
/* 080 */         if (project_isNull_0) {
/* 081 */           filter_mutableStateArray_0[1].setNullAt(0);
/* 082 */         } else {
/* 083 */           filter_mutableStateArray_0[1].write(0, project_value_0);
/* 084 */         }
/* 085 */         append((filter_mutableStateArray_0[1].getRow()));
/* 086 */
/* 087 */       } while(false);
/* 088 */       if (shouldStop()) return;
/* 089 */     }
/* 090 */   }
/* 091 */
/* 092 */ }
```
### Why are the changes needed?
Fix NPE bug in FilterExec.

### Does this PR introduce any user-facing change?
no

### How was this patch tested?
new UT

Closes #25902 from wangshuo128/filter-codegen-npe.

Authored-by: Wang Shuo <wangshuo128@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-09-27 15:14:17 +08:00
zhengruifeng aed7ff36f7 [SPARK-29258][ML][PYSPARK] parity between ml.evaluator and mllib.metrics
### What changes were proposed in this pull request?
1, expose `BinaryClassificationMetrics.numBins` in `BinaryClassificationEvaluator`
2, expose `RegressionMetrics.throughOrigin` in `RegressionEvaluator`
3, add metric `explainedVariance` in `RegressionEvaluator`

### Why are the changes needed?
existing function in mllib.metrics should also be exposed in ml

### Does this PR introduce any user-facing change?
yes, this PR add two expert params and one metric option

### How was this patch tested?
existing and added tests

Closes #25940 from zhengruifeng/evaluator_add_param.

Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
2019-09-27 13:30:03 +08:00
Yuanjian Li ada3ad34c6 [SPARK-29175][SQL] Make additional remote maven repository in IsolatedClientLoader configurable
### What changes were proposed in this pull request?
Added a new config "spark.sql.additionalRemoteRepositories", a comma-delimited string config of the optional additional remote maven mirror.

### Why are the changes needed?
We need to connect the Maven repositories in IsolatedClientLoader for downloading Hive jars,
end-users can set this config if the default maven central repo is unreachable.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Existing UT.

Closes #25849 from xuanyuanking/SPARK-29175.

Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-09-26 20:57:44 -07:00
uncleGen 570525f886 [SPARK-27715][SQL][UI] SQL query details in UI does not show in correct format
## What changes were proposed in this pull request?

before pr:
![image](https://user-images.githubusercontent.com/7402327/57752168-bb7e9180-771a-11e9-8757-63236ecab753.png)

after pr:
![image](https://user-images.githubusercontent.com/7402327/57752175-c802ea00-771a-11e9-96fd-aef1890b7985.png)

## How was this patch tested?

manual test

Closes #24609 from uncleGen/SPARK-27715.

Authored-by: uncleGen <hustyugm@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-09-26 22:52:22 -05:00
sev7e0 cd04607785 [SPARK-29246][CORE] Remove unnecessary imports in core module
### What changes were proposed in this pull request?

Remove unnecessary imports in `core` module.

### Why are the changes needed?

Clean  code for Apache Spark 3.0.0.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Local test.

Closes #25927 from sev7e0/dev_0925.

Authored-by: sev7e0 <sev7e0@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-09-26 20:43:08 -07:00
Huaxin Gao bdc4943b9e [SPARK-29142][PYTHON][ML] Pyspark clustering models support column setters/getters/predict
### What changes were proposed in this pull request?
Add the following Params classes in Pyspark clustering
```GaussianMixtureParams```
```KMeansParams```
```BisectingKMeansParams```
```LDAParams```
```PowerIterationClusteringParams```

### Why are the changes needed?
To be consistent with scala side

### Does this PR introduce any user-facing change?
Yes. Add the following changes:
```
GaussianMixtureModel
- get/setMaxIter
- get/setFeaturesCol
- get/setSeed
- get/setPredictionCol
- get/setProbabilityCol
- get/setTol
- predict
```

```
KMeansModel
- get/setMaxIter
- get/setFeaturesCol
- get/setSeed
- get/setPredictionCol
- get/setDistanceMeasure
- get/setTol
- predict
```

```
BisectingKMeansModel
- get/setMaxIter
- get/setFeaturesCol
- get/setSeed
- get/setPredictionCol
- get/setDistanceMeasure
- predict
```

```
LDAModel(HasMaxIter, HasFeaturesCol, HasSeed, HasCheckpointInterval):
- get/setMaxIter
- get/setFeaturesCol
- get/setSeed
- get/setCheckpointInterval
```

### How was this patch tested?
Add doctests

Closes #25859 from huaxingao/spark-29142.

Authored-by: Huaxin Gao <huaxing@us.ibm.com>
Signed-off-by: zhengruifeng <ruifengz@foxmail.com>
2019-09-27 11:19:02 +08:00
Rahij Ramsharan 9f3c82163a [SPARK-29259][SQL] call fs.exists only when necessary
### What changes were proposed in this pull request?

Call fs.exists only when necessary in InsertIntoHadoopFsRelationCommand.

### Why are the changes needed?

When saving a dataframe into Hadoop, spark first checks if the file exists before inspecting the SaveMode to determine if it should actually insert data. However, the pathExists variable is actually not used in the case of SaveMode.Append. In some file systems, the exists call can be expensive and hence this PR makes that call only when necessary.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Existing unit tests should cover it since this doesn't change the behavior.

Closes #25928 from rahij/rr/exists-upstream.

Authored-by: Rahij Ramsharan <rramsharan@palantir.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-09-26 15:46:31 -07:00
sandeep katta 103de96059 [SPARK-29202][DEPLOY] Driver java options are not passed to driver process in Yarn client mode
### What changes were proposed in this pull request?
`--driver-java-options` is not passed to driver process if the user runs the application in **Yarn client** mode

Run the below command

```
./bin/spark-sql --master yarn \
--driver-java-options="-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5555"
```

**In Spark 2.4.4**
```
java ... -agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5555
org.apache.spark.deploy.SparkSubmit --master yarn --conf spark.driver.extraJavaOptions=-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5555 ...
```

**In Spark 3.0**
```
java ...
org.apache.spark.deploy.SparkSubmit --master yarn --conf spark.driver.extraJavaOptions=-agentlib:jdwp=transport=dt_socket,server=y,suspend=n,address=5556 ...
```

This issue is caused by [SPARK-28980](https://github.com/apache/spark/pull/25684/files#diff-75e0f814aa3717db995fa701883dc4e1R395)

### Why are the changes needed?
Corrected the `isClientMode`  API implementation

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Manually,

![image](https://user-images.githubusercontent.com/35216143/65383114-c92dce80-dd2d-11e9-86c1-60e6d7e09f1e.png)

Closes #25889 from sandeep-katta/yarnmode.

Authored-by: sandeep katta <sandeep.katta2007@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-09-26 15:37:20 -07:00
Jungtaek Lim (HeartSaVioR) d3679a9782 [SPARK-27748][SS][FOLLOWUP] Correct the order of logging token as debug log
### What changes were proposed in this pull request?

This patch fixes the order of elements while logging token. Header columns are printed as

```
"TOKENID", "HMAC", "OWNER", "RENEWERS", "ISSUEDATE", "EXPIRYDATE", "MAXDATE"
```

whereas the code prints out actual information as

```
"HMAC"(redacted), "TOKENID", "OWNER", "RENEWERS", "ISSUEDATE", "EXPIRYDATE", "MAXDATE"
```

This patch fixes this.

### Why are the changes needed?

Not critical but it doesn't line up with header columns.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

N/A, as it's only logged as debug and it's obvious what/where is the problem and how it can be fixed.

Closes #25935 from HeartSaVioR/SPARK-27748-FOLLOWUP.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com>
Signed-off-by: Marcelo Vanzin <vanzin@cloudera.com>
2019-09-26 08:32:03 -07:00
Tomoko Komiyama 8beb736a00 [SPARK-29256][DOCS] Fix typo in building document
### What changes were proposed in this pull request?
 Changed 'Phive-thriftserver' to ' -Phive-thriftserver'.

### Why are the changes needed?
 Typo

### Does this PR introduce any user-facing change?
Yes.

### How was this patch tested?
Manually tested.

Closes #25937 from TomokoKomiyama/fix-build-doc.

Authored-by: Tomoko Komiyama <btkomiyamatm@oss.nttdata.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-09-26 08:23:43 -05:00
Gengliang Wang a1213d5f96 [SPARK-28997][SQL] Add spark.sql.dialect
### What changes were proposed in this pull request?

After https://github.com/apache/spark/pull/25158 and https://github.com/apache/spark/pull/25458, SQL features of PostgreSQL are introduced into Spark. AFAIK, both features are implementation-defined behaviors, which are not specified in ANSI SQL.
In such a case, this proposal is to add a configuration `spark.sql.dialect` for choosing a database dialect.
After this PR, Spark supports two database dialects, `Spark` and `PostgreSQL`. With `PostgreSQL` dialect, Spark will:
1. perform integral division with the / operator if both sides are integral types;
2. accept "true", "yes", "1", "false", "no", "0", and unique prefixes as input and trim input for the boolean data type.

### Why are the changes needed?

Unify the external database dialect with one configuration, instead of small flags.

### Does this PR introduce any user-facing change?

A new configuration `spark.sql.dialect` for choosing a database dialect.

### How was this patch tested?

Existing tests.

Closes #25697 from gengliangwang/dialect.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-09-26 21:00:27 +08:00
Gengliang Wang 66c9dc316a [SPARK-29255][SQL][TESTS] Rename package pgSQL to postgreSQL
### What changes were proposed in this pull request?

Rename the package pgSQL to postgreSQL

### Why are the changes needed?

To address the comment in https://github.com/apache/spark/pull/25697#discussion_r328431070 . The official full name seems more reasonable.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Existing unit tests.

Closes #25936 from gengliangwang/renamePGSQL.

Authored-by: Gengliang Wang <gengliang.wang@databricks.com>
Signed-off-by: Yuming Wang <wgyumg@gmail.com>
2019-09-26 05:36:15 -07:00
Burak Yavuz c8159c7941 [SPARK-29197][SQL] Remove saveModeForDSV2 from DataFrameWriter
### What changes were proposed in this pull request?

It is very confusing that the default save mode is different between the internal implementation of a Data source. The reason that we had to have saveModeForDSV2 was that there was no easy way to check the existence of a Table in DataSource v2. Now, we have catalogs for that. Therefore we should be able to remove the different save modes. We also have a plan forward for `save`, where we can't really check the existence of a table, and therefore create one. That will come in a future PR.

### Why are the changes needed?

Because it is confusing that the internal implementation of a data source (which is generally non-obvious to users) decides which default save mode is used within Spark.

### Does this PR introduce any user-facing change?

It changes the default save mode for V2 Tables in the DataFrameWriter APIs

### How was this patch tested?

Existing tests

Closes #25876 from brkyvz/removeSM.

Lead-authored-by: Burak Yavuz <brkyvz@gmail.com>
Co-authored-by: Burak Yavuz <burak@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-09-26 15:20:04 +08:00
Liang-Chi Hsieh b8b59d6fa3 [SPARK-29239][SPARK-29221][SQL] Subquery should not cause NPE when eliminating subexpression
### What changes were proposed in this pull request?

This patch proposes to skip PlanExpression when doing subexpression elimination on executors.

### Why are the changes needed?

Subexpression elimination can possibly cause NPE when applying on execution subquery expression like ScalarSubquery on executors. It is because PlanExpression wraps query plan. To compare query plan on executor when eliminating subexpression, can cause unexpected error, like NPE when accessing transient fields.

The NPE looks like:
```
[info] - SPARK-29239: Subquery should not cause NPE when eliminating subexpression *** FAILED *** (175 milliseconds)
[info]   org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1395.0 failed 1 times, most recent failure: Lost task 0.0 in stage 1395.0 (TID   3447, 10.0.0.196, executor driver): java.lang.NullPointerException
[info]  at org.apache.spark.sql.execution.LocalTableScanExec.stringArgs(LocalTableScanExec.scala:62)
[info]  at org.apache.spark.sql.catalyst.trees.TreeNode.argString(TreeNode.scala:506)
[info]  at org.apache.spark.sql.catalyst.trees.TreeNode.simpleString(TreeNode.scala:534)
[info]  at org.apache.spark.sql.catalyst.plans.QueryPlan.simpleString(QueryPlan.scala:179)
[info]  at org.apache.spark.sql.catalyst.plans.QueryPlan.verboseString(QueryPlan.scala:181)
[info]  at org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:647)
[info]  at org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:675)
[info]  at org.apache.spark.sql.catalyst.trees.TreeNode.generateTreeString(TreeNode.scala:675)
[info]  at org.apache.spark.sql.catalyst.trees.TreeNode.treeString(TreeNode.scala:569)
[info]  at org.apache.spark.sql.catalyst.trees.TreeNode.treeString(TreeNode.scala:559)
[info]  at org.apache.spark.sql.catalyst.trees.TreeNode.treeString(TreeNode.scala:551)
[info]  at org.apache.spark.sql.catalyst.trees.TreeNode.toString(TreeNode.scala:548)
[info]  at org.apache.spark.sql.catalyst.errors.package$TreeNodeException.<init>(package.scala:36)
[info]  at org.apache.spark.sql.catalyst.errors.package$.attachTree(package.scala:56)
[info]  at org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:436)
[info]  at org.apache.spark.sql.catalyst.trees.TreeNode.makeCopy(TreeNode.scala:425)
[info]  at org.apache.spark.sql.execution.SparkPlan.makeCopy(SparkPlan.scala:102)
[info]  at org.apache.spark.sql.execution.SparkPlan.makeCopy(SparkPlan.scala:63)
[info]  at org.apache.spark.sql.catalyst.plans.QueryPlan.mapExpressions(QueryPlan.scala:132)
[info]  at org.apache.spark.sql.catalyst.plans.QueryPlan.doCanonicalize(QueryPlan.scala:261)
```

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Added unit test.

Closes #25925 from viirya/SPARK-29239.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-09-26 13:55:01 +08:00
Ryan Blue 6a4235aee7 [SPARK-29249][SQL] V2 writer: Don't allow tableProperty for existing tables
### What changes were proposed in this pull request?

Don't allow calling append, overwrite, or overwritePartitions after tableProperty is used in DataFrameWriterV2 because table properties are not set as part of operations on existing tables. Only tables that are created or replaced can set table properties.

### Why are the changes needed?

The properties are discarded otherwise, so this avoids confusing behavior.

### Does this PR introduce any user-facing change?

Yes, but to a new API, DataFrameWriterV2.

### How was this patch tested?

Removed test cases that used this method and the append, etc. methods because they no longer compile.

Closes #25931 from rdblue/fix-dfw-v2-table-properties.

Authored-by: Ryan Blue <blue@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-09-26 12:41:34 +08:00
Maxim Gekk 21db2f86f7 [SPARK-29237][SQL] Prevent real function names in expression example template
### What changes were proposed in this pull request?

In the PR, I propose to replace function names in some expression examples by `_FUNC_`, and add a test to check that `_FUNC_` always present in all examples.

### Why are the changes needed?
Binding of a function name to an expression is performed in `FunctionRegistry` which is single source of truth. Expression examples should avoid using function name directly because this can make the examples invalid in the future.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Added new test to `SQLQuerySuite` which analyses expression example, and check presence of `_FUNC_`.

Closes #25924 from MaxGekk/fix-func-examples.

Authored-by: Maxim Gekk <max.gekk@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-09-25 15:16:00 -07:00
Jungtaek Lim (HeartSaVioR) a1b90bfc0f [SPARK-23197][STREAMING][TESTS] Fix ReceiverSuite."receiver_life_cycle" to not rely on timing
### What changes were proposed in this pull request?

This patch changes ReceiverSuite."receiver_life_cycle" to record actual calls with timestamp in FakeReceiver/FakeReceiverSupervisor, which doesn't rely on timing of stopping and starting receiver in restarting receiver. It enables us to give enough huge timeout on verification of restart as we can verify both stopping and starting together.

### Why are the changes needed?

The test is flaky without this patch. We increased timeout to fix flakyness of this test (15adcc8273) but even with longer timeout it has been still failing intermittently.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

I've reproduced test failure artificially via below diff:

```
diff --git a/streaming/src/main/scala/org/apache/spark/streaming/receiver/ReceiverSupervisor.scala b/streaming/src/main/scala/org/apache/spark/streaming/receiver/ReceiverSupervisor.scala
index faf6db82d5..d8977543c0 100644
--- a/streaming/src/main/scala/org/apache/spark/streaming/receiver/ReceiverSupervisor.scala
+++ b/streaming/src/main/scala/org/apache/spark/streaming/receiver/ReceiverSupervisor.scala
 -191,9 +191,11  private[streaming] abstract class ReceiverSupervisor(
       // thread pool.
       logWarning("Restarting receiver with delay " + delay + " ms: " + message,
         error.getOrElse(null))
+      Thread.sleep(1000)
       stopReceiver("Restarting receiver with delay " + delay + "ms: " + message, error)
       logDebug("Sleeping for " + delay)
       Thread.sleep(delay)
+      Thread.sleep(1000)
       logInfo("Starting receiver again")
       startReceiver()
       logInfo("Receiver started again")
```

and confirmed this patch doesn't fail with the change.

Closes #25862 from HeartSaVioR/SPARK-23197-v2.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-09-25 10:59:08 -07:00
Xianyang Liu e07cbbe9c9 [SPARK-29236][CORE] Access 'executorDataMap' out of 'DriverEndpoint' should be protected by lock
### What changes were proposed in this pull request?

Protected the `executorDataMap` under lock when accessing it out of 'DriverEndpoint''s methods.

### Why are the changes needed?

Just as the comments:

>

// Accessing `executorDataMap` in `DriverEndpoint.receive/receiveAndReply` doesn't need any
// protection. But accessing `executorDataMap` out of `DriverEndpoint.receive/receiveAndReply`
// must be protected by `CoarseGrainedSchedulerBackend.this`. Besides, `executorDataMap` should
// only be modified in `DriverEndpoint.receive/receiveAndReply` with protection by
// `CoarseGrainedSchedulerBackend.this`.

`executorDataMap` is not threadsafe, it should be protected by lock when accessing it out of `DriverEndpoint`

### Does this PR introduce any user-facing change?

NO

### How was this patch tested?

Existed UT.

Closes #25922 from ConeyLiu/executorDataMap.

Authored-by: Xianyang Liu <xianyang.liu@intel.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-09-25 22:38:59 +08:00
Tomoko Komiyama 58989cd1b0 [SPARK-29168][WEBUI][FOLLOW-UP] Use a dark colors on selected Executor removed on timeline view
### What changes were proposed in this pull request?
Changed Executor color settings in timeline-view.css

### Why are the changes needed?
In WebUI, color of executor changes to dark blue when you click it. It might be confused user because of the color.

[ Before ]
![r_before](https://user-images.githubusercontent.com/55128575/65562403-48671080-df81-11e9-98ff-193e3f058cec.png)

[ After ]
![r_after](https://user-images.githubusercontent.com/55128575/65562427-62085800-df81-11e9-8465-6fb9a966d0bc.png)

### Does this PR introduce any user-facing change?
Yes.

### How was this patch tested?
Manually test.

Closes #25921 from TomokoKomiyama/fix-js-2.

Authored-by: Tomoko Komiyama <btkomiyamatm@oss.nttdata.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-09-25 05:15:29 -07:00
Wenchen Fan a36a7235db [SPARK-29215][SQL] current namespace should be tracked in SessionCatalog if the current catalog is session catalog
### What changes were proposed in this pull request?

when the current catalog is session catalog, get/set the current namespace from/to the `SessionCatalog`.

### Why are the changes needed?

It's super confusing that we don't have a single source of truth for the current namespace of the session catalog. It can be in `CatalogManager` or `SessionCatalog`.

Ideally, we should always track the current catalog/namespace in `CatalogManager`. However, there are many commands that do not support v2 catalog API. They ignore the current catalog in `CatalogManager` and blindly go to `SessionCatalog`. This means, we must keep track of the current namespace of session catalog even if the current catalog is not session catalog.

Thus, we can't use `CatalogManager` to track the current namespace of session catalog because it changes when the current catalog is changed. To keep single source of truth, we should only track the current namespace of session catalog in `SessionCatalog`.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Newly added and updated test cases.

Closes #25903 from cloud-fan/current.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Gengliang Wang <gengliang.wang@databricks.com>
2019-09-25 17:01:36 +08:00
WeichenXu d8b0914c2e [SPARK-28957][SQL] Copy any "spark.hive.foo=bar" spark properties into hadoop conf as "hive.foo=bar"
### What changes were proposed in this pull request?

Copy any "spark.hive.foo=bar" spark properties into hadoop conf as "hive.foo=bar"

### Why are the changes needed?
Providing spark side config entry for hive configurations.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
UT.

Closes #25661 from WeichenXu123/add_hive_conf.

Authored-by: WeichenXu <weichen.xu@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-09-25 15:54:44 +08:00
gengjiaan eef3abbb90 [SPARK-29226][BUILD] Upgrade jackson-databind to 2.9.10 and fix vulnerabilities
### What changes were proposed in this pull request?
The current code uses com.fasterxml.jackson.core:jackson-databind:jar:2.9.9.3 and it will cause a security vulnerabilities. We could get some security info from https://www.tenable.com/cve/CVE-2019-16335 and https://www.tenable.com/cve/CVE-2019-14540

This reference remind to upgrate the version of `jackson-databind` to 2.9.10 or later.

This PR also upgrade the version of jackson to 2.9.10.

### Why are the changes needed?
This PR fix the security vulnerabilities.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Exists UT.

Closes #25912 from beliefer/upgrade-jackson.

Authored-by: gengjiaan <gengjiaan@360.cn>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-09-24 22:05:13 -07:00
sev7e0 e650f8fdba [SPARK-29230][CORE][TEST] Fix NPE in ProcfsMetricsGetterSuite
### What changes were proposed in this pull request?

When I use `ProcfsMetricsGetterSuite for` testing, always throw out `java.lang.NullPointerException`. I think there is a problem with locating `new ProcfsMetricsGetter`, which will lead to `SparkEnv` not being initialized in time. This leads to `java.lang.NullPointerException` when the method is executed.

### Why are the changes needed?
For test.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Local testing

Closes #25918 from sev7e0/dev_0924.

Authored-by: sev7e0 <sev7e0@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-09-24 14:09:40 -07:00
Gabor Somogyi d75588c57a [SPARK-29082][CORE] Skip delegation token generation if no credentials are available
This PR is an enhanced version of https://github.com/apache/spark/pull/25805 so I've kept the original text. The problem with the original PR can be found in comment.

This situation can happen when an external system (e.g. Oozie) generates
delegation tokens for a Spark application. The Spark driver will then run
against secured services, have proper credentials (the tokens), but no
kerberos credentials. So trying to do things that requires a kerberos
credential fails.

Instead, if no kerberos credentials are detected, just skip the whole
delegation token code.

Tested with an application that simulates Oozie; fails before the fix,
passes with the fix. Also with other DT-related tests to make sure other
functionality keeps working.

Closes #25901 from gaborgsomogyi/SPARK-29082.

Authored-by: Gabor Somogyi <gabor.g.somogyi@gmail.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-09-24 11:12:26 -07:00
Yuanjian Li b3e9be470c [SPARK-29229][SQL] Change the additional remote repository in IsolatedClientLoader to google minor
### What changes were proposed in this pull request?
Change the remote repo used in IsolatedClientLoader from datanucleus to google mirror.

### Why are the changes needed?
We need to connect the Maven repositories in IsolatedClientLoader for downloading Hive jars. The repository currently used is "http://www.datanucleus.org/downloads/maven2", which is [no longer maintained](http://www.datanucleus.org:15080/downloads/maven2/README.txt). This will cause downloading failure and make hive test cases flaky while Jenkins host is blocked by maven central repo.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Existing UT.

Closes #25915 from xuanyuanking/SPARK-29229.

Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-09-25 00:49:50 +08:00
zhengruifeng fff2e847c2 [SPARK-29095][ML] add extractInstances
### What changes were proposed in this pull request?
common methods support extract weights

### Why are the changes needed?
today more and more ML algs support weighting, add this method will make impls simple

### Does this PR introduce any user-facing change?
no

### How was this patch tested?
existing testsuites

Closes #25802 from zhengruifeng/add_extractInstances.

Authored-by: zhengruifeng <ruifengz@foxmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-09-24 09:24:10 -05:00
Xiao Li 7c02c143aa [SPARK-28292][SQL] Enable Injection of User-defined Hint
### What changes were proposed in this pull request?
Move the rule `RemoveAllHints` after the batch `Resolution`.

### Why are the changes needed?
User-defined hints can be resolved by the rules injected via `extendedResolutionRules` or `postHocResolutionRules`.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Added a test case

Closes #25746 from gatorsmile/moveRemoveAllHints.

Authored-by: Xiao Li <gatorsmile@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-09-24 18:04:17 +08:00
sheepstop 81de9d3c29 [SPARK-28678][DOC] Specify that array indices start at 1 for function slice in R Scala Python
### What changes were proposed in this pull request?
Added "array indices start at 1" in annotation to make it clear for the usage of function slice, in R Scala Python component

### Why are the changes needed?
It will throw exception if the value stare is 0, but array indices start at 0 most of times in other scenarios.

### Does this PR introduce any user-facing change?
Yes, more info provided to user.

### How was this patch tested?
No tests added, only doc change.

Closes #25704 from sheepstop/master.

Authored-by: sheepstop <yangting617@gmail.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>
2019-09-24 18:57:54 +09:00
Yuming Wang b8b67ae92d [SPARK-28527][SQL][TEST] Enable ThriftServerQueryTestSuite
### What changes were proposed in this pull request?

This PR enable `ThriftServerQueryTestSuite` and fix previously flaky test by:
1. Start thriftserver in `beforeAll()`.
2. Disable `spark.sql.hive.thriftServer.async`.

### Why are the changes needed?

Improve test coverage.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?

```shell
build/sbt "hive-thriftserver/test-only *.ThriftServerQueryTestSuite "  -Phive-thriftserver
build/mvn -Dtest=none -DwildcardSuites=org.apache.spark.sql.hive.thriftserver.ThriftServerQueryTestSuite test  -Phive-thriftserver
```

Closes #25868 from wangyum/SPARK-28527-enable.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Yuming Wang <wgyumg@gmail.com>
2019-09-24 00:44:33 -07:00
TomokoKomiyama cb72b10b91 [SPARK-29168][WEBUI] Use a unique color on selected item on timeline view
### What changes were proposed in this pull request?

Changed color settings in .vis-timeline .vis-item.executor.vis-selected (timeline-view.css)

### Why are the changes needed?

In WebUI, executor bar's color changes blue to green when you click it. It might be confused user because of the color.

[ Before ]
![before](https://user-images.githubusercontent.com/55128575/65487629-40a45f00-dee2-11e9-8974-dc7027824b52.png)

[ After ]
![after](https://user-images.githubusercontent.com/55128575/65487674-5580f280-dee2-11e9-8e70-28f4ddcf56c3.png)

### Does this PR introduce any user-facing change?

Yes.

### How was this patch tested?

Manually test.

Closes #25846 from TomokoKomiyama/fix-js.

Authored-by: TomokoKomiyama <btkomiyamatm@oss.nttdata.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-09-24 00:15:54 -07:00
Kousuke Saruta 7c8596823a [SPARK-29218][WEBUI] Increase Show Additional Metrics checkbox width in StagePage
### What changes were proposed in this pull request?

Modified widths of some checkboxes in StagePage.

### Why are the changes needed?

When we increase the font size of the browsers or the default font size is big, the labels of checkbox of `Show Additional Metrics` in `StagePage` are wrapped like as follows.

![before-modified1](https://user-images.githubusercontent.com/4736016/65449180-634c5e80-de75-11e9-9f27-88f4cc1313b7.png)
![before-modified2](https://user-images.githubusercontent.com/4736016/65449182-63e4f500-de75-11e9-96b8-46e92a61f40c.png)

### Does this PR introduce any user-facing change?

Yes.

### How was this patch tested?

Run the following and visit the `Stage Detail` page. Then, increase the font size.
```
$ bin/spark-shell
...
scala> spark.range(100000).groupBy("id").count.collect
```

Closes #25905 from sarutak/adjust-checkbox-width.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-09-23 23:57:08 -07:00
windpiger da7e5c4ffb [SPARK-19917][SQL] qualified partition path stored in catalog
## What changes were proposed in this pull request?

partition path should be qualified to store in catalog.
There are some scenes:
1. ALTER TABLE t PARTITION(b=1) SET LOCATION '/path/x'
   should be qualified: file:/path/x
  **Hive 2.0.0 does not support for location without schema here.**
```
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask. {0}  is not absolute or has no scheme information.  Please specify a complete absolute uri with scheme information.
```

2. ALTER TABLE t PARTITION(b=1) SET LOCATION 'x'
    should be qualified: file:/tablelocation/x
  **Hive 2.0.0 does not support for relative location here.**
3. ALTER TABLE t ADD PARTITION(b=1) LOCATION '/path/x'
    should be qualified: file:/path/x
   **the same with Hive 2.0.0**
4. ALTER TABLE t ADD PARTITION(b=1) LOCATION 'x'
     should be qualified: file:/tablelocation/x
   **the same with Hive 2.0.0**

Currently only  ALTER TABLE t ADD PARTITION(b=1) LOCATION for hive serde table has the expected qualified path. we should make other scenes to be consist with it.

Another change is for alter table location.

## How was this patch tested?
add / modify existing TestCases

Closes #17254 from windpiger/qualifiedPartitionPath.

Authored-by: windpiger <songjun@outlook.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-09-24 14:48:47 +08:00
Jungtaek Lim (HeartSaVioR) 4513f1c0dc [SPARK-26848][SQL][SS] Introduce new option to Kafka source: offset by timestamp (starting/ending)
## What changes were proposed in this pull request?

This patch introduces new options "startingOffsetsByTimestamp" and "endingOffsetsByTimestamp" to set specific timestamp per topic (since we're unlikely to set the different value per partition) to let source starts reading from offsets which have equal of greater timestamp, and ends reading until offsets which have equal of greater timestamp.

The new option would be optional of course, and take preference over existing offset options.

## How was this patch tested?

New unit tests added. Also manually tested basic functionality with Kafka 2.0.0 server.

Running query below

```
val df = spark.read.format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "spark_26848_test_v1,spark_26848_test_2_v1")
  .option("startingOffsetsByTimestamp", """{"spark_26848_test_v1": 1549669142193, "spark_26848_test_2_v1": 1549669240965}""")
  .option("endingOffsetsByTimestamp", """{"spark_26848_test_v1": 1549669265676, "spark_26848_test_2_v1": 1549699265676}""")
  .load().selectExpr("CAST(value AS STRING)")

df.show()
```

with below records (one string which number part remarks when they're put after such timestamp) in

topic `spark_26848_test_v1`
```
hello1 1549669142193
world1 1549669142193
hellow1 1549669240965
world1 1549669240965
hello1 1549669265676
world1 1549669265676
```

topic `spark_26848_test_2_v1`

```
hello2 1549669142193
world2 1549669142193
hello2 1549669240965
world2 1549669240965
hello2 1549669265676
world2 1549669265676
```

the result of `df.show()` follows:
```
+--------------------+
|               value|
+--------------------+
|world1 1549669240965|
|world1 1549669142193|
|world2 1549669240965|
|hello2 1549669240965|
|hellow1 154966924...|
|hello2 1549669265676|
|hello1 1549669142193|
|world2 1549669265676|
+--------------------+
```

Note that endingOffsets (as well as endingOffsetsByTimestamp) are exclusive.

Closes #23747 from HeartSaVioR/SPARK-26848.

Authored-by: Jungtaek Lim (HeartSaVioR) <kabhwan@gmail.com>
Signed-off-by: Sean Owen <sean.owen@databricks.com>
2019-09-23 19:25:36 -05:00
Yuming Wang c38f459059 [SPARK-29016][BUILD] Update LICENSE and NOTICE for Hive 2.3
### What changes were proposed in this pull request?
This PR update LICENSE and NOTICE for Hive 2.3. Hive 2.3 newly added jars:
```
dropwizard-metrics-hadoop-metrics2-reporter.jar
HikariCP-2.5.1.jar
hive-common-2.3.6.jar
hive-llap-common-2.3.6.jar
hive-serde-2.3.6.jar
hive-service-rpc-2.3.6.jar
hive-shims-0.23-2.3.6.jar
hive-shims-2.3.6.jar
hive-shims-common-2.3.6.jar
hive-shims-scheduler-2.3.6.jar
hive-storage-api-2.6.0.jar
hive-vector-code-gen-2.3.6.jar
javax.jdo-3.2.0-m3.jar
json-1.8.jar
transaction-api-1.1.jar
velocity-1.5.jar
```

### Why are the changes needed?
We will distribute a binary release based on Hadoop 3.2 / Hive 2.3 in future.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
N/A

Closes #25896 from wangyum/SPARK-29016.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-09-23 09:19:04 -07:00
Liang-Chi Hsieh d50f6e6344 [SPARK-25903][CORE] TimerTask should be synchronized on ContextBarrierState
### What changes were proposed in this pull request?

BarrierCoordinator sets up a TimerTask for a round of global sync. Currently the run method is synchronized on the created TimerTask. But to be synchronized with handleRequest, it should be synchronized on the ContextBarrierState object, not TimerTask object.

### Why are the changes needed?

ContextBarrierState.handleRequest and TimerTask.run both access the internal status of a ContextBarrierState object. If TimerTask doesn't be synchronized on the same ContextBarrierState object, when the timer task is triggered, handleRequest still accepts new request and modify  requesters field in the ContextBarrierState object. It makes the behavior inconsistency.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Test locally

Closes #25897 from viirya/SPARK-25903.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-09-24 00:13:38 +08:00
Yuming Wang 0c40b94ae5 [SPARK-29203][SQL][TESTS] Reduce shuffle partitions in SQLQueryTestSuite
### What changes were proposed in this pull request?
This PR reduce shuffle partitions from 200 to 4 in `SQLQueryTestSuite` to reduce testing time.

### Why are the changes needed?
Reduce testing time.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Manually tested in my local:
Before:
```
...
[info] - subquery/in-subquery/in-joins.sql (6 minutes, 19 seconds)
[info] - subquery/in-subquery/not-in-joins.sql (2 minutes, 17 seconds)
[info] - subquery/scalar-subquery/scalar-subquery-predicate.sql (45 seconds, 763 milliseconds)
...
Run completed in 1 hour, 22 minutes.
```
After:
```
...
[info] - subquery/in-subquery/in-joins.sql (1 minute, 12 seconds)
[info] - subquery/in-subquery/not-in-joins.sql (27 seconds, 541 milliseconds)
[info] - subquery/scalar-subquery/scalar-subquery-predicate.sql (17 seconds, 360 milliseconds)
...
Run completed in 47 minutes.
```

Closes #25891 from wangyum/SPARK-29203.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Yuming Wang <wgyumg@gmail.com>
2019-09-23 08:38:40 -07:00
angerszhu d22768a6be [SPARK-29036][SQL] SparkThriftServer cancel job after execute() thread interrupted
### What changes were proposed in this pull request?
Discuss in https://github.com/apache/spark/pull/25611

If cancel() and close() is called very quickly after the query is started, then they may both call cleanup() before Spark Jobs are started. Then sqlContext.sparkContext.cancelJobGroup(statementId) does nothing.
But then the execute thread can start the jobs, and only then get interrupted and exit through here. But then it will exit here, and no-one will cancel these jobs and they will keep running even though this execution has exited.

So  when execute() was interrupted by `cancel()`, when get into catch block, we should call canJobGroup again to make sure the job was canceled.

### Why are the changes needed?

### Does this PR introduce any user-facing change?
NO

### How was this patch tested?
MT

Closes #25743 from AngersZhuuuu/SPARK-29036.

Authored-by: angerszhu <angers.zhu@gmail.com>
Signed-off-by: Yuming Wang <wgyumg@gmail.com>
2019-09-23 05:47:25 -07:00
Daoyuan Wang c08bc37281 [SPARK-29177][CORE] fix zombie tasks after stage abort
### What changes were proposed in this pull request?
Do task handling even the task exceeds maxResultSize configured. More details are in the jira description https://issues.apache.org/jira/browse/SPARK-29177 .

### Why are the changes needed?
Without this patch, the zombie tasks will prevent yarn from recycle those containers running these tasks, which will affect other applications.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
unit test and production test with a very large `SELECT` in spark thriftserver.

Closes #25850 from adrian-wang/zombie.

Authored-by: Daoyuan Wang <me@daoyuan.wang>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-09-23 19:46:01 +08:00
xy_xin 655356e825 [SPARK-28892][SQL] support UPDATE in the parser and add the corresponding logical plan
### What changes were proposed in this pull request?

This PR supports UPDATE in the parser and add the corresponding logical plan. The SQL syntax is a standard UPDATE statement:
```
UPDATE tableName tableAlias SET colName=value [, colName=value]+ WHERE predicate?
```

### Why are the changes needed?

With this change, we can start to implement UPDATE in builtin sources and think about how to design the update API in DS v2.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

New test cases added.

Closes #25626 from xianyinxin/SPARK-28892.

Authored-by: xy_xin <xianyin.xxy@alibaba-inc.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-09-23 19:25:56 +08:00
Yuanjian Li f725d472f5 [SPARK-25341][CORE] Support rolling back a shuffle map stage and re-generate the shuffle files
After the newly added shuffle block fetching protocol in #24565, we can keep this work by extending the FetchShuffleBlocks message.

### What changes were proposed in this pull request?
In this patch, we achieve the indeterminate shuffle rerun by reusing the task attempt id(unique id within an application) in shuffle id, so that each shuffle write attempt has a different file name. For the indeterministic stage, when the stage resubmits, we'll clear all existing map status and rerun all partitions.

All changes are summarized as follows:
- Change the mapId to mapTaskAttemptId in shuffle related id.
- Record the mapTaskAttemptId in MapStatus.
- Still keep mapId in ShuffleFetcherIterator for fetch failed scenario.
- Add the determinate flag in Stage and use it in DAGScheduler and the cleaning work for the intermediate stage.

### Why are the changes needed?
This is a follow-up work for #22112's future improvment[1]: `Currently we can't rollback and rerun a shuffle map stage, and just fail.`

Spark will rerun a finished shuffle write stage while meeting fetch failures, currently, the rerun shuffle map stage will only resubmit the task for missing partitions and reuse the output of other partitions. This logic is fine in most scenarios, but for indeterministic operations(like repartition), multiple shuffle write attempts may write different data, only rerun the missing partition will lead a correctness bug. So for the shuffle map stage of indeterministic operations, we need to support rolling back the shuffle map stage and re-generate the shuffle files.

### Does this PR introduce any user-facing change?
Yes, after this PR, the indeterminate stage rerun will be accepted by rerunning the whole stage. The original behavior is aborting the stage and fail the job.

### How was this patch tested?
- UT: Add UT for all changing code and newly added function.
- Manual Test: Also providing a manual test to verify the effect.
```
import scala.sys.process._
import org.apache.spark.TaskContext

val determinateStage0 = sc.parallelize(0 until 1000 * 1000 * 100, 10)
val indeterminateStage1 = determinateStage0.repartition(200)
val indeterminateStage2 = indeterminateStage1.repartition(200)
val indeterminateStage3 = indeterminateStage2.repartition(100)
val indeterminateStage4 = indeterminateStage3.repartition(300)
val fetchFailIndeterminateStage4 = indeterminateStage4.map { x =>
if (TaskContext.get.attemptNumber == 0 && TaskContext.get.partitionId == 190 &&
  TaskContext.get.stageAttemptNumber == 0) {
  throw new Exception("pkill -f -n java".!!)
  }
  x
}
val indeterminateStage5 = fetchFailIndeterminateStage4.repartition(200)
val finalStage6 = indeterminateStage5.repartition(100).collect().distinct.length
```
It's a simple job with multi indeterminate stage, it will get a wrong answer while using old Spark version like 2.2/2.3, and will be killed after #22112. With this fix, the job can retry all indeterminate stage as below screenshot and get the right result.
![image](https://user-images.githubusercontent.com/4833765/63948434-3477de00-caab-11e9-9ed1-75abfe6d16bd.png)

Closes #25620 from xuanyuanking/SPARK-25341-8.27.

Authored-by: Yuanjian Li <xyliyuanjian@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
2019-09-23 16:16:52 +08:00
Takeshi Yamamuro 7a2ea58e78 [SPARK-29084][SQL][TESTS] Check method bytecode size in BenchmarkQueryTest
### What changes were proposed in this pull request?

This pr proposes to check method bytecode size in `BenchmarkQueryTest`. This metric is critical for performance numbers.

### Why are the changes needed?

For performance checks

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

N/A

Closes #25788 from maropu/CheckMethodSize.

Authored-by: Takeshi Yamamuro <yamamuro@apache.org>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-09-22 14:47:42 -07:00
Yuming Wang 51d3509428 [SPARK-28599][SQL] Fix Execution Time and Duration column sorting for ThriftServerSessionPage
### What changes were proposed in this pull request?

This PR add support sorting `Execution Time` and `Duration` columns for `ThriftServerSessionPage`.

### Why are the changes needed?

Previously, it's not sorted correctly.

### Does this PR introduce any user-facing change?

Yes.

### How was this patch tested?

Manually do the following and test sorting on those columns in the Spark Thrift Server Session Page.
```
$ sbin/start-thriftserver.sh
$ bin/beeline -u jdbc:hive2://localhost:10000
0: jdbc:hive2://localhost:10000> create table t(a int);
+---------+--+
| Result  |
+---------+--+
+---------+--+
No rows selected (0.521 seconds)
0: jdbc:hive2://localhost:10000> select * from t;
+----+--+
| a  |
+----+--+
+----+--+
No rows selected (0.772 seconds)
0: jdbc:hive2://localhost:10000> show databases;
+---------------+--+
| databaseName  |
+---------------+--+
| default       |
+---------------+--+
1 row selected (0.249 seconds)
```

**Sorted by `Execution Time` column**:
![image](https://user-images.githubusercontent.com/5399861/65387476-53038900-dd7a-11e9-885c-fca80287f550.png)

**Sorted by `Duration` column**:
![image](https://user-images.githubusercontent.com/5399861/65387481-6e6e9400-dd7a-11e9-9318-f917247efaa8.png)

Closes #25892 from wangyum/SPARK-28599.

Authored-by: Yuming Wang <yumwang@ebay.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
2019-09-22 14:12:06 -07:00