Commit graph

17220 commits

Author SHA1 Message Date
Shivansh 6c1ecb191b [SPARK-16911] Fix the links in the programming guide
## What changes were proposed in this pull request?

 Fix the broken links in the programming guide of the Graphx Migration and understanding closures

## How was this patch tested?

By running the test cases  and checking the links.

Author: Shivansh <shiv4nsh@gmail.com>

Closes #14503 from shiv4nsh/SPARK-16911.
2016-08-07 09:30:18 +01:00
keliang 1275f64696 [SPARK-16870][DOCS] Summary:add "spark.sql.broadcastTimeout" into docs/sql-programming-gu…
## What changes were proposed in this pull request?
default value for spark.sql.broadcastTimeout is 300s. and this property do not show in any docs of spark. so add "spark.sql.broadcastTimeout" into docs/sql-programming-guide.md to help people to how to fix this timeout error when it happenned

## How was this patch tested?

not need

(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

…ide.md

JIRA_ID:SPARK-16870
Description:default value for spark.sql.broadcastTimeout is 300s. and this property do not show in any docs of spark. so add "spark.sql.broadcastTimeout" into docs/sql-programming-guide.md to help people to how to fix this timeout error when it happenned
Test:done

Author: keliang <keliang@cmss.chinamobile.com>

Closes #14477 from biglobster/keliang.
2016-08-07 09:28:32 +01:00
Bryan Cutler b1ebe182ca [SPARK-16932][DOCS] Changed programming guide to not reference old accumulator API in Scala
## What changes were proposed in this pull request?

In the programming guide, the accumulator section mixes up both the old and new APIs causing it to be confusing.  This is not necessary for Scala, so all references to the old API are removed.  For Java, it is somewhat fixed up except for the example of a custom accumulator because I don't think an API exists yet.  Python has not currently implemented the new API.

## How was this patch tested?
built doc locally

Author: Bryan Cutler <cutlerb@gmail.com>

Closes #14516 from BryanCutler/fixup-accumulator-programming-guide-SPARK-15702.
2016-08-07 09:06:59 +01:00
Michael Gummelt 7aaa5a01c1 document that Mesos cluster mode supports python
update docs to be consistent with SPARK-14645 https://issues.apache.org/jira/browse/SPARK-14645

Author: Michael Gummelt <mgummelt@mesosphere.io>

Closes #14514 from mgummelt/fix-docs.
2016-08-07 08:59:04 +01:00
Josh Rosen 4f5f9b670e [SPARK-16925] Master should call schedule() after all executor exit events, not only failures
## What changes were proposed in this pull request?

This patch fixes a bug in Spark's standalone Master which could cause applications to hang if tasks cause executors to exit with zero exit codes.

As an example of the bug, run

```
sc.parallelize(1 to 1, 1).foreachPartition { _ => System.exit(0) }
```

on a standalone cluster which has a single Spark application. This will cause all executors to die but those executors won't be replaced unless another Spark application or worker joins or leaves the cluster (or if an executor exits with a non-zero exit code). This behavior is caused by a bug in how the Master handles the `ExecutorStateChanged` event: the current implementation calls `schedule()` only if the executor exited with a non-zero exit code, so a task which causes a JVM to unexpectedly exit "cleanly" will skip the `schedule()` call.

This patch addresses this by modifying the `ExecutorStateChanged` to always unconditionally call `schedule()`. This should be safe because it should always be safe to call `schedule()`; adding extra `schedule()` calls can only affect performance and should not introduce correctness bugs.

## How was this patch tested?

I added a regression test in `DistributedSuite`.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #14510 from JoshRosen/SPARK-16925.
2016-08-06 19:29:19 -07:00
Nicholas Chammas 2dd0388617 [SPARK-16772][PYTHON][DOCS] Fix API doc references to UDFRegistration + Update "important classes"
## Proposed Changes

* Update the list of "important classes" in `pyspark.sql` to match 2.0.
* Fix references to `UDFRegistration` so that the class shows up in the docs. It currently [doesn't](http://spark.apache.org/docs/latest/api/python/pyspark.sql.html).
* Remove some unnecessary whitespace in the Python RST doc files.

I reused the [existing JIRA](https://issues.apache.org/jira/browse/SPARK-16772) I created last week for similar API doc fixes.

## How was this patch tested?

* I ran `lint-python` successfully.
* I ran `make clean build` on the Python docs and confirmed the results are as expected locally in my browser.

Author: Nicholas Chammas <nicholas.chammas@gmail.com>

Closes #14496 from nchammas/SPARK-16772-UDFRegistration.
2016-08-06 05:02:59 +01:00
Artur Sukhenko 14dba45208 [SPARK-16796][WEB UI] Mask spark.authenticate.secret on Spark environ…
## What changes were proposed in this pull request?

Mask `spark.authenticate.secret` on Spark environment page (Web UI).
This is addition to https://github.com/apache/spark/pull/14409

## How was this patch tested?
`./dev/run-tests`
[info] ScalaTest
[info] Run completed in 1 hour, 8 minutes, 38 seconds.
[info] Total number of tests run: 2166
[info] Suites: completed 65, aborted 0
[info] Tests: succeeded 2166, failed 0, canceled 0, ignored 590, pending 0
[info] All tests passed.

Author: Artur Sukhenko <artur.sukhenko@gmail.com>

Closes #14484 from Devian-ua/SPARK-16796.
2016-08-06 04:41:47 +01:00
hyukjinkwon 55d6dad6f2 [SPARK-16847][SQL] Prevent to potentially read corrupt statstics on binary in Parquet vectorized reader
## What changes were proposed in this pull request?

This problem was found in [PARQUET-251](https://issues.apache.org/jira/browse/PARQUET-251) and we disabled filter pushdown on binary columns in Spark before. We enabled this after upgrading Parquet but it seems there is potential incompatibility for Parquet files written in lower Spark versions.

Currently, this does not happen in normal Parquet reader. However, In Spark, we implemented a vectorized reader, separately with Parquet's standard API. For normal Parquet reader this is being handled but not in the vectorized reader.

It is okay to just pass `FileMetaData`. This is being handled in parquet-mr (See e3b95020f7). This will prevent loading corrupt statistics in each page in Parquet.

This PR replaces the deprecated usage of constructor.

## How was this patch tested?

N/A

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #14450 from HyukjinKwon/SPARK-16847.
2016-08-06 04:40:24 +01:00
Yin Huai e679bc3c1c [SPARK-16901] Hive settings in hive-site.xml may be overridden by Hive's default values
## What changes were proposed in this pull request?
When we create the HiveConf for metastore client, we use a Hadoop Conf as the base, which may contain Hive settings in hive-site.xml (https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/internal/SharedState.scala#L49). However, HiveConf's initialize function basically ignores the base Hadoop Conf and always its default values (i.e. settings with non-null default values) as the base (https://github.com/apache/hive/blob/release-1.2.1/common/src/java/org/apache/hadoop/hive/conf/HiveConf.java#L2687). So, even a user put javax.jdo.option.ConnectionURL in hive-site.xml, it is not used and Hive will use its default, which is jdbc:derby:;databaseName=metastore_db;create=true.

This issue only shows up when `spark.sql.hive.metastore.jars` is not set to builtin.

## How was this patch tested?
New test in HiveSparkSubmitSuite.

Author: Yin Huai <yhuai@databricks.com>

Closes #14497 from yhuai/SPARK-16901.
2016-08-05 15:52:02 -07:00
Yanbo Liang 6cbde337a5 [SPARK-16750][FOLLOW-UP][ML] Add transformSchema for StringIndexer/VectorAssembler and fix failed tests.
## What changes were proposed in this pull request?
This is follow-up for #14378. When we add ```transformSchema``` for all estimators and transformers, I found there are tests failed for ```StringIndexer``` and ```VectorAssembler```. So I moved these parts of work separately in this PR, to make it more clear to review.
The corresponding tests should throw ```IllegalArgumentException``` at schema validation period after we add ```transformSchema```. It's efficient that to throw exception at the start of ```fit``` or ```transform``` rather than during the process.

## How was this patch tested?
Modified unit tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #14455 from yanboliang/transformSchema.
2016-08-05 22:07:59 +01:00
Ekasit Kijsipongse 1f96c97f23 [SPARK-13238][CORE] Add ganglia dmax parameter
The current ganglia reporter doesn't set metric expiration time (dmax). The metrics of all finished applications are indefinitely left displayed in ganglia web. The dmax parameter allows user to set the lifetime of the metrics. The default value is 0 for compatibility with previous versions.

Author: Ekasit Kijsipongse <ekasitk@gmail.com>

Closes #11127 from ekasitk/ganglia-dmax.
2016-08-05 13:07:52 -07:00
Bryan Cutler 180fd3e0a3 [SPARK-16421][EXAMPLES][ML] Improve ML Example Outputs
## What changes were proposed in this pull request?
Improve example outputs to better reflect the functionality that is being presented.  This mostly consisted of modifying what was printed at the end of the example, such as calling show() with truncate=False, but sometimes required minor tweaks in the example data to get relevant output.  Explicitly set parameters when they are used as part of the example.  Fixed Java examples that failed to run because of using old-style MLlib Vectors or problem with schema.  Synced examples between different APIs.

## How was this patch tested?
Ran each example for Scala, Python, and Java and made sure output was legible on a terminal of width 100.

Author: Bryan Cutler <cutlerb@gmail.com>

Closes #14308 from BryanCutler/ml-examples-improve-output-SPARK-16260.
2016-08-05 20:57:46 +01:00
Sylvain Zimmer 2460f03ffe [SPARK-16826][SQL] Switch to java.net.URI for parse_url()
## What changes were proposed in this pull request?
The java.net.URL class has a globally synchronized Hashtable, which limits the throughput of any single executor doing lots of calls to parse_url(). Tests have shown that a 36-core machine can only get to 10% CPU use because the threads are locked most of the time.

This patch switches to java.net.URI which has less features than java.net.URL but focuses on URI parsing, which is enough for parse_url().

New tests were added to make sure a few common edge cases didn't change behaviour.
https://issues.apache.org/jira/browse/SPARK-16826

## How was this patch tested?
I've kept the old URL code commented for now, so that people can verify that the new unit tests do pass with java.net.URL.

Thanks to srowen for the help!

Author: Sylvain Zimmer <sylvain@sylvainzimmer.com>

Closes #14488 from sylvinus/master.
2016-08-05 20:55:58 +01:00
Yuming Wang 39a2b2ea74 [SPARK-16625][SQL] General data types to be mapped to Oracle
## What changes were proposed in this pull request?

Spark will convert **BooleanType** to **BIT(1)**, **LongType** to **BIGINT**, **ByteType**  to **BYTE** when saving DataFrame to Oracle, but Oracle does not support BIT, BIGINT and BYTE types.

This PR is convert following _Spark Types_ to _Oracle types_ refer to [Oracle Developer's Guide](https://docs.oracle.com/cd/E19501-01/819-3659/gcmaz/)

Spark Type | Oracle
----|----
BooleanType | NUMBER(1)
IntegerType | NUMBER(10)
LongType | NUMBER(19)
FloatType | NUMBER(19, 4)
DoubleType | NUMBER(19, 4)
ByteType | NUMBER(3)
ShortType | NUMBER(5)

## How was this patch tested?

Add new tests in [JDBCSuite.scala](22b0c2a422 (diff-dc4b58851b084b274df6fe6b189db84d)) and [OracleDialect.scala](22b0c2a422 (diff-5e0cadf526662f9281aa26315b3750ad))

Author: Yuming Wang <wgyumg@gmail.com>

Closes #14377 from wangyum/SPARK-16625.
2016-08-05 16:11:54 +01:00
petermaxlee e026064143 [MINOR] Update AccumulatorV2 doc to not mention "+=".
## What changes were proposed in this pull request?
As reported by Bryan Cutler on the mailing list, AccumulatorV2 does not have a += method, yet the documentation still references it.

## How was this patch tested?
N/A

Author: petermaxlee <petermaxlee@gmail.com>

Closes #14466 from petermaxlee/accumulator.
2016-08-05 11:06:36 +01:00
cody koeninger c9f2501af2 [SPARK-16312][STREAMING][KAFKA][DOC] Doc for Kafka 0.10 integration
## What changes were proposed in this pull request?
Doc for the Kafka 0.10 integration

## How was this patch tested?
Scala code examples were taken from my example repo, so hopefully they compile.

Author: cody koeninger <cody@koeninger.org>

Closes #14385 from koeninger/SPARK-16312.
2016-08-05 10:13:32 +01:00
Wenchen Fan 5effc016c8 [SPARK-16879][SQL] unify logical plans for CREATE TABLE and CTAS
## What changes were proposed in this pull request?

we have various logical plans for CREATE TABLE and CTAS: `CreateTableUsing`, `CreateTableUsingAsSelect`, `CreateHiveTableAsSelectLogicalPlan`. This PR unifies them to reduce the complexity and centralize the error handling.

## How was this patch tested?

existing tests

Author: Wenchen Fan <wenchen@databricks.com>

Closes #14482 from cloud-fan/table.
2016-08-05 10:50:26 +02:00
Hiroshi Inoue faaefab26f [SPARK-15726][SQL] Make DatasetBenchmark fairer among Dataset, DataFrame and RDD
## What changes were proposed in this pull request?

DatasetBenchmark compares the performances of RDD, DataFrame and Dataset while running the same operations. However, there are two problems that make the comparisons unfair.

1) In backToBackMap test case, only DataFrame implementation executes less work compared to RDD or Dataset implementations. This test case processes Long+String pairs, but the output from the DataFrame implementation does not include String part while RDD or Dataset generates Long+String pairs as output. This difference significantly changes the performance characteristics due to the String manipulation and creation overheads.

2) In back-to-back map and back-to-back filter test cases, `map` or `filter` operation is executed only once regardless of `numChains` parameter for RDD. Hence the execution times for RDD have been largely underestimated.

Of course, these issues do not affect Spark users, but it may confuse Spark developers.

## How was this patch tested?
By executing the DatasetBenchmark

Author: Hiroshi Inoue <inouehrs@jp.ibm.com>

Closes #13459 from inouehrs/fix_benchmark_fairness.
2016-08-05 16:00:25 +08:00
Sean Zhong 1fa644497a [SPARK-16907][SQL] Fix performance regression for parquet table when vectorized parquet record reader is not being used
## What changes were proposed in this pull request?

For non-partitioned parquet table, if the vectorized parquet record reader is not being used, Spark 2.0 adds an extra unnecessary memory copy to append partition values for each row.

There are several typical cases that vectorized parquet record reader is not being used:
1. When the table schema is not flat, like containing nested fields.
2. When `spark.sql.parquet.enableVectorizedReader = false`

By fixing this bug, we get about 20% - 30% performance gain in test case like this:

```
// Generates parquet table with nested columns
spark.range(100000000).select(struct($"id").as("nc")).write.parquet("/tmp/data4")

def time[R](block: => R): Long = {
    val t0 = System.nanoTime()
    val result = block    // call-by-name
    val t1 = System.nanoTime()
    println("Elapsed time: " + (t1 - t0)/1000000 + "ms")
    (t1 - t0)/1000000
}

val x = ((0 until 20).toList.map(x => time(spark.read.parquet("/tmp/data4").filter($"nc.id" < 100).collect()))).sum/20
```

## How was this patch tested?

After a few times warm up, we get 26% performance improvement

Before fix:
```
Average: 4584ms, raw data (10 tries): 4726ms 4509ms 4454ms 4879ms 4586ms 4733ms 4500ms 4361ms 4456ms 4640ms
```

After fix:
```
Average: 3614ms, raw data(10 tries): 3554ms 3740ms 4019ms 3439ms 3460ms 3664ms 3557ms 3584ms 3612ms 3531ms
```

Test env: Intel(R) Core(TM) i7-6700 CPU  3.40GHz, Intel SSD SC2KW24

Author: Sean Zhong <seanzhong@databricks.com>

Closes #14445 from clockfly/fix_parquet_regression_2.
2016-08-05 11:19:20 +08:00
Marcelo Vanzin 53e766cfe2 MAINTENANCE. Cleaning up stale PRs.
Closing the following PRs due to requests or unresponsive users.

Closes #13923
Closes #14462
Closes #13123
Closes #14423 (requested by srowen)
Closes #14424 (requested by srowen)
Closes #14101 (requested by jkbradley)
Closes #10676 (requested by srowen)
Closes #10943 (requested by yhuai)
Closes #9936
Closes #10701
Closes #10474
Closes #13248
Closes #14347
Closes #10356
Closes #9866
Closes #14310 (requested by srowen)
Closes #14390 (requested by srowen)
Closes #14343 (requested by srowen)
Closes #14402 (requested by srowen)
Closes #14437 (requested by srowen)
Closes #12000 (already merged)
2016-08-04 16:33:03 -07:00
Josh Rosen d91c6755ae [HOTFIX] Remove unnecessary imports from #12944 that broke build
Author: Josh Rosen <joshrosen@databricks.com>

Closes #14499 from JoshRosen/hotfix.
2016-08-04 15:26:27 -07:00
Sital Kedia 9c15d079df [SPARK-15074][SHUFFLE] Cache shuffle index file to speedup shuffle fetch
## What changes were proposed in this pull request?

Shuffle fetch on large intermediate dataset is slow because the shuffle service open/close the index file for each shuffle fetch. This change introduces a cache for the index information so that we can avoid accessing the index files for each block fetch

## How was this patch tested?

Tested by running a job on the cluster and the shuffle read time was reduced by 50%.

Author: Sital Kedia <skedia@fb.com>

Closes #12944 from sitalkedia/shuffle_service.
2016-08-04 14:54:38 -07:00
Zheng RuiFeng 0e2e5d7d0b [SPARK-16863][ML] ProbabilisticClassifier.fit check threshoulds' length
## What changes were proposed in this pull request?

Add threshoulds' length checking for Classifiers which extends ProbabilisticClassifier

## How was this patch tested?

unit tests and manual tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #14470 from zhengruifeng/classifier_check_setThreshoulds_length.
2016-08-04 21:44:54 +01:00
hyukjinkwon 1d781572e8 [SPARK-16877][BUILD] Add rules for preventing to use Java annotations (Deprecated and Override)
## What changes were proposed in this pull request?

This PR adds both rules for preventing to use `Deprecated` and `Override`.

- Java's `Override`
  It seems Scala compiler just ignores this. Apparently, `override` modifier is only mandatory for " that override some other **concrete member definition** in a parent class" but not for for **incomplete member definition** (such as ones from trait or abstract), see (http://www.scala-lang.org/files/archive/spec/2.11/05-classes-and-objects.html#override)

  For a simple example,

  - Normal class - needs `override` modifier

  ```bash
  scala> class A { def say = {}}
  defined class A

  scala> class B extends A { def say = {}}
  <console>:8: error: overriding method say in class A of type => Unit;
   method say needs `override' modifier
         class B extends A { def say = {}}
                                 ^
  ```

  - Trait - does not need `override` modifier

  ```bash
  scala> trait A { def say }
  defined trait A

  scala> class B extends A { def say = {}}
  defined class B
  ```

  To cut this short, this case below is possible,

  ```bash
  scala> class B extends A {
       |    Override
       |    def say = {}
       | }
  defined class B
  ```
  we can write `Override` annotation (meaning nothing) which might confuse engineers that Java's annotation is working fine. It might be great if we prevent those potential confusion.

- Java's `Deprecated`
  When `Deprecated` is used,  it seems Scala compiler recognises this correctly but it seems we use Scala one `deprecated` across codebase.

## How was this patch tested?

Manually tested, by inserting both `Override` and `Deprecated`. This will shows the error messages as below:

```bash
Scalastyle checks failed at following occurrences:
[error] ... : deprecated should be used instead of java.lang.Deprecated.
```

```basg
Scalastyle checks failed at following occurrences:
[error] ... : override modifier should be used instead of java.lang.Override.
```

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #14490 from HyukjinKwon/SPARK-16877.
2016-08-04 21:43:05 +01:00
WeichenXu 462784ffad [SPARK-16880][ML][MLLIB] make ann training data persisted if needed
## What changes were proposed in this pull request?

To Make sure ANN layer input training data to be persisted,
so that it can avoid overhead cost if the RDD need to be computed from lineage.

## How was this patch tested?

Existing Tests.

Author: WeichenXu <WeichenXu123@outlook.com>

Closes #14483 from WeichenXu123/add_ann_persist_training_data.
2016-08-04 21:41:35 +01:00
Zheng RuiFeng be8ea4b2f7 [SPARK-16875][SQL] Add args checking for DataSet randomSplit and sample
## What changes were proposed in this pull request?

Add the missing args-checking for randomSplit and sample

## How was this patch tested?
unit tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #14478 from zhengruifeng/fix_randomSplit.
2016-08-04 21:39:45 +01:00
Eric Liang ac2a26d09e [SPARK-16884] Move DataSourceScanExec out of ExistingRDD.scala file
## What changes were proposed in this pull request?

This moves DataSourceScanExec out so it's more discoverable, and now that it doesn't necessarily depend on an existing RDD.  cc davies

## How was this patch tested?

Existing tests.

Author: Eric Liang <ekl@databricks.com>

Closes #14487 from ericl/split-scan.
2016-08-04 11:22:55 -07:00
Davies Liu 9d4e6212fa [SPARK-16802] [SQL] fix overflow in LongToUnsafeRowMap
## What changes were proposed in this pull request?

This patch fix the overflow in LongToUnsafeRowMap when the range of key is very wide (the key is much much smaller then minKey, for example, key is Long.MinValue, minKey is > 0).

## How was this patch tested?

Added regression test (also for SPARK-16740)

Author: Davies Liu <davies@databricks.com>

Closes #14464 from davies/fix_overflow.
2016-08-04 11:20:17 -07:00
Sean Zhong 9d7a47406e [SPARK-16853][SQL] fixes encoder error in DataSet typed select
## What changes were proposed in this pull request?

For DataSet typed select:
```
def select[U1: Encoder](c1: TypedColumn[T, U1]): Dataset[U1]
```
If type T is a case class or a tuple class that is not atomic, the resulting logical plan's schema will mismatch with `Dataset[T]` encoder's schema, which will cause encoder error and throw AnalysisException.

### Before change:
```
scala> case class A(a: Int, b: Int)
scala> Seq((0, A(1,2))).toDS.select($"_2".as[A])
org.apache.spark.sql.AnalysisException: cannot resolve '`a`' given input columns: [_2];
..
```

### After change:
```
scala> case class A(a: Int, b: Int)
scala> Seq((0, A(1,2))).toDS.select($"_2".as[A]).show
+---+---+
|  a|  b|
+---+---+
|  1|  2|
+---+---+
```

## How was this patch tested?

Unit test.

Author: Sean Zhong <seanzhong@databricks.com>

Closes #14474 from clockfly/SPARK-16853.
2016-08-04 19:45:47 +08:00
Wenchen Fan 43f4fd6f9b [SPARK-16867][SQL] createTable and alterTable in ExternalCatalog should not take db
## What changes were proposed in this pull request?

These 2 methods take `CatalogTable` as parameter, which already have the database information.

## How was this patch tested?

existing test

Author: Wenchen Fan <wenchen@databricks.com>

Closes #14476 from cloud-fan/minor5.
2016-08-04 16:48:30 +08:00
Sean Zhong 27e815c31d [SPARK-16888][SQL] Implements eval method for expression AssertNotNull
## What changes were proposed in this pull request?

Implements `eval()` method for expression `AssertNotNull` so that we can convert local projection on LocalRelation to another LocalRelation.

### Before change:
```
scala> import org.apache.spark.sql.catalyst.dsl.expressions._
scala> import org.apache.spark.sql.catalyst.expressions.objects.AssertNotNull
scala> import org.apache.spark.sql.Column
scala> case class A(a: Int)
scala> Seq((A(1),2)).toDS().select(new Column(AssertNotNull("_1".attr, Nil))).explain

java.lang.UnsupportedOperationException: Only code-generated evaluation is supported.
  at org.apache.spark.sql.catalyst.expressions.objects.AssertNotNull.eval(objects.scala:850)
  ...
```

### After the change:
```
scala> Seq((A(1),2)).toDS().select(new Column(AssertNotNull("_1".attr, Nil))).explain(true)

== Parsed Logical Plan ==
'Project [assertnotnull('_1) AS assertnotnull(_1)#5]
+- LocalRelation [_1#2, _2#3]

== Analyzed Logical Plan ==
assertnotnull(_1): struct<a:int>
Project [assertnotnull(_1#2) AS assertnotnull(_1)#5]
+- LocalRelation [_1#2, _2#3]

== Optimized Logical Plan ==
LocalRelation [assertnotnull(_1)#5]

== Physical Plan ==
LocalTableScan [assertnotnull(_1)#5]
```

## How was this patch tested?

Unit test.

Author: Sean Zhong <seanzhong@databricks.com>

Closes #14486 from clockfly/assertnotnull_eval.
2016-08-04 13:43:25 +08:00
Cheng Lian 780c7224a5 [MINOR][SQL] Fix minor formatting issue of SortAggregateExec.toString
## What changes were proposed in this pull request?

This PR fixes a minor formatting issue (missing space after comma) of `SorgAggregateExec.toString`.

Before:

```
SortAggregate(key=[a#76,b#77], functions=[max(c#78),min(c#78)], output=[a#76,b#77,max(c)#89,min(c)#90])
+- *Sort [a#76 ASC, b#77 ASC], false, 0
   +- Exchange hashpartitioning(a#76, b#77, 200)
      +- SortAggregate(key=[a#76,b#77], functions=[partial_max(c#78),partial_min(c#78)], output=[a#76,b#77,max#99,min#100])
         +- *Sort [a#76 ASC, b#77 ASC], false, 0
            +- LocalTableScan <empty>, [a#76, b#77, c#78]
```

After:

```
SortAggregate(key=[a#76, b#77], functions=[max(c#78), min(c#78)], output=[a#76, b#77, max(c)#89, min(c)#90])
+- *Sort [a#76 ASC, b#77 ASC], false, 0
   +- Exchange hashpartitioning(a#76, b#77, 200)
      +- SortAggregate(key=[a#76, b#77], functions=[partial_max(c#78), partial_min(c#78)], output=[a#76, b#77, max#99, min#100])
         +- *Sort [a#76 ASC, b#77 ASC], false, 0
            +- LocalTableScan <empty>, [a#76, b#77, c#78]
```

## How was this patch tested?

Manually tested.

Author: Cheng Lian <lian@databricks.com>

Closes #14480 from liancheng/fix-sort-based-agg-string-format.
2016-08-04 13:32:43 +08:00
sharkd 583d91a195 [SPARK-16873][CORE] Fix SpillReader NPE when spillFile has no data
## What changes were proposed in this pull request?

SpillReader NPE when spillFile has no data. See follow logs:

16/07/31 20:54:04 INFO collection.ExternalSorter: spill memory to file:/data4/yarnenv/local/usercache/tesla/appcache/application_1465785263942_56138/blockmgr-db5f46c3-d7a4-4f93-8b77-565e469696fb/09/temp_shuffle_ec3ece08-4569-4197-893a-4a5dfcbbf9fa, fileSize:0.0 B
16/07/31 20:54:04 WARN memory.TaskMemoryManager: leak 164.3 MB memory from org.apache.spark.util.collection.ExternalSorter3db4b52d
16/07/31 20:54:04 ERROR executor.Executor: Managed memory leak detected; size = 190458101 bytes, TID = 2358516/07/31 20:54:04 ERROR executor.Executor: Exception in task 1013.0 in stage 18.0 (TID 23585)
java.lang.NullPointerException
	at org.apache.spark.util.collection.ExternalSorter$SpillReader.cleanup(ExternalSorter.scala:624)
	at org.apache.spark.util.collection.ExternalSorter$SpillReader.nextBatchStream(ExternalSorter.scala:539)
	at org.apache.spark.util.collection.ExternalSorter$SpillReader.<init>(ExternalSorter.scala:507)
	at org.apache.spark.util.collection.ExternalSorter$SpillableIterator.spill(ExternalSorter.scala:816)
	at org.apache.spark.util.collection.ExternalSorter.forceSpill(ExternalSorter.scala:251)
	at org.apache.spark.util.collection.Spillable.spill(Spillable.scala:109)
	at org.apache.spark.memory.TaskMemoryManager.acquireExecutionMemory(TaskMemoryManager.java:154)
	at org.apache.spark.memory.TaskMemoryManager.allocatePage(TaskMemoryManager.java:249)
	at org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:112)
	at org.apache.spark.shuffle.sort.ShuffleExternalSorter.acquireNewPageIfNecessary(ShuffleExternalSorter.java:346)
	at org.apache.spark.shuffle.sort.ShuffleExternalSorter.insertRecord(ShuffleExternalSorter.java:367)
	at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.insertRecordIntoSorter(UnsafeShuffleWriter.java:237)
	at org.apache.spark.shuffle.sort.UnsafeShuffleWriter.write(UnsafeShuffleWriter.java:164)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
	at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
	at org.apache.spark.scheduler.Task.run(Task.scala:89)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
	at java.lang.Thread.run(Thread.java:744)
16/07/31 20:54:30 INFO executor.Executor: Executor is trying to kill task 1090.1 in stage 18.0 (TID 23793)
16/07/31 20:54:30 INFO executor.CoarseGrainedExecutorBackend: Driver commanded a shutdown

## How was this patch tested?

Manual test.

Author: sharkd <sharkd.tu@gmail.com>
Author: sharkdtu <sharkdtu@tencent.com>

Closes #14479 from sharkdtu/master.
2016-08-03 19:20:34 -07:00
Holden Karau c5eb1df72f [SPARK-16814][SQL] Fix deprecated parquet constructor usage
## What changes were proposed in this pull request?

Replace deprecated ParquetWriter with the new builders

## How was this patch tested?

Existing tests

Author: Holden Karau <holden@us.ibm.com>

Closes #14419 from holdenk/SPARK-16814-fix-deprecated-parquet-constructor-usage.
2016-08-03 17:08:51 -07:00
Stefan Schulze 4775eb414f [SPARK-16770][BUILD] Fix JLine dependency management and version (Sca…
## What changes were proposed in this pull request?
As of Scala 2.11.x there is no longer a org.scala-lang:jline version aligned to the scala version itself. Scala console now uses the plain jline:jline module. Spark's  dependency management did not reflect this change properly, causing Maven to pull in Jline via transitive dependency. Unfortunately Jline 2.12 contained a minor but very annoying bug rendering the shell almost useless for developers with german keyboard layout. This request contains the following chages:
- Exclude transitive dependency 'jline:jline' from hive-exec module
- Remove global properties 'jline.version' and 'jline.groupId'
- Add both properties and dependency to 'scala-2.11' profile
- Add explicit dependency on 'jline:jline' to  module 'spark-repl'

## How was this patch tested?
- Running mvn dependency:tree and checking for correct Jline version 2.12.1
- Running full builds with assembly and checking for jline-2.12.1.jar in 'lib' folder of generated tarball

Author: Stefan Schulze <stefan.schulze@pentasys.de>

Closes #14429 from stsc-pentasys/SPARK-16770.
2016-08-03 17:07:10 -07:00
Kevin McHale 685b08e261 [SPARK-14204][SQL] register driverClass rather than user-specified class
This is a pull request that was originally merged against branch-1.6 as #12000, now being merged into master as well.  srowen zzcclp JoshRosen

This pull request fixes an issue in which cluster-mode executors fail to properly register a JDBC driver when the driver is provided in a jar by the user, but the driver class name is derived from a JDBC URL (rather than specified by the user). The consequence of this is that all JDBC accesses under the described circumstances fail with an IllegalStateException. I reported the issue here: https://issues.apache.org/jira/browse/SPARK-14204

My proposed solution is to have the executors register the JDBC driver class under all circumstances, not only when the driver is specified by the user.

This patch was tested manually. I built an assembly jar, deployed it to a cluster, and confirmed that the problem was fixed.

Author: Kevin McHale <kevin@premise.com>

Closes #14420 from mchalek/mchalek-jdbc_driver_registration.
2016-08-03 13:15:13 -07:00
Eric Liang e6f226c567 [SPARK-16596] [SQL] Refactor DataSourceScanExec to do partition discovery at execution instead of planning time
## What changes were proposed in this pull request?

Partition discovery is rather expensive, so we should do it at execution time instead of during physical planning. Right now there is not much benefit since ListingFileCatalog will read scan for all partitions at planning time anyways, but this can be optimized in the future. Also, there might be more information for partition pruning not available at planning time.

This PR moves a lot of the file scan logic from planning to execution time. All file scan operations are handled by `FileSourceScanExec`, which handles both batched and non-batched file scans. This requires some duplication with `RowDataSourceScanExec`, but is probably worth it so that `FileSourceScanExec` does not need to depend on an input RDD.

TODO: In another pr, move DataSourceScanExec to it's own file.

## How was this patch tested?

Existing tests (it might be worth adding a test that catalog.listFiles() is delayed until execution, but this can be delayed until there is an actual benefit to doing so).

Author: Eric Liang <ekl@databricks.com>

Closes #14241 from ericl/refactor.
2016-08-03 11:19:55 -07:00
Wenchen Fan b55f34370f [SPARK-16714][SPARK-16735][SPARK-16646] array, map, greatest, least's type coercion should handle decimal type
## What changes were proposed in this pull request?

Here is a table about the behaviours of `array`/`map` and `greatest`/`least` in Hive, MySQL and Postgres:

|    |Hive|MySQL|Postgres|
|---|---|---|---|---|
|`array`/`map`|can find a wider type with decimal type arguments, and will truncate the wider decimal type if necessary|can find a wider type with decimal type arguments, no truncation problem|can find a wider type with decimal type arguments, no truncation problem|
|`greatest`/`least`|can find a wider type with decimal type arguments, and truncate if necessary, but can't do string promotion|can find a wider type with decimal type arguments, no truncation problem, but can't do string promotion|can find a wider type with decimal type arguments, no truncation problem, but can't do string promotion|

I think these behaviours makes sense and Spark SQL should follow them.

This PR fixes `array` and `map` by using `findWiderCommonType` to get the wider type.
This PR fixes `greatest` and `least` by add a `findWiderTypeWithoutStringPromotion`, which provides similar semantic of `findWiderCommonType`, but without string promotion.

## How was this patch tested?

new tests in `TypeCoersionSuite`

Author: Wenchen Fan <wenchen@databricks.com>
Author: Yin Huai <yhuai@databricks.com>

Closes #14439 from cloud-fan/bug.
2016-08-03 11:15:09 -07:00
=^_^= 639df046a2 [SPARK-16831][PYTHON] Fixed bug in CrossValidator.avgMetrics
## What changes were proposed in this pull request?

avgMetrics was summed, not averaged, across folds

Author: =^_^= <maxmoroz@gmail.com>

Closes #14456 from pkch/pkch-patch-1.
2016-08-03 04:18:28 -07:00
Wenchen Fan ae226283e1 [SQL][MINOR] use stricter type parameter to make it clear that parquet reader returns UnsafeRow
## What changes were proposed in this pull request?

a small code style change, it's better to make the type parameter more accurate.

## How was this patch tested?

N/A

Author: Wenchen Fan <wenchen@databricks.com>

Closes #14458 from cloud-fan/parquet.
2016-08-03 08:23:26 +08:00
Artur Sukhenko 3861273771 [SPARK-16796][WEB UI] Visible passwords on Spark environment page
## What changes were proposed in this pull request?

Mask spark.ssl.keyPassword, spark.ssl.keyStorePassword, spark.ssl.trustStorePassword in Web UI environment page.
(Changes their values to ***** in env. page)

## How was this patch tested?

I've built spark, run spark shell and checked that this values have been masked with *****.

Also run tests:
./dev/run-tests

[info] ScalaTest
[info] Run completed in 1 hour, 9 minutes, 5 seconds.
[info] Total number of tests run: 2166
[info] Suites: completed 65, aborted 0
[info] Tests: succeeded 2166, failed 0, canceled 0, ignored 590, pending 0
[info] All tests passed.

![mask](https://cloud.githubusercontent.com/assets/15244468/17262154/7641e132-55e2-11e6-8a6c-30ead77c7372.png)

Author: Artur Sukhenko <artur.sukhenko@gmail.com>

Closes #14409 from Devian-ua/maskpass.
2016-08-02 16:13:12 -07:00
gatorsmile b73a570603 [SPARK-16858][SQL][TEST] Removal of TestHiveSharedState
### What changes were proposed in this pull request?
This PR is to remove `TestHiveSharedState`.

Also, this is also associated with the Hive refractoring for removing `HiveSharedState`.

### How was this patch tested?
The existing test cases

Author: gatorsmile <gatorsmile@gmail.com>

Closes #14463 from gatorsmile/removeTestHiveSharedState.
2016-08-02 14:17:45 -07:00
Josh Rosen e9fc0b6a8b [SPARK-16787] SparkContext.addFile() should not throw if called twice with the same file
## What changes were proposed in this pull request?

The behavior of `SparkContext.addFile()` changed slightly with the introduction of the Netty-RPC-based file server, which was introduced in Spark 1.6 (where it was disabled by default) and became the default / only file server in Spark 2.0.0.

Prior to 2.0, calling `SparkContext.addFile()` with files that have the same name and identical contents would succeed. This behavior was never explicitly documented but Spark has behaved this way since very early 1.x versions.

In 2.0 (or 1.6 with the Netty file server enabled), the second `addFile()` call will fail with a requirement error because NettyStreamManager tries to guard against duplicate file registration.

This problem also affects `addJar()` in a more subtle way: the `fileServer.addJar()` call will also fail with an exception but that exception is logged and ignored; I believe that the problematic exception-catching path was mistakenly copied from some old code which was only relevant to very old versions of Spark and YARN mode.

I believe that this change of behavior was unintentional, so this patch weakens the `require` check so that adding the same filename at the same path will succeed.

At file download time, Spark tasks will fail with exceptions if an executor already has a local copy of a file and that file's contents do not match the contents of the file being downloaded / added. As a result, it's important that we prevent files with the same name and different contents from being served because allowing that can effectively brick an executor by preventing it from successfully launching any new tasks. Before this patch's change, this was prevented by forbidding `addFile()` from being called twice on files with the same name. Because Spark does not defensively copy local files that are passed to `addFile` it is vulnerable to files' contents changing, so I think it's okay to rely on an implicit assumption that these files are intended to be immutable (since if they _are_ mutable then this can lead to either explicit task failures or implicit incorrectness (in case new executors silently get newer copies of the file while old executors continue to use an older version)). To guard against this, I have decided to only update the file addition timestamps on the first call to `addFile()`; duplicate calls will succeed but will not update the timestamp. This behavior is fine as long as we assume files are immutable, which seems reasonable given the behaviors described above.

As part of this change, I also improved the thread-safety of the `addedJars` and `addedFiles` maps; this is important because these maps may be concurrently read by a task launching thread and written by a driver thread in case the user's driver code is multi-threaded.

## How was this patch tested?

I added regression tests in `SparkContextSuite`.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #14396 from JoshRosen/SPARK-16787.
2016-08-02 12:02:11 -07:00
Wenchen Fan a9beeaaaeb [SPARK-16855][SQL] move Greatest and Least from conditionalExpressions.scala to arithmetic.scala
## What changes were proposed in this pull request?

`Greatest` and `Least` are not conditional expressions, but arithmetic expressions.

## How was this patch tested?

N/A

Author: Wenchen Fan <wenchen@databricks.com>

Closes #14460 from cloud-fan/move.
2016-08-02 11:08:32 -07:00
sandy cbdff49357 [SPARK-16816] Modify java example which is also reflect in documentation exmaple
## What changes were proposed in this pull request?

Modify java example which is also reflect in document.

## How was this patch tested?

run test cases.

Author: sandy <phalodi@gmail.com>

Closes #14436 from phalodi/SPARK-16816.
2016-08-02 10:34:01 -07:00
Herman van Hovell 2330f3ecbb [SPARK-16836][SQL] Add support for CURRENT_DATE/CURRENT_TIMESTAMP literals
## What changes were proposed in this pull request?
In Spark 1.6 (with Hive support) we could use `CURRENT_DATE` and `CURRENT_TIMESTAMP` functions as literals (without adding braces), for example:
```SQL
select /* Spark 1.6: */ current_date, /* Spark 1.6  & Spark 2.0: */ current_date()
```
This was accidentally dropped in Spark 2.0. This PR reinstates this functionality.

## How was this patch tested?
Added a case to ExpressionParserSuite.

Author: Herman van Hovell <hvanhovell@databricks.com>

Closes #14442 from hvanhovell/SPARK-16836.
2016-08-02 10:09:47 -07:00
Liang-Chi Hsieh 146001a9ff [SPARK-16062] [SPARK-15989] [SQL] Fix two bugs of Python-only UDTs
## What changes were proposed in this pull request?

There are two related bugs of Python-only UDTs. Because the test case of second one needs the first fix too. I put them into one PR. If it is not appropriate, please let me know.

### First bug: When MapObjects works on Python-only UDTs

`RowEncoder` will use `PythonUserDefinedType.sqlType` for its deserializer expression. If the sql type is `ArrayType`, we will have `MapObjects` working on it. But `MapObjects` doesn't consider `PythonUserDefinedType` as its input data type. It causes error like:

    import pyspark.sql.group
    from pyspark.sql.tests import PythonOnlyPoint, PythonOnlyUDT
    from pyspark.sql.types import *

    schema = StructType().add("key", LongType()).add("val", PythonOnlyUDT())
    df = spark.createDataFrame([(i % 3, PythonOnlyPoint(float(i), float(i))) for i in range(10)], schema=schema)
    df.show()

    File "/home/spark/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", line 312, in get_return_value py4j.protocol.Py4JJavaError: An error occurred while calling o36.showString.
    : java.lang.RuntimeException: Error while decoding: scala.MatchError: org.apache.spark.sql.types.PythonUserDefinedTypef4ceede8 (of class org.apache.spark.sql.types.PythonUserDefinedType)
    ...

### Second bug: When Python-only UDTs is the element type of ArrayType

    import pyspark.sql.group
    from pyspark.sql.tests import PythonOnlyPoint, PythonOnlyUDT
    from pyspark.sql.types import *

    schema = StructType().add("key", LongType()).add("val", ArrayType(PythonOnlyUDT()))
    df = spark.createDataFrame([(i % 3, [PythonOnlyPoint(float(i), float(i))]) for i in range(10)], schema=schema)
    df.show()

## How was this patch tested?
PySpark's sql tests.

Author: Liang-Chi Hsieh <simonh@tw.ibm.com>

Closes #13778 from viirya/fix-pyudt.
2016-08-02 10:08:18 -07:00
Tom Magrino 1dab63d8d3 [SPARK-16837][SQL] TimeWindow incorrectly drops slideDuration in constructors
## What changes were proposed in this pull request?

Fix of incorrect arguments (dropping slideDuration and using windowDuration) in constructors for TimeWindow.

The JIRA this addresses is here: https://issues.apache.org/jira/browse/SPARK-16837

## How was this patch tested?

Added a test to TimeWindowSuite to check that the results of TimeWindow object apply and TimeWindow class constructor are equivalent.

Author: Tom Magrino <tmagrino@fb.com>

Closes #14441 from tmagrino/windowing-fix.
2016-08-02 09:16:44 -07:00
Shuai Lin 36827ddafe [SPARK-16822][DOC] Support latex in scaladoc.
## What changes were proposed in this pull request?

Support using latex in scaladoc by adding MathJax javascript to the js template.

## How was this patch tested?

Generated scaladoc.  Preview:

- LogisticGradient: [before](https://spark.apache.org/docs/2.0.0/api/scala/index.html#org.apache.spark.mllib.optimization.LogisticGradient) and [after](https://sparkdocs.lins05.pw/spark-16822/api/scala/index.html#org.apache.spark.mllib.optimization.LogisticGradient)

- MinMaxScaler: [before](https://spark.apache.org/docs/2.0.0/api/scala/index.html#org.apache.spark.ml.feature.MinMaxScaler) and [after](https://sparkdocs.lins05.pw/spark-16822/api/scala/index.html#org.apache.spark.ml.feature.MinMaxScaler)

Author: Shuai Lin <linshuai2012@gmail.com>

Closes #14438 from lins05/spark-16822-support-latex-in-scaladoc.
2016-08-02 09:14:08 -07:00
Maciej Brynski 511dede111 [SPARK-15541] Casting ConcurrentHashMap to ConcurrentMap (master branch)
## What changes were proposed in this pull request?

Casting ConcurrentHashMap to ConcurrentMap allows to run code compiled with Java 8 on Java 7

## How was this patch tested?

Compilation. Existing automatic tests

Author: Maciej Brynski <maciej.brynski@adpilot.pl>

Closes #14459 from maver1ck/spark-15541-master.
2016-08-02 08:07:08 -07:00