Commit graph

16207 commits

Author SHA1 Message Date
wm624@hotmail.com bebe5f9811 [SPARK-15318][ML][EXAMPLE] spark.ml Collaborative Filtering example does not work in spark-shell
## What changes were proposed in this pull request?

(Please fill in changes proposed in this fix)

copy & paste example in ml-collaborative-filtering.html into spark-shell, we see the following errors.
scala> case class Rating(userId: Int, movieId: Int, rating: Float, timestamp: Long)
defined class Rating

scala> object Rating {
def parseRating(str: String): Rating = { | val fields = str.split("::") | assert(fields.size == 4) | Rating(fields(0).toInt, fields(1).toInt, fields(2).toFloat, fields(3).toLong) | }
}
<console>:29: error: Rating.type does not take parameters
Rating(fields(0).toInt, fields(1).toInt, fields(2).toFloat, fields(3).toLong)
^
In standard scala repl, it has the same error.

Scala/spark-shell repl has some quirks (e.g. packages are also not well supported).

The reason of errors is that scala/spark-shell repl discards previous definitions when we define the Object with the same class name. Solution: We can rename the Object Rating.

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)

Manually test it: 1). ./bin/run-example ALSExample
2). copy & paste example in the generated document. It works fine.

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes #13110 from wangmiao1981/repl.
2016-05-17 16:51:01 +01:00
Sean Owen 932d800293 [SPARK-15333][DOCS] Reorganize building-spark.md; rationalize vs wiki
## What changes were proposed in this pull request?

See JIRA for the motivation. The changes are almost entirely movement of text and edits to sections. Minor changes to text include:

- Copying in / merging text from the "Useful Developer Tools" wiki, in areas of
  - Docker
  - R
  - Running one test
- standardizing on ./build/mvn not mvn, and likewise for ./build/sbt
- correcting some typos
- standardizing code block formatting

No text has been removed from this doc; text has been imported from the https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools wiki

## How was this patch tested?

Jekyll doc build and inspection of resulting HTML in browser.

Author: Sean Owen <sowen@cloudera.com>

Closes #13124 from srowen/SPARK-15333.
2016-05-17 16:40:38 +01:00
wm624@hotmail.com 4134ff0c65 [SPARK-14434][ML] User guide doc and examples for GaussianMixture in spark.ml
## What changes were proposed in this pull request?

(Please fill in changes proposed in this fix)

Add guide doc and examples for GaussianMixture in Spark.ml in Java, Scala and Python.

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)

Manual compile and test all examples

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes #12788 from wangmiao1981/example.
2016-05-17 15:20:47 +02:00
Wenchen Fan c36ca651f9 [SPARK-15351][SQL] RowEncoder should support array as the external type for ArrayType
## What changes were proposed in this pull request?

This PR improves `RowEncoder` and `MapObjects`, to support array as the external type for `ArrayType`. The idea is straightforward, we use `Object` as the external input type for `ArrayType`, and determine its type at runtime in `MapObjects`.

## How was this patch tested?

new test in `RowEncoderSuite`

Author: Wenchen Fan <wenchen@databricks.com>

Closes #13138 from cloud-fan/map-object.
2016-05-17 17:02:52 +08:00
Sean Owen 122302cbf5 [SPARK-15290][BUILD] Move annotations, like @Since / @DeveloperApi, into spark-tags
## What changes were proposed in this pull request?

(See https://github.com/apache/spark/pull/12416 where most of this was already reviewed and committed; this is just the module structure and move part. This change does not move the annotations into test scope, which was the apparently problem last time.)

Rename `spark-test-tags` -> `spark-tags`; move common annotations like `Since` to `spark-tags`

## How was this patch tested?

Jenkins tests.

Author: Sean Owen <sowen@cloudera.com>

Closes #13074 from srowen/SPARK-15290.
2016-05-17 09:55:53 +01:00
Xiangrui Meng 8ad9f08c94 [SPARK-14906][ML] Copy linalg in PySpark to new ML package
## What changes were proposed in this pull request?

Copy the linalg (Vector/Matrix and VectorUDT/MatrixUDT) in PySpark to new ML package.

## How was this patch tested?
Existing tests.

Author: Xiangrui Meng <meng@databricks.com>
Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #13099 from viirya/move-pyspark-vector-matrix-udt4.
2016-05-17 00:08:02 -07:00
Liwei Lin 95f4fbae52 [SPARK-14942][SQL][STREAMING] Reduce delay between batch construction and execution
## Problem

Currently in `StreamExecution`, [we first run the batch, then construct the next](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala#L165):
```scala
if (dataAvailable) runBatch()
constructNextBatch()
```

This is good when we run batches ASAP, where data would get processed in the **very next batch**:

![1](https://cloud.githubusercontent.com/assets/15843379/14779964/2786e698-0b0d-11e6-9d2c-bb41513488b2.png)

However, when we run batches at trigger like `ProcessTime("1 minute")`, data - such as _y_ below - may not get processed in the very next batch i.e. _batch 1_, but in _batch 2_:

![2](https://cloud.githubusercontent.com/assets/15843379/14779818/6f3bb064-0b0c-11e6-9f16-c1ce4897186b.png)

## What changes were proposed in this pull request?

This patch reverses the order of `constructNextBatch()` and `runBatch()`. After this patch, data would get processed in the **very next batch**, i.e. _batch 1_:

![3](https://cloud.githubusercontent.com/assets/15843379/14779816/6f36ee62-0b0c-11e6-9e53-bc8397fade18.png)

In addition, this patch alters when we do `currentBatchId += 1`: let's do that when the processing of the current batch's data is completed, so we won't bother passing `currentBatchId + 1` or  `currentBatchId - 1` to states or sinks.

## How was this patch tested?

New added test case. Also this should be covered by existing test suits, e.g. stress tests and others.

Author: Liwei Lin <lwlin7@gmail.com>

Closes #12725 from lw-lin/construct-before-run-3.
2016-05-16 12:59:55 -07:00
Sean Owen fabc8e5b12 [SPARK-12972][CORE][TEST-MAVEN][TEST-HADOOP2.2] Update org.apache.httpcomponents.httpclient, commons-io
## What changes were proposed in this pull request?

This is sort of a hot-fix for https://github.com/apache/spark/pull/13117, but, the problem is limited to Hadoop 2.2. The change is to manage `commons-io` to 2.4 for all Hadoop builds, which is only a net change for Hadoop 2.2, which was using 2.1.

## How was this patch tested?

Jenkins tests -- normal PR builder, then the `[test-hadoop2.2] [test-maven]` if successful.

Author: Sean Owen <sowen@cloudera.com>

Closes #13132 from srowen/SPARK-12972.3.
2016-05-16 16:27:04 +01:00
Yanbo Liang f116a84ef8 [SPARK-14979][ML][PYSPARK] Add examples for GeneralizedLinearRegression
## What changes were proposed in this pull request?
Add Scala/Java/Python examples for ```GeneralizedLinearRegression```.

## How was this patch tested?
They are examples and have been tested offline.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #12754 from yanboliang/spark-14979.
2016-05-16 09:55:35 +02:00
wm624@hotmail.com c1836d66bd [SPARK-15305][ML][DOC] spark.ml document Bisectiong k-means has the incorrect format
## What changes were proposed in this pull request?

(Please fill in changes proposed in this fix)
The generated document has the incorrect format for biseckmeans.

![bug](https://cloud.githubusercontent.com/assets/5033592/15233120/d910098a-185a-11e6-901d-44aeafc8a011.jpg)

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)

Fix the formatting.
![fix](https://cloud.githubusercontent.com/assets/5033592/15233136/fce2ccd0-185a-11e6-9ded-14d71da4bdab.jpg)

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes #13083 from wangmiao1981/doc.
2016-05-16 08:22:16 +02:00
Sean Zhong 4a5ee1954a [SPARK-15253][SQL] Support old table schema config key "spark.sql.sources.schema" for DESCRIBE TABLE
## What changes were proposed in this pull request?

"DESCRIBE table" is broken when table schema is stored at key "spark.sql.sources.schema".

Originally, we used spark.sql.sources.schema to store the schema of a data source table.
After SPARK-6024, we removed this flag. Although we are not using spark.sql.sources.schema any more, we need to still support it.

## How was this patch tested?

Unit test.

When using spark2.0 to load a table generated by spark 1.2.
Before change:
`DESCRIBE table` => Schema of this table is inferred at runtime,,

After change:
`DESCRIBE table` => correct output.

Author: Sean Zhong <seanzhong@databricks.com>

Closes #13073 from clockfly/spark-15253.
2016-05-16 10:41:20 +08:00
Zheng RuiFeng c7efc56c7b [MINOR] Fix Typos
## What changes were proposed in this pull request?
1,Rename matrix args in BreezeUtil to upper to match the doc
2,Fix several typos in ML and SQL

## How was this patch tested?
manual tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #13078 from zhengruifeng/fix_ann.
2016-05-15 15:59:49 +01:00
Sean Owen f5576a052d [SPARK-12972][CORE] Update org.apache.httpcomponents.httpclient
## What changes were proposed in this pull request?

(Retry of https://github.com/apache/spark/pull/13049)

- update to httpclient 4.5 / httpcore 4.4
- remove some defunct exclusions
- manage httpmime version to match
- update selenium / httpunit to support 4.5 (possible now that Jetty 9 is used)

## How was this patch tested?

Jenkins tests. Also, locally running the same test command of one Jenkins profile that failed: `mvn -Phadoop-2.6 -Pyarn -Phive -Phive-thriftserver -Pkinesis-asl ...`

Author: Sean Owen <sowen@cloudera.com>

Closes #13117 from srowen/SPARK-12972.2.
2016-05-15 15:56:46 +01:00
wm624@hotmail.com 354f8f11bd [SPARK-15096][ML] LogisticRegression MultiClassSummarizer numClasses can fail if no valid labels are found
## What changes were proposed in this pull request?

(Please fill in changes proposed in this fix)
Throw better exception when numClasses is empty and empty.max is thrown.

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
Add a new unit test, which calls histogram with empty numClasses.

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes #12969 from wangmiao1981/logisticR.
2016-05-14 09:45:56 +01:00
Nicholas Tietz 0f1f31d3a6 [SPARK-15197][DOCS] Added Scaladoc for countApprox and countByValueApprox parameters
This pull request simply adds Scaladoc documentation of the parameters for countApprox and countByValueApprox.

This is an important documentation change, as it clarifies what should be passed in for the timeout. Without units, this was previously unclear.

I did not open a JIRA ticket per my understanding of the project contribution guidelines; as they state, the description in the ticket would be essentially just what is in the PR. If I should open one, let me know and I will do so.

Author: Nicholas Tietz <nicholas.tietz@crosschx.com>

Closes #12955 from ntietz/rdd-countapprox-docs.
2016-05-14 09:44:20 +01:00
Tejas Patil 4210e2a6b7 [TRIVIAL] Add () to SparkSession's builder function
## What changes were proposed in this pull request?

Was trying out `SparkSession` for the first time and the given class doc (when copied as is) did not work over Spark shell:

```
scala> SparkSession.builder().master("local").appName("Word Count").getOrCreate()
<console>:27: error: org.apache.spark.sql.SparkSession.Builder does not take parameters
       SparkSession.builder().master("local").appName("Word Count").getOrCreate()
```

Adding () to the builder method in SparkSession.

## How was this patch tested?

```
scala> SparkSession.builder().master("local").appName("Word Count").getOrCreate()
res0: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession65c17e38

scala> SparkSession.builder.master("local").appName("Word Count").getOrCreate()
res1: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession65c17e38
```

Author: Tejas Patil <tejasp@fb.com>

Closes #13086 from tejasapatil/doc_correction.
2016-05-13 18:10:22 -07:00
hyukjinkwon 3ded5bc4db [SPARK-15267][SQL] Refactor options for JDBC and ORC data sources and change default compression for ORC
## What changes were proposed in this pull request?

Currently, Parquet, JSON and CSV data sources have a class for thier options, (`ParquetOptions`, `JSONOptions` and `CSVOptions`).

It is convenient to manage options for sources to gather options into a class. Currently, `JDBC`, `Text`, `libsvm` and `ORC` datasources do not have this class. This might be nicer if these options are in a unified format so that options can be added and

This PR refactors the options in Spark internal data sources adding new classes, `OrcOptions`, `TextOptions`, `JDBCOptions` and `LibSVMOptions`.

Also, this PR change the default compression codec for ORC from `NONE` to `SNAPPY`.

## How was this patch tested?

Existing tests should cover this for refactoring and unittests in `OrcHadoopFsRelationSuite` for changing the default compression codec for ORC.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #13048 from HyukjinKwon/SPARK-15267.
2016-05-13 09:04:37 -07:00
Sean Owen 10a8389674 Revert "[SPARK-12972][CORE] Update org.apache.httpcomponents.httpclient"
This reverts commit c74a6c3f23.
2016-05-13 13:50:26 +01:00
Sean Owen c74a6c3f23 [SPARK-12972][CORE] Update org.apache.httpcomponents.httpclient
## What changes were proposed in this pull request?

- update httpcore/httpclient to latest
- centralize version management
- remove excludes that are no longer relevant according to SBT/Maven dep graphs
- also manage httpmime to match httpclient

## How was this patch tested?

Jenkins tests, plus review of dependency graphs from SBT/Maven, and review of test-dependencies.sh  output

Author: Sean Owen <sowen@cloudera.com>

Closes #13049 from srowen/SPARK-12972.
2016-05-13 09:00:50 +01:00
Holden Karau 382dbc12bb [SPARK-15061][PYSPARK] Upgrade to Py4J 0.10.1
## What changes were proposed in this pull request?

This upgrades to Py4J 0.10.1 which reduces syscal overhead in Java gateway ( see https://github.com/bartdag/py4j/issues/201 ). Related https://issues.apache.org/jira/browse/SPARK-6728 .

## How was this patch tested?

Existing doctests & unit tests pass

Author: Holden Karau <holden@us.ibm.com>

Closes #13064 from holdenk/SPARK-15061-upgrade-to-py4j-0.10.1.
2016-05-13 08:59:18 +01:00
wm624@hotmail.com bdff299f9e [SPARK-14900][ML] spark.ml classification metrics should include accuracy
## What changes were proposed in this pull request?

(Please fill in changes proposed in this fix)
Add accuracy to MulticlassMetrics class and add corresponding code in MulticlassClassificationEvaluator.

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
Scala Unit tests in ml.evaluation

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes #12882 from wangmiao1981/accuracy.
2016-05-13 08:29:37 +01:00
Reynold Xin e1dc853737 [SPARK-15310][SQL] Rename HiveTypeCoercion -> TypeCoercion
## What changes were proposed in this pull request?
We originally designed the type coercion rules to match Hive, but over time we have diverged. It does not make sense to call it HiveTypeCoercion anymore. This patch renames it TypeCoercion.

## How was this patch tested?
Updated unit tests to reflect the rename.

Author: Reynold Xin <rxin@databricks.com>

Closes #13091 from rxin/SPARK-15310.
2016-05-13 00:15:39 -07:00
BenFradet 31f1aebbeb [SPARK-13961][ML] spark.ml ChiSqSelector and RFormula should support other numeric types for label
## What changes were proposed in this pull request?

Made ChiSqSelector and RFormula accept all numeric types for label

## How was this patch tested?

Unit tests

Author: BenFradet <benjamin.fradet@gmail.com>

Closes #12467 from BenFradet/SPARK-13961.
2016-05-13 09:08:04 +02:00
sethah 5b849766ab [SPARK-15181][ML][PYSPARK] Python API for GLR summaries.
## What changes were proposed in this pull request?

This patch adds a python API for generalized linear regression summaries (training and test). This helps provide feature parity for Python GLMs.

## How was this patch tested?

Added a unit test to `pyspark.ml.tests`

Author: sethah <seth.hendrickson16@gmail.com>

Closes #12961 from sethah/GLR_summary.
2016-05-13 09:01:20 +02:00
Zheng RuiFeng 87d69a01f0 [MINOR][PYSPARK] update _shared_params_code_gen.py
## What changes were proposed in this pull request?

1, add arg-checkings for `tol` and `stepSize` to  keep in line with `SharedParamsCodeGen.scala`
2, fix one typo

## How was this patch tested?
local build

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #12996 from zhengruifeng/py_args_checking.
2016-05-13 08:52:06 +02:00
Holden Karau d1aadea05a [SPARK-15188] Add missing thresholds param to NaiveBayes in PySpark
## What changes were proposed in this pull request?

Add missing thresholds param to NiaveBayes

## How was this patch tested?
doctests

Author: Holden Karau <holden@us.ibm.com>

Closes #12963 from holdenk/SPARK-15188-add-missing-naive-bayes-param.
2016-05-13 08:39:59 +02:00
hyukjinkwon 51841d77d9 [SPARK-13866] [SQL] Handle decimal type in CSV inference at CSV data source.
## What changes were proposed in this pull request?

https://issues.apache.org/jira/browse/SPARK-13866

This PR adds the support to infer `DecimalType`.
Here are the rules between `IntegerType`, `LongType` and `DecimalType`.

#### Infering Types

1. `IntegerType` and then `LongType`are tried first.

  ```scala
  Int.MaxValue => IntegerType
  Long.MaxValue => LongType
  ```

2. If it fails, try `DecimalType`.

  ```scala
  (Long.MaxValue + 1) => DecimalType(20, 0)
  ```
  This does not try to infer this as `DecimalType` when scale is less than 0.

3. if it fails, try `DoubleType`
  ```scala
  0.1 => DoubleType // This is failed to be inferred as `DecimalType` because it has the scale, 1.
  ```

#### Compatible Types (Merging Types)

For merging types, this is the same with JSON data source. If `DecimalType` is not capable, then it becomes `DoubleType`

## How was this patch tested?

Unit tests were used and `./dev/run_tests` for code style test.

Author: hyukjinkwon <gurwls223@gmail.com>
Author: Hyukjin Kwon <gurwls223@gmail.com>

Closes #11724 from HyukjinKwon/SPARK-13866.
2016-05-12 22:31:14 -07:00
Reynold Xin eda2800d44 [SPARK-14541][SQL] Support IFNULL, NULLIF, NVL and NVL2
## What changes were proposed in this pull request?
This patch adds support for a few SQL functions to improve compatibility with other databases: IFNULL, NULLIF, NVL and NVL2. In order to do this, this patch introduced a RuntimeReplaceable expression trait that allows replacing an unevaluable expression in the optimizer before evaluation.

Note that the semantics are not completely identical to other databases in esoteric cases.

## How was this patch tested?
Added a new test suite SQLCompatibilityFunctionSuite.

Closes #12373.

Author: Reynold Xin <rxin@databricks.com>

Closes #13084 from rxin/SPARK-14541.
2016-05-12 22:18:39 -07:00
Reynold Xin ba169c3230 [SPARK-15306][SQL] Move object expressions into expressions.objects package
## What changes were proposed in this pull request?
This patch moves all the object related expressions into expressions.objects package, for better code organization.

## How was this patch tested?
N/A

Author: Reynold Xin <rxin@databricks.com>

Closes #13085 from rxin/SPARK-15306.
2016-05-12 21:35:14 -07:00
Sun Rui b3930f74a0 [SPARK-15202][SPARKR] add dapplyCollect() method for DataFrame in SparkR.
## What changes were proposed in this pull request?

dapplyCollect() applies an R function on each partition of a SparkDataFrame and collects the result back to R as a data.frame.
```
dapplyCollect(df, function(ldf) {...})
```

## How was this patch tested?
SparkR unit tests.

Author: Sun Rui <sunrui2016@gmail.com>

Closes #12989 from sun-rui/SPARK-15202.
2016-05-12 17:50:55 -07:00
Herman van Hovell bb1362eb3b [SPARK-10605][SQL] Create native collect_list/collect_set aggregates
## What changes were proposed in this pull request?
We currently use the Hive implementations for the collect_list/collect_set aggregate functions. This has a few major drawbacks: the use of HiveUDAF (which has quite a bit of overhead) and the lack of support for struct datatypes. This PR adds native implementation of these functions to Spark.

The size of the collected list/set may vary, this means we cannot use the fast, Tungsten, aggregation path to perform the aggregation, and that we fallback to the slower sort based path. Another big issue with these operators is that when the size of the collected list/set grows too large, we can start experiencing large GC pauzes and OOMEs.

This `collect*` aggregates implemented in this PR rely on the sort based aggregate path for correctness. They maintain their own internal buffer which holds the rows for one group at a time. The sortbased aggregation path is triggered by disabling `partialAggregation` for these aggregates (which is kinda funny); this technique is also employed in `org.apache.spark.sql.hiveHiveUDAFFunction`.

I have done some performance testing:
```scala
import org.apache.spark.sql.{Dataset, Row}

sql("create function collect_list2 as 'org.apache.hadoop.hive.ql.udf.generic.GenericUDAFCollectList'")

val df = range(0, 10000000).select($"id", (rand(213123L) * 100000).cast("int").as("grp"))
df.select(countDistinct($"grp")).show

def benchmark(name: String, plan: Dataset[Row], maxItr: Int = 5): Unit = {
   // Do not measure planning.
   plan1.queryExecution.executedPlan

   // Execute the plan a number of times and average the result.
   val start = System.nanoTime
   var i = 0
   while (i < maxItr) {
     plan.rdd.foreach(row => Unit)
     i += 1
   }
   val time = (System.nanoTime - start) / (maxItr * 1000000L)
   println(s"[$name] $maxItr iterations completed in an average time of $time ms.")
}

val plan1 = df.groupBy($"grp").agg(collect_list($"id"))
val plan2 = df.groupBy($"grp").agg(callUDF("collect_list2", $"id"))

benchmark("Spark collect_list", plan1)
...
> [Spark collect_list] 5 iterations completed in an average time of 3371 ms.

benchmark("Hive collect_list", plan2)
...
> [Hive collect_list] 5 iterations completed in an average time of 9109 ms.
```
Performance is improved by a factor 2-3.

## How was this patch tested?
Added tests to `DataFrameAggregateSuite`.

Author: Herman van Hovell <hvanhovell@questtec.nl>

Closes #12874 from hvanhovell/implode.
2016-05-12 13:56:00 -07:00
Takuya UESHIN a57aadae84 [SPARK-13902][SCHEDULER] Make DAGScheduler not to create duplicate stage.
## What changes were proposed in this pull request?

`DAGScheduler`sometimes generate incorrect stage graph.

Suppose you have the following DAG:

```
[A] <--(s_A)-- [B] <--(s_B)-- [C] <--(s_C)-- [D]
            \                /
              <-------------
```

Note: [] means an RDD, () means a shuffle dependency.

Here, RDD `B` has a shuffle dependency on RDD `A`, and RDD `C` has shuffle dependency on both `B` and `A`. The shuffle dependency IDs are numbers in the `DAGScheduler`, but to make the example easier to understand, let's call the shuffled data from `A` shuffle dependency ID `s_A` and the shuffled data from `B` shuffle dependency ID `s_B`.
The `getAncestorShuffleDependencies` method in `DAGScheduler` (incorrectly) does not check for duplicates when it's adding ShuffleDependencies to the parents data structure, so for this DAG, when `getAncestorShuffleDependencies` gets called on `C` (previous of the final RDD), `getAncestorShuffleDependencies` will return `s_A`, `s_B`, `s_A` (`s_A` gets added twice: once when the method "visit"s RDD `C`, and once when the method "visit"s RDD `B`). This is problematic because this line of code: https://github.com/apache/spark/blob/8ef3399/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L289 then generates a new shuffle stage for each dependency returned by `getAncestorShuffleDependencies`, resulting in duplicate map stages that compute the map output from RDD `A`.

As a result, `DAGScheduler` generates the following stages and their parents for each shuffle:

| | stage | parents |
|----|----|----|
| s_A | ShuffleMapStage 2 | List() |
| s_B | ShuffleMapStage 1 | List(ShuffleMapStage 0) |
| s_C | ShuffleMapStage 3 | List(ShuffleMapStage 1, ShuffleMapStage 2) |
| - | ResultStage 4 | List(ShuffleMapStage 3) |

The stage for s_A should be `ShuffleMapStage 0`, but the stage for `s_A` is generated twice as `ShuffleMapStage 2` and `ShuffleMapStage 0` is overwritten by `ShuffleMapStage 2`, and the stage `ShuffleMap Stage1` keeps referring the old stage `ShuffleMapStage 0`.

This patch is fixing it.

## How was this patch tested?

I added the sample RDD graph to show the illegal stage graph to `DAGSchedulerSuite`.

Author: Takuya UESHIN <ueshin@happy-camper.st>

Closes #12655 from ueshin/issues/SPARK-13902.
2016-05-12 12:36:18 -07:00
Brian O'Neill 81e3bfc16c [SPARK-14421] Upgrades protobuf dependency to 2.6.1 for the new version of KCL, and…
## What changes were proposed in this pull request?

When running with Kinesis Consumer Library (KCL), against a stream that contains aggregated data, the KCL needs access to protobuf to de-aggregate the records.   Without this patch, that results in the following error message:

```
   Caused by: java.lang.ClassNotFoundException: com.google.protobuf.ProtocolStringList
```

This PR upgrades the protobuf dependency within the kinesis-asl-assembly, and relocates that package (as not to conflict with Spark's use of 2.5.0), which fixes the above CNFE.

## How was this patch tested?

Used kinesis word count example against a stream containing aggregated data.

See: SPARK-14421

Author: Brian O'Neill <bone@alumni.brown.edu>

Closes #13054 from boneill42/protobuf-relocation-for-kcl.
2016-05-12 20:10:33 +01:00
bomeng 81bf870848 [SPARK-14897][SQL] upgrade to jetty 9.2.16
## What changes were proposed in this pull request?

Since Jetty 8 is EOL (end of life) and has critical security issue [http://www.securityweek.com/critical-vulnerability-found-jetty-web-server], I think upgrading to 9 is necessary. I am using latest 9.2 since 9.3 requires Java 8+.

`javax.servlet` and `derby` were also upgraded since Jetty 9.2 needs corresponding version.

## How was this patch tested?

Manual test and current test cases should cover it.

Author: bomeng <bmeng@us.ibm.com>

Closes #12916 from bomeng/SPARK-14897.
2016-05-12 20:07:44 +01:00
gatorsmile be617f3d06 [SPARK-14684][SPARK-15277][SQL] Partition Spec Validation in SessionCatalog and Checking Partition Spec Existence Before Dropping
#### What changes were proposed in this pull request?
~~Currently, multiple partitions are allowed to drop by using a single DDL command: Alter Table Drop Partition. However, the internal implementation could break atomicity. That means, we could just drop a subset of qualified partitions, if hitting an exception when dropping one of qualified partitions~~

~~This PR contains the following behavior changes:~~
~~- disallow dropping multiple partitions by a single command ~~
~~- allow users to input predicates in partition specification and issue a nicer error message if the predicate's comparison operator is not `=`.~~
~~- verify the partition spec in SessionCatalog. This can ensure each partition spec in `Drop Partition` does not correspond to multiple partitions.~~

This PR has two major parts:
- Verify the partition spec in SessionCatalog for fixing the following issue:
  ```scala
  sql(s"ALTER TABLE $externalTab DROP PARTITION (ds='2008-04-09', unknownCol='12')")
  ```
  Above example uses an invalid partition spec. Without this PR, we will drop all the partitions. The reason is Hive megastores getPartitions API returns all the partitions if we provide an invalid spec.

- Re-implemented the `dropPartitions` in `HiveClientImpl`. Now, we always check if all the user-specified partition specs exist before attempting to drop the partitions. Previously, we start drop the partition before completing checking the existence of all the partition specs. If any failure happened after we start to drop the partitions, we will log an error message to indicate which partitions have been dropped and which partitions have not been dropped.

#### How was this patch tested?
Modified the existing test cases and added new test cases.

Author: gatorsmile <gatorsmile@gmail.com>
Author: xiaoli <lixiao1983@gmail.com>
Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>

Closes #12801 from gatorsmile/banDropMultiPart.
2016-05-12 11:14:40 -07:00
Liang-Chi Hsieh 470de743ec [SPARK-15094][SPARK-14803][SQL] Remove extra Project added in EliminateSerialization
## What changes were proposed in this pull request?

We will eliminate the pair of `DeserializeToObject` and `SerializeFromObject` in `Optimizer` and add extra `Project`. However, when DeserializeToObject's outputObjectType is ObjectType and its cls can't be processed by unsafe project, it will be failed.

To fix it, we can simply remove the extra `Project` and replace the output attribute of `DeserializeToObject` in another rule.

## How was this patch tested?
`DatasetSuite`.

Author: Liang-Chi Hsieh <simonh@tw.ibm.com>

Closes #12926 from viirya/fix-eliminate-serialization-projection.
2016-05-12 10:11:12 -07:00
Sean Owen 5bb62b893b [BUILD] Test closing stale PRs
## What changes were proposed in this pull request?

Here I'm seeing if we can close stale PRs via a PR message, as I'd expect.
See thread https://www.mail-archive.com/devspark.apache.org/msg14149.html

Closes #9354
Closes #9451
Closes #10507
Closes #10486
Closes #10460
Closes #10967
Closes #10681
Closes #11766
Closes #9907
Closes #10209
Closes #10379
Closes #10403
Closes #10842
Closes #11036
Closes #13003
Closes #10887

## How was this patch tested?

(No changes)

Author: Sean Owen <sowen@cloudera.com>

Closes #13052 from srowen/TestClosingPRs.
2016-05-12 10:25:20 +01:00
Sean Zhong 33c6eb5218 [SPARK-15171][SQL] Deprecate registerTempTable and add dataset.createTempView
## What changes were proposed in this pull request?

Deprecates registerTempTable and add dataset.createTempView, dataset.createOrReplaceTempView.

## How was this patch tested?

Unit tests.

Author: Sean Zhong <seanzhong@databricks.com>

Closes #12945 from clockfly/spark-15171.
2016-05-12 15:51:53 +08:00
Holden Karau 5207a005cc [SPARK-15281][PYSPARK][ML][TRIVIAL] Add impurity param to GBTRegressor & add experimental inside of regression.py
## What changes were proposed in this pull request?

Add impurity param to  GBTRegressor and mark the of the models & regressors in regression.py as experimental to match Scaladoc.

## How was this patch tested?

Added default value to init, tested with unit/doc tests.

Author: Holden Karau <holden@us.ibm.com>

Closes #13071 from holdenk/SPARK-15281-GBTRegressor-impurity.
2016-05-12 09:19:27 +02:00
Wenchen Fan 46991448aa [SPARK-15160][SQL] support data source table in InMemoryCatalog
## What changes were proposed in this pull request?

This PR adds a new rule to convert `SimpleCatalogRelation` to data source table if its table property contains data source information.

## How was this patch tested?

new test in SQLQuerySuite

Author: Wenchen Fan <wenchen@databricks.com>

Closes #12935 from cloud-fan/ds-table.
2016-05-11 23:55:42 -07:00
Zheng RuiFeng 9e266d07a4 [SPARK-15031][SPARK-15134][EXAMPLE][DOC] Use SparkSession and update indent in examples
## What changes were proposed in this pull request?
1, Use `SparkSession` according to [SPARK-15031](https://issues.apache.org/jira/browse/SPARK-15031)
2, Update indent for `SparkContext` according to [SPARK-15134](https://issues.apache.org/jira/browse/SPARK-15134)
3, BTW, remove some duplicate space and add missing '.'

## How was this patch tested?
manual tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #13050 from zhengruifeng/use_sparksession.
2016-05-11 22:45:30 -07:00
Yin Huai ba5487c061 [SPARK-15072][SQL][PYSPARK][HOT-FIX] Remove SparkSession.withHiveSupport from readwrite.py
## What changes were proposed in this pull request?
Seems db573fc743 did not remove withHiveSupport from readwrite.py

Author: Yin Huai <yhuai@databricks.com>

Closes #13069 from yhuai/fixPython.
2016-05-11 21:43:56 -07:00
Cheng Lian f036dd7ce7 [SPARK-14346] SHOW CREATE TABLE for data source tables
## What changes were proposed in this pull request?

This PR adds native `SHOW CREATE TABLE` DDL command for data source tables. Support for Hive tables will be added in follow-up PR(s).

To show table creation DDL for data source tables created by CTAS statements, this PR also added partitioning and bucketing support for normal `CREATE TABLE ... USING ...` syntax.

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)

A new test suite `ShowCreateTableSuite` is added in sql/hive package to test the new feature.

Author: Cheng Lian <lian@databricks.com>

Closes #12781 from liancheng/spark-14346-show-create-table.
2016-05-11 20:44:04 -07:00
Sandeep Singh ff92eb2e80 [SPARK-15080][CORE] Break copyAndReset into copy and reset
## What changes were proposed in this pull request?
Break copyAndReset into two methods copy and reset instead of just one.

## How was this patch tested?
Existing Tests

Author: Sandeep Singh <sandeep@techaddict.me>

Closes #12936 from techaddict/SPARK-15080.
2016-05-12 11:12:09 +08:00
Sandeep Singh db573fc743 [SPARK-15072][SQL][PYSPARK] FollowUp: Remove SparkSession.withHiveSupport in PySpark
## What changes were proposed in this pull request?
This is a followup of https://github.com/apache/spark/pull/12851
Remove `SparkSession.withHiveSupport` in PySpark and instead use `SparkSession.builder. enableHiveSupport`

## How was this patch tested?
Existing tests.

Author: Sandeep Singh <sandeep@techaddict.me>

Closes #13063 from techaddict/SPARK-15072-followup.
2016-05-11 17:44:00 -07:00
Bill Chambers 603f4453a1 [SPARK-15264][SPARK-15274][SQL] CSV Reader Error on Blank Column Names
## What changes were proposed in this pull request?

When a CSV begins with:
- `,,`
OR
- `"","",`

meaning that the first column names are either empty or blank strings and `header` is specified to be `true`, then the column name is replaced with `C` + the index number of that given column. For example, if you were to read in the CSV:
```
"","second column"
"hello", "there"
```
Then column names would become `"C0", "second column"`.

This behavior aligns with what currently happens when `header` is specified to be `false` in recent versions of Spark.

### Current Behavior in Spark <=1.6
In Spark <=1.6, a CSV with a blank column name becomes a blank string, `""`, meaning that this column cannot be accessed. However the CSV reads in without issue.

### Current Behavior in Spark 2.0
Spark throws a NullPointerError and will not read in the file.

#### Reproduction in 2.0
https://databricks-prod-cloudfront.cloud.databricks.com/public/4027ec902e239c93eaaa8714f173bcfc/346304/2828750690305044/484361/latest.html

## How was this patch tested?
A new test was added to `CSVSuite` to account for this issue. We then have asserts that test for being able to select both the empty column names as well as the regular column names.

Author: Bill Chambers <bill@databricks.com>
Author: Bill Chambers <wchambers@ischool.berkeley.edu>

Closes #13041 from anabranch/master.
2016-05-11 17:42:13 -07:00
Andrew Or f14c4ba001 [SPARK-15276][SQL] CREATE TABLE with LOCATION should imply EXTERNAL
## What changes were proposed in this pull request?

Before:
```sql
-- uses that location but issues a warning
CREATE TABLE my_tab LOCATION /some/path
-- deletes any existing data in the specified location
DROP TABLE my_tab
```

After:
```sql
-- uses that location but creates an EXTERNAL table instead
CREATE TABLE my_tab LOCATION /some/path
-- does not delete the data at /some/path
DROP TABLE my_tab
```

This patch essentially makes the `EXTERNAL` field optional. This is related to #13032.

## How was this patch tested?

New test in `DDLCommandSuite`.

Author: Andrew Or <andrew@databricks.com>

Closes #13060 from andrewor14/location-implies-external.
2016-05-11 17:29:58 -07:00
Nicholas Chammas b9cf617a6f [SPARK-15256] [SQL] [PySpark] Clarify DataFrameReader.jdbc() docstring
This PR:
* Corrects the documentation for the `properties` parameter, which is supposed to be a dictionary and not a list.
* Generally clarifies the Python docstring for DataFrameReader.jdbc() by pulling from the [Scala docstrings](b281377647/sql/core/src/main/scala/org/apache/spark/sql/DataFrameReader.scala (L201-L251)) and rephrasing things.
* Corrects minor Sphinx typos.

Author: Nicholas Chammas <nicholas.chammas@gmail.com>

Closes #13034 from nchammas/SPARK-15256.
2016-05-11 15:31:16 -07:00
Andrew Or 8881765ac7 [SPARK-15257][SQL] Require CREATE EXTERNAL TABLE to specify LOCATION
## What changes were proposed in this pull request?

Before:
```sql
-- uses warehouse dir anyway
CREATE EXTERNAL TABLE my_tab
-- doesn't actually delete the data
DROP TABLE my_tab
```
After:
```sql
-- no location is provided, throws exception
CREATE EXTERNAL TABLE my_tab
-- creates an external table using that location
CREATE EXTERNAL TABLE my_tab LOCATION '/path/to/something'
-- doesn't delete the data, which is expected
DROP TABLE my_tab
```

## How was this patch tested?

New test in `DDLCommandSuite`

Author: Andrew Or <andrew@databricks.com>

Closes #13032 from andrewor14/create-external-table-location.
2016-05-11 15:30:53 -07:00
Reynold Xin 40ba87f769 [SPARK-15278] [SQL] Remove experimental tag from Python DataFrame
## What changes were proposed in this pull request?
Earlier we removed experimental tag for Scala/Java DataFrames, but haven't done so for Python. This patch removes the experimental flag for Python and declares them stable.

## How was this patch tested?
N/A.

Author: Reynold Xin <rxin@databricks.com>

Closes #13062 from rxin/SPARK-15278.
2016-05-11 15:12:27 -07:00