Commit graph

18830 commits

Author SHA1 Message Date
Yanbo Liang 3ef6d98a80 [SPARK-17847][ML] Reduce shuffled data size of GaussianMixture & copy the implementation from mllib to ml
## What changes were proposed in this pull request?

Copy `GaussianMixture` implementation from mllib to ml, then we can add new features to it.
I left mllib `GaussianMixture` untouched rather than wrapping the ml implementation (as some other algorithms do), for the following reasons:
- mllib `GaussianMixture` allows k == 1, but ml does not.
- mllib `GaussianMixture` supports setting an initial model, but ml does not support it currently. (We will definitely add this feature for ml in the future.)

We could get around these issues by making mllib a wrapper calling into ml, but I'd prefer to leave mllib untouched, which keeps ml clean.

Meanwhile, there is a big performance improvement for `GaussianMixture` in this PR. Since the covariance matrix of a multivariate Gaussian distribution is symmetric, we can store only the upper triangular part of the matrix, which greatly reduces the shuffled data size. In my test, this change reduced the shuffled data size by about 50% and accelerated job execution.
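
For illustration, a minimal standalone sketch of the packing idea (not the actual ml code; helper names are made up):

```scala
// Sketch: a symmetric d x d covariance matrix can be shuffled as only its
// upper triangular part, i.e. d * (d + 1) / 2 values instead of d * d.
def packUpperTriangular(cov: Array[Array[Double]]): Array[Double] = {
  val d = cov.length
  val packed = new Array[Double](d * (d + 1) / 2)
  var idx = 0
  for (i <- 0 until d; j <- i until d) {
    packed(idx) = cov(i)(j)
    idx += 1
  }
  packed
}

def unpackUpperTriangular(packed: Array[Double], d: Int): Array[Array[Double]] = {
  val cov = Array.ofDim[Double](d, d)
  var idx = 0
  for (i <- 0 until d; j <- i until d) {
    cov(i)(j) = packed(idx)
    cov(j)(i) = packed(idx)   // restore symmetry on the receiving side
    idx += 1
  }
  cov
}
```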

Before this PR:
![image](https://cloud.githubusercontent.com/assets/1962026/19641622/4bb017ac-9996-11e6-8ece-83db184b620a.png)
After this PR:
![image](https://cloud.githubusercontent.com/assets/1962026/19641635/629c21fe-9996-11e6-91e9-83ab74ae0126.png)
## How was this patch tested?

Existing tests and added new tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #15413 from yanboliang/spark-17847.
2017-01-09 21:38:46 -08:00
Burak Yavuz faabe69cc0 [SPARK-18952] Regex strings not properly escaped in codegen for aggregations
## What changes were proposed in this pull request?

If I use the function `regexp_extract` and my regex string contains `\`, i.e. the escape character, codegen fails because the `\` character is not properly escaped in the generated code.

Example stack trace:
```
/* 059 */     private int maxSteps = 2;
/* 060 */     private int numRows = 0;
/* 061 */     private org.apache.spark.sql.types.StructType keySchema = new org.apache.spark.sql.types.StructType().add("date_format(window#325.start, yyyy-MM-dd HH:mm)", org.apache.spark.sql.types.DataTypes.StringType)
/* 062 */     .add("regexp_extract(source#310.description, ([a-zA-Z]+)\[.*, 1)", org.apache.spark.sql.types.DataTypes.StringType);
/* 063 */     private org.apache.spark.sql.types.StructType valueSchema = new org.apache.spark.sql.types.StructType().add("sum", org.apache.spark.sql.types.DataTypes.LongType);
/* 064 */     private Object emptyVBase;

...

org.codehaus.commons.compiler.CompileException: File 'generated.java', Line 62, Column 58: Invalid escape sequence
	at org.codehaus.janino.Scanner.scanLiteralCharacter(Scanner.java:918)
	at org.codehaus.janino.Scanner.produce(Scanner.java:604)
	at org.codehaus.janino.Parser.peekRead(Parser.java:3239)
	at org.codehaus.janino.Parser.parseArguments(Parser.java:3055)
	at org.codehaus.janino.Parser.parseSelector(Parser.java:2914)
	at org.codehaus.janino.Parser.parseUnaryExpression(Parser.java:2617)
	at org.codehaus.janino.Parser.parseMultiplicativeExpression(Parser.java:2573)
	at org.codehaus.janino.Parser.parseAdditiveExpression(Parser.java:2552)
```

In the codegen'd expression, the literal should use `\\` instead of `\`.

A similar problem was solved here: https://github.com/apache/spark/pull/15156.
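
For illustration, a minimal standalone sketch of the kind of escaping required when embedding such a string in generated Java source (the helper name is made up; the actual fix lives in the codegen path):

```scala
// Made-up helper: backslashes and quotes in a value must be escaped again
// before the value is embedded in generated Java source as a string literal.
def toJavaStringLiteral(s: String): String = {
  val escaped = s
    .replace("\\", "\\\\")  // escape backslashes first
    .replace("\"", "\\\"")  // then double quotes
  "\"" + escaped + "\""
}

// toJavaStringLiteral("""([a-zA-Z]+)\[.*""") produces "([a-zA-Z]+)\\[.*"
// (with surrounding quotes), which compiles as a valid Java literal.
```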

## How was this patch tested?

Regression test in `DataFrameAggregationSuite`

Author: Burak Yavuz <brkyvz@gmail.com>

Closes #16361 from brkyvz/reg-break.
2017-01-09 14:25:38 -08:00
Zhenhua Wang 15c2bd01b0 [SPARK-19020][SQL] Cardinality estimation of aggregate operator
## What changes were proposed in this pull request?

Support cardinality estimation of aggregate operator

## How was this patch tested?

Add test cases

Author: Zhenhua Wang <wzh_zju@163.com>
Author: wangzhenhua <wangzhenhua@huawei.com>

Closes #16431 from wzhfy/aggEstimation.
2017-01-09 11:29:42 -08:00
Zhenhua Wang 3ccabdfb4d [SPARK-17077][SQL] Cardinality estimation for project operator
## What changes were proposed in this pull request?

Support cardinality estimation for project operator.

## How was this patch tested?

Add a test suite and a base class in the catalyst package.

Author: Zhenhua Wang <wzh_zju@163.com>

Closes #16430 from wzhfy/projectEstimation.
2017-01-08 21:15:52 -08:00
anabranch 19d9d4c855 [SPARK-19126][DOCS] Update Join Documentation Across Languages
## What changes were proposed in this pull request?

- [X] Make sure all join types are clearly mentioned
- [X] Make join labeling/style consistent
- [X] Make join label ordering docs the same
- [X] Improve join documentation according to above for Scala
- [X] Improve join documentation according to above for Python
- [X] Improve join documentation according to above for R

## How was this patch tested?
No tests b/c docs.

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: anabranch <wac.chambers@gmail.com>

Closes #16504 from anabranch/SPARK-19126.
2017-01-08 20:37:46 -08:00
anabranch 1f6ded6455 [SPARK-19127][DOCS] Update Rank Function Documentation
## What changes were proposed in this pull request?

- [X] Fix inconsistencies in function reference for dense rank and dense
- [X] Make all languages equivalent in their reference to `dense_rank` and `rank`.

## How was this patch tested?

N/A for docs.

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: anabranch <wac.chambers@gmail.com>

Closes #16505 from anabranch/SPARK-19127.
2017-01-08 17:53:53 -08:00
Dilip Biswal 4351e62207 [SPARK-19093][SQL] Cached tables are not used in SubqueryExpression
## What changes were proposed in this pull request?
Consider the plans inside subquery expressions while looking up the cache manager to make use of cached data. Currently CacheManager.useCachedData does not consider the subquery expressions in the plan.

SQL
```
select * from rows where not exists (select * from rows)
```
Before the fix
```
== Optimized Logical Plan ==
Join LeftAnti
:- InMemoryRelation [_1#3775, _2#3776], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
:     +- *FileScan parquet [_1#3775,_2#3776] Batched: true, Format: Parquet, Location: InMemoryFileIndex[dbfs:/tmp/rows], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<_1:string,_2:string>
+- Project [_1#3775 AS _1#3775#4001, _2#3776 AS _2#3776#4002]
   +- Relation[_1#3775,_2#3776] parquet
```

After
```
== Optimized Logical Plan ==
Join LeftAnti
:- InMemoryRelation [_1#256, _2#257], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
:     +- *FileScan parquet [_1#256,_2#257] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/tmp/rows], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<_1:string,_2:string>
+- Project [_1#256 AS _1#256#298, _2#257 AS _2#257#299]
   +- InMemoryRelation [_1#256, _2#257], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas)
         +- *FileScan parquet [_1#256,_2#257] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/tmp/rows], PartitionFilters: [], PushedFilters: [], ReadSchema: struct<_1:string,_2:string>
```

Query2
```
 SELECT * FROM t1
 WHERE
 c1 IN (SELECT c1 FROM t2 WHERE c1 IN (SELECT c1 FROM t3 WHERE c1 = 1))
```
Before
```
== Analyzed Logical Plan ==
c1: int
Project [c1#3]
+- Filter predicate-subquery#47 [(c1#3 = c1#10)]
   :  +- Project [c1#10]
   :     +- Filter predicate-subquery#46 [(c1#10 = c1#17)]
   :        :  +- Project [c1#17]
   :        :     +- Filter (c1#17 = 1)
   :        :        +- SubqueryAlias t3, `t3`
   :        :           +- Project [value#15 AS c1#17]
   :        :              +- LocalRelation [value#15]
   :        +- SubqueryAlias t2, `t2`
   :           +- Project [value#8 AS c1#10]
   :              +- LocalRelation [value#8]
   +- SubqueryAlias t1, `t1`
      +- Project [value#1 AS c1#3]
         +- LocalRelation [value#1]

== Optimized Logical Plan ==
Join LeftSemi, (c1#3 = c1#10)
:- InMemoryRelation [c1#3], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas), t1
:     +- LocalTableScan [c1#3]
+- Project [value#8 AS c1#10]
   +- Join LeftSemi, (value#8 = c1#17)
      :- LocalRelation [value#8]
      +- Project [value#15 AS c1#17]
         +- Filter (value#15 = 1)
            +- LocalRelation [value#15]

```
After
```
== Analyzed Logical Plan ==
c1: int
Project [c1#3]
+- Filter predicate-subquery#47 [(c1#3 = c1#10)]
   :  +- Project [c1#10]
   :     +- Filter predicate-subquery#46 [(c1#10 = c1#17)]
   :        :  +- Project [c1#17]
   :        :     +- Filter (c1#17 = 1)
   :        :        +- SubqueryAlias t3, `t3`
   :        :           +- Project [value#15 AS c1#17]
   :        :              +- LocalRelation [value#15]
   :        +- SubqueryAlias t2, `t2`
   :           +- Project [value#8 AS c1#10]
   :              +- LocalRelation [value#8]
   +- SubqueryAlias t1, `t1`
      +- Project [value#1 AS c1#3]
         +- LocalRelation [value#1]

== Optimized Logical Plan ==
Join LeftSemi, (c1#3 = c1#10)
:- InMemoryRelation [c1#3], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas), t1
:     +- LocalTableScan [c1#3]
+- Join LeftSemi, (c1#10 = c1#17)
   :- InMemoryRelation [c1#10], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas), t2
   :     +- LocalTableScan [c1#10]
   +- Filter (c1#17 = 1)
      +- InMemoryRelation [c1#17], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas), t1
            +- LocalTableScan [c1#3]
```
## How was this patch tested?
Added new tests in CachedTableSuite.

Author: Dilip Biswal <dbiswal@us.ibm.com>

Closes #16493 from dilipbiswal/SPARK-19093.
2017-01-08 23:09:07 +01:00
zuotingbing cd1d00adaf
[SPARK-19026] SPARK_LOCAL_DIRS(multiple directories on different disks) cannot be deleted
JIRA Issue: https://issues.apache.org/jira/browse/SPARK-19026

SPARK_LOCAL_DIRS (Standalone) can be a comma-separated list of multiple directories on different disks, e.g. SPARK_LOCAL_DIRS=/dir1,/dir2,/dir3. If there is an IOException when creating a sub-directory on dir3, the sub-directories that have already been created successfully on dir1 and dir2 can no longer be deleted when the application finishes.
So we should catch the IOException at Utils.createDirectory; otherwise the variable "appDirectories(appId)", which the function maybeCleanupApplication relies on, will not be set, and dir1 and dir2 will not be cleaned up.
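
For illustration, a minimal standalone sketch of the intended behaviour (names are made up, not the actual Worker code):

```scala
import java.io.{File, IOException}

// Made-up sketch: create an app-specific sub-directory under each local root,
// keeping the ones that succeed so they can be deleted when the app finishes,
// instead of letting a single IOException discard the whole list.
def createAppDirs(localRoots: Seq[String], appId: String): Seq[File] = {
  localRoots.flatMap { root =>
    try {
      val dir = new File(root, appId)
      if (!dir.exists() && !dir.mkdirs()) {
        throw new IOException(s"Failed to create directory $dir")
      }
      Some(dir)
    } catch {
      case e: IOException =>
        println(s"Skipping $root: ${e.getMessage}") // log and keep the others
        None
    }
  }
}
```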

Author: zuotingbing <zuo.tingbing9@zte.com.cn>

Closes #16439 from zuotingbing/master.
2017-01-08 09:29:01 +00:00
Yanbo Liang 6b6b555a1e [SPARK-18862][SPARKR][ML] Split SparkR mllib.R into multiple files
## What changes were proposed in this pull request?
SparkR ```mllib.R``` is getting bigger as we add more ML wrappers, so I'd like to split it into multiple files to make it easier to maintain:
* mllib_classification.R
* mllib_clustering.R
* mllib_recommendation.R
* mllib_regression.R
* mllib_stat.R
* mllib_tree.R
* mllib_utils.R

Note: Only reorg, no actual code change.

## How was this patch tested?
Existing tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #16312 from yanboliang/spark-18862.
2017-01-08 01:10:36 -08:00
Dongjoon Hyun 923e594844 [SPARK-18941][SQL][DOC] Add a new behavior document on CREATE/DROP TABLE with LOCATION
## What changes were proposed in this pull request?

This PR adds a new behavior change description on `CREATE TABLE ... LOCATION` at `sql-programming-guide.md` clearly under `Upgrading From Spark SQL 1.6 to 2.0`. This change is introduced at Apache Spark 2.0.0 as [SPARK-15276](https://issues.apache.org/jira/browse/SPARK-15276).

## How was this patch tested?

```
SKIP_API=1 jekyll build
```

**Newly Added Description**
<img width="913" alt="new" src="https://cloud.githubusercontent.com/assets/9700541/21743606/7efe2b12-d4ba-11e6-8a0d-551222718ea2.png">

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #16400 from dongjoon-hyun/SPARK-18941.
2017-01-07 18:55:01 -08:00
Sean Owen 54138f6e89
[SPARK-19106][DOCS] Styling for the configuration docs is broken
## What changes were proposed in this pull request?

configuration.html section headings were not specified correctly in markdown and weren't being rendered or recognized correctly. Removed extra p tags and pulled level 4 titles up to level 3, since level 3 had been skipped. This improves the TOC.

## How was this patch tested?

Doc build, manual check.

Author: Sean Owen <sowen@cloudera.com>

Closes #16490 from srowen/SPARK-19106.
2017-01-07 19:15:51 +00:00
wm624@hotmail.com 036b50347c [SPARK-19110][ML][MLLIB] DistributedLDAModel returns different logPrior for original and loaded model
## What changes were proposed in this pull request?

While adding the DistributedLDAModel training summary for SparkR, I found that the logPrior for the original and the loaded model is different.
For example, in the test("read/write DistributedLDAModel"), I added the test:
val logPrior = model.asInstanceOf[DistributedLDAModel].logPrior
val logPrior2 = model2.asInstanceOf[DistributedLDAModel].logPrior
assert(logPrior === logPrior2)
The test fails:
-4.394180878889078 did not equal -4.294290536919573

The reason is that `graph.vertices.aggregate(0.0)(seqOp, _ + _)` only returns the value of a single vertex instead of the aggregation of all vertices. Therefore, when the loaded model does the aggregation in a different order, it returns a different `logPrior`.

Please refer to #16464 for details.
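
For illustration, a standalone sketch of the `RDD.aggregate` pitfall (not the LDA code; assumes a `SparkContext` named `sc`, e.g. in spark-shell):

```scala
// Standalone sketch of the aggregate pitfall (not the LDA code).
val values = sc.parallelize(Seq(1.0, 2.0, 3.0, 4.0), numSlices = 2)

// Buggy: the seqOp drops the accumulator, so each partition contributes only
// its last element and the result depends on partitioning/ordering.
val wrong = values.aggregate(0.0)((acc, v) => v, _ + _)

// Correct: the seqOp folds every element into the accumulator.
val right = values.aggregate(0.0)((acc, v) => acc + v, _ + _)   // 10.0
```
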
## How was this patch tested?
Add a new unit test for testing logPrior.

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes #16491 from wangmiao1981/ldabug.
2017-01-07 11:07:49 -08:00
Wenchen Fan b3d39620c5 [SPARK-19085][SQL] cleanup OutputWriterFactory and OutputWriter
## What changes were proposed in this pull request?

`OutputWriterFactory`/`OutputWriter` are internal interfaces and we can remove some unnecessary APIs:
1. `OutputWriterFactory.newWriter(path: String)`: no one calls it and no one implements it.
2. `OutputWriter.write(row: Row)`: during execution we only call `writeInternal`, which is weird as `OutputWriter` is already an internal interface. We should rename `writeInternal` to `write`, and remove `def write(row: Row)` and its related converter code. All implementations should just implement `def write(row: InternalRow)`.

## How was this patch tested?

existing tests.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #16479 from cloud-fan/hive-writer.
2017-01-08 00:42:09 +08:00
Yanbo Liang cdda3372a3
[MINOR] Bump R version to 2.2.0.
## What changes were proposed in this pull request?
#16126 bumps the master branch version to 2.2.0-SNAPSHOT, but it seems the R version was omitted.

## How was this patch tested?
N/A

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #16488 from yanboliang/r-version.
2017-01-07 14:33:17 +00:00
hyukjinkwon 68ea290b3a
[SPARK-13748][PYSPARK][DOC] Add the description for explicitly setting None for a named argument for a Row
## What changes were proposed in this pull request?

It seems it is allowed to not set a key and value in a dict to represent that the value is `None` or missing, as below:

``` python
spark.createDataFrame([{"x": 1}, {"y": 2}]).show()
```

```
+----+----+
|   x|   y|
+----+----+
|   1|null|
|null|   2|
+----+----+
```

However, it seems this is not the case for `Row`, as below:

``` python
spark.createDataFrame([Row(x=1), Row(y=2)]).show()
```

``` scala
16/06/19 16:25:56 ERROR Executor: Exception in task 6.0 in stage 66.0 (TID 316)
java.lang.IllegalStateException: Input row doesn't have expected number of values required by the schema. 2 fields are required while 1 values are provided.
    at org.apache.spark.sql.execution.python.EvaluatePython$.fromJava(EvaluatePython.scala:147)
    at org.apache.spark.sql.SparkSession$$anonfun$7.apply(SparkSession.scala:656)
    at org.apache.spark.sql.SparkSession$$anonfun$7.apply(SparkSession.scala:656)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:247)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$4.apply(SparkPlan.scala:240)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:780)
```

The behaviour seems right, but it might confuse users, just as reported in this JIRA.

This PR adds the explanation for `Row` class.
## How was this patch tested?

N/A

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #13771 from HyukjinKwon/SPARK-13748.
2017-01-07 12:52:41 +00:00
sueann d60f6f62d0 [SPARK-18194][ML] Log instrumentation in OneVsRest, CrossValidator, TrainValidationSplit
## What changes were proposed in this pull request?

Added instrumentation logging for OneVsRest classifier, CrossValidator, TrainValidationSplit fit() functions.

## How was this patch tested?

Ran unit tests and checked the log file (see output in comments).

Author: sueann <sueann@databricks.com>

Closes #16480 from sueann/SPARK-18194.
2017-01-06 18:53:16 -08:00
Tathagata Das b59cddaba0 [SPARK-19074][SS][DOCS] Updated Structured Streaming Programming Guide for update mode and source/sink options
## What changes were proposed in this pull request?

Updates
- Updated the Late Data Handling section by adding a figure for Update Mode. It's more intuitive to explain late data handling with Update Mode, so I added the new figure before the Append Mode figure.
- Updated Output Modes section with Update mode
- Added options for all the sources and sinks

---------------------------
---------------------------

![image](https://cloud.githubusercontent.com/assets/663212/21665176/f150b224-d29f-11e6-8372-14d32da21db9.png)

---------------------------
---------------------------
<img width="931" alt="screen shot 2017-01-03 at 6 09 11 pm" src="https://cloud.githubusercontent.com/assets/663212/21629740/d21c9bb8-d1df-11e6-915b-488a59589fa6.png">
<img width="933" alt="screen shot 2017-01-03 at 6 10 00 pm" src="https://cloud.githubusercontent.com/assets/663212/21629749/e22bdabe-d1df-11e6-86d3-7e51d2f28dbc.png">

---------------------------
---------------------------
![image](https://cloud.githubusercontent.com/assets/663212/21665200/108e18fc-d2a0-11e6-8640-af598cab090b.png)
![image](https://cloud.githubusercontent.com/assets/663212/21665148/cfe414fa-d29f-11e6-9baa-4124ccbab093.png)
![image](https://cloud.githubusercontent.com/assets/663212/21665226/2e8f39e4-d2a0-11e6-85b1-7657e2df5491.png)

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #16468 from tdas/SPARK-19074.
2017-01-06 11:29:01 -08:00
zuotingbing a9a137377e [SPARK-19083] sbin/start-history-server.sh script use of $@ without quotes
JIRA Issue: https://issues.apache.org/jira/browse/SPARK-19083#

The sbin/start-history-server.sh script uses $@ without quotes; this affects the length of args used in HistoryServerArguments::parse(args: List[String]).

Author: zuotingbing <zuo.tingbing9@zte.com.cn>

Closes #16484 from zuotingbing/sh.
2017-01-06 09:57:49 -08:00
Kay Ousterhout 2e139eed31 [SPARK-17931] Eliminate unnecessary task (de) serialization
In the existing code, there are three layers of serialization involved in sending a task from the scheduler to an executor:
- A Task object is serialized
- The Task object is copied to a byte buffer that also contains serialized information about any additional JARs, files, and Properties needed for the task to execute. This byte buffer is stored as the member variable serializedTask in the TaskDescription class.
- The TaskDescription is serialized (in addition to the serialized task + JARs, the TaskDescription class contains the task ID and other metadata) and sent in a LaunchTask message.

While it *is* necessary to have two layers of serialization, so that
the JAR, file, and Property info can be deserialized prior to
deserializing the Task object, the third layer of deserialization is
unnecessary.  This commit eliminates a layer of serialization by moving
the JARs, files, and Properties into the TaskDescription class.

This commit also serializes the Properties manually (by traversing the map),
as is done with the JARs and files, which reduces the final serialized size.
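
For illustration, a minimal standalone sketch of what manual serialization of a string map looks like (helper names are made up, not the actual TaskDescription code):

```scala
import java.io.{DataInputStream, DataOutputStream}

// Made-up sketch: write a string-to-string map by traversing it, instead of
// relying on Java's default object serialization, which adds class metadata.
def writeStringMap(map: Map[String, String], out: DataOutputStream): Unit = {
  out.writeInt(map.size)
  map.foreach { case (k, v) =>
    out.writeUTF(k)
    out.writeUTF(v)
  }
}

def readStringMap(in: DataInputStream): Map[String, String] = {
  val n = in.readInt()
  (0 until n).map(_ => in.readUTF() -> in.readUTF()).toMap
}
```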

Unit tests

This is a simpler alternative to the approach proposed in #15505.

shivaram and I did some benchmarking of this and #15505 on 20 m2.4xlarge EC2 machines (160 cores). We ran ~30 trials of code [1] (a very simple job with 10K tasks per stage) and measured the average time per stage:

Before this change: 2490ms
With this change: 2345 ms (so ~6% improvement over the baseline)
With witgo's approach in #15505: 2046 ms (~18% improvement over baseline)

The reason that #15505 has a more significant improvement is that it also moves the serialization from the TaskSchedulerImpl thread to the CoarseGrainedSchedulerBackend thread. I added that functionality on top of this change, and got almost the same improvement [1] as #15505 (average of 2103ms). I think we should decouple these two changes, both so we have some record of the improvement from each individual change, and because this change is more about simplifying the code base (the improvement is negligible) while the other is about performance improvement. The plan, currently, is to merge this PR and then merge the remaining part of #15505 that moves serialization.

[1] The reason the improvement wasn't quite as good as with #15505 when we ran the benchmarks is almost certainly because, at the point when we ran the benchmarks, I hadn't updated the code to manually serialize the Properties (instead the code was using Java's default serialization for the Properties object, whereas #15505 manually serialized the Properties).  This PR has since been updated to manually serialize the Properties, just like the other maps.

Author: Kay Ousterhout <kayousterhout@gmail.com>

Closes #16053 from kayousterhout/SPARK-17931.
2017-01-06 10:48:08 -06:00
jerryshao 4a4c3dc9ca [SPARK-19033][CORE] Add admin acls for history server
## What changes were proposed in this pull request?

Currently the HistoryServer's ACLs are derived from the application event log, which means newly changed ACLs cannot be applied to old data. This becomes a problem where a newly added admin cannot access the old application history UI; only new applications are affected.

So here I propose to add admin ACLs for the history server: any configured user/group will have view access to all applications, while the view ACLs derived from the application run-time still take effect.

## How was this patch tested?

Unit test added.

Author: jerryshao <sshao@hortonworks.com>

Closes #16470 from jerryshao/SPARK-19033.
2017-01-06 10:07:54 -06:00
Michal Senkyr 903bb8e8a2 [SPARK-16792][SQL] Dataset containing a Case Class with a List type causes a CompileException (converting sequence to list)
## What changes were proposed in this pull request?

Added a `to` call at the end of the code generated by `ScalaReflection.deserializerFor` when the requested type is not a supertype of `WrappedArray[_]`; the call uses `CanBuildFrom[_, _, _]` to convert the result into an arbitrary subtype of `Seq[_]`.

Care was taken to preserve the original deserialization path where possible, to avoid the overhead of conversion in cases where it is not needed.

`ScalaReflection.serializerFor` could already be used to serialize any `Seq[_]`, so it was not altered.

`SQLImplicits` had to be altered and new implicit encoders added to permit serialization of other sequence types.

Also fixes [SPARK-16815] Dataset[List[T]] leads to ArrayStoreException
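
For illustration, a standalone sketch of the `to`/`CanBuildFrom` mechanism on plain Scala 2.11/2.12 collections (not the generated deserializer code):

```scala
import scala.collection.generic.CanBuildFrom
import scala.collection.immutable.Queue

// Plain-collections sketch of the mechanism: `to[C]` uses an implicit
// CanBuildFrom to rebuild a generic Seq as the concrete sequence type
// requested by the case class field.
def rebuildAs[A, C[_]](xs: Seq[A])(implicit cbf: CanBuildFrom[Nothing, A, C[A]]): C[A] =
  xs.to[C]

val asList: List[String]   = rebuildAs[String, List](Seq("D", "S", "H"))
val asQueue: Queue[String] = rebuildAs[String, Queue](Seq("F", "H"))
```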

## How was this patch tested?
```bash
./build/mvn -DskipTests clean package && ./dev/run-tests
```

Also manual execution of the following sets of commands in the Spark shell:
```scala
case class TestCC(key: Int, letters: List[String])

val ds1 = sc.makeRDD(Seq(
(List("D")),
(List("S","H")),
(List("F","H")),
(List("D","L","L"))
)).map(x=>(x.length,x)).toDF("key","letters").as[TestCC]

val test1=ds1.map{_.key}
test1.show
```

```scala
case class X(l: List[String])
spark.createDataset(Seq(List("A"))).map(X).show
```

```scala
spark.sqlContext.createDataset(sc.parallelize(List(1) :: Nil)).collect
```

After adding arbitrary sequence support also tested with the following commands:

```scala
case class QueueClass(q: scala.collection.immutable.Queue[Int])

spark.createDataset(Seq(List(1,2,3))).map(x => QueueClass(scala.collection.immutable.Queue(x: _*))).map(_.q.dequeue).collect
```

Author: Michal Senkyr <mike.senkyr@gmail.com>

Closes #16240 from michalsenkyr/sql-caseclass-list-fix.
2017-01-06 15:05:20 +08:00
Kevin Yu bcc510b021 [SPARK-18871][SQL] New test cases for IN/NOT IN subquery
## What changes were proposed in this pull request?
This PR extends the existing IN/NOT IN subquery test case coverage, adding more test cases to the IN subquery test suite.

Based on the discussion, we will create a `subquery/in-subquery` sub-structure under the `sql/core/src/test/resources/sql-tests/inputs` directory.

This is the high level grouping for IN subquery:

`subquery/in-subquery/`
`subquery/in-subquery/simple-in.sql`
`subquery/in-subquery/in-group-by.sql (in parent side, subquery, and both)`
`subquery/in-subquery/not-in-group-by.sql`
`subquery/in-subquery/in-order-by.sql`
`subquery/in-subquery/in-limit.sql`
`subquery/in-subquery/in-having.sql`
`subquery/in-subquery/in-joins.sql`
`subquery/in-subquery/not-in-joins.sql`
`subquery/in-subquery/in-set-operations.sql`
`subquery/in-subquery/in-with-cte.sql`
`subquery/in-subquery/not-in-with-cte.sql`
`subquery/in-subquery/in-multiple-columns.sql`

We will deliver it through multiple PRs. This is the first PR for the IN subquery; it has:

`subquery/in-subquery/simple-in.sql`
`subquery/in-subquery/in-group-by.sql (in parent side, subquery, and both)`

These are the results from running on DB2.
[Modified test file of in-group-by.sql used to run on DB2](https://github.com/apache/spark/files/683367/in-group-by.sql.db2.txt)
[Output of the run result on DB2](https://github.com/apache/spark/files/683362/in-group-by.sql.db2.out.txt)
[Modified test file of simple-in.sql used to run on DB2](https://github.com/apache/spark/files/683378/simple-in.sql.db2.txt)
[Output of the run result on DB2](https://github.com/apache/spark/files/683379/simple-in.sql.db2.out.txt)

## How was this patch tested?

This patch is adding tests.

Author: Kevin Yu <qyu@us.ibm.com>

Closes #16337 from kevinyu98/spark-18871.
2017-01-05 19:00:39 -08:00
Yanbo Liang dfc4c935ba [MINOR] Correct LogisticRegression test case for probability2prediction.
## What changes were proposed in this pull request?
Set correct column names for ```force to use probability2prediction``` in ```LogisticRegressionSuite```.

## How was this patch tested?
Change unit test.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #16477 from yanboliang/lor-pred.
2017-01-05 18:59:49 -08:00
Wenchen Fan cca945b6aa [SPARK-18885][SQL] unify CREATE TABLE syntax for data source and hive serde tables
## What changes were proposed in this pull request?

Today we have different syntax to create data source and hive serde tables; we should unify them to avoid confusing users and as a step toward making hive a data source.

Please read https://issues.apache.org/jira/secure/attachment/12843835/CREATE-TABLE.pdf for  details.

TODO(for follow-up PRs):
1. TBLPROPERTIES is not added to the new syntax, we should decide if we wanna add it later.
2. `SHOW CREATE TABLE` should be updated to use the new syntax.
3. we should decide if we wanna change the behavior of `SET LOCATION`.

## How was this patch tested?

new tests

Author: Wenchen Fan <wenchen@databricks.com>

Closes #16296 from cloud-fan/create-table.
2017-01-05 17:40:27 -08:00
Rui Li f5d18af6a8 [SPARK-14958][CORE] Failed task not handled when there's error deserializing failure reason
## What changes were proposed in this pull request?

TaskResultGetter tries to deserialize the TaskEndReason before handling the failed task. If an error is thrown during deserialization, the failed task won't be handled, which leaves the job hanging.
The PR proposes to handle the failed task in a finally block.
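
For illustration, a minimal standalone sketch of the pattern (all names are made up, not the actual TaskResultGetter code):

```scala
// Made-up, self-contained sketch: even if deserializing the failure reason
// throws, the scheduler must still be told that the task failed.
def handleFailedTask(tid: Long, reason: String): Unit =
  println(s"task $tid failed: $reason")            // stand-in for the scheduler call

def enqueueFailedTask(tid: Long, deserialize: () => String): Unit = {
  var reason = "Unknown reason"
  try {
    reason = deserialize()                         // may throw, e.g. NoClassDefFoundError
  } catch {
    case t: Throwable => println(s"Could not deserialize TaskEndReason: $t")
  } finally {
    handleFailedTask(tid, reason)                  // always runs, so the job cannot hang
  }
}

enqueueFailedTask(42L, () => throw new NoClassDefFoundError("SomeUserClass"))
```
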
## How was this patch tested?

In my case I hit a NoClassDefFoundError and the job hangs. Manually verified the patch can fix it.

Author: Rui Li <rui.li@intel.com>
Author: Rui Li <lirui@apache.org>
Author: Rui Li <shlr@cn.ibm.com>

Closes #12775 from lirui-intel/SPARK-14958.
2017-01-05 14:51:13 -08:00
Wenchen Fan 30345c43b7 [SPARK-19058][SQL] fix partition related behaviors with DataFrameWriter.saveAsTable
## What changes were proposed in this pull request?

When we append data to a partitioned table with `DataFrameWriter.saveAsTable`, there are 2 issues:
1. it doesn't work when the partition has a custom location.
2. it will recover all partitions.

This PR fixes them by moving the special partition handling code from `DataSourceAnalysis` to `InsertIntoHadoopFsRelationCommand`, so that the `DataFrameWriter.saveAsTable` code path can also benefit from it.

## How was this patch tested?

newly added regression tests

Author: Wenchen Fan <wenchen@databricks.com>

Closes #16460 from cloud-fan/append.
2017-01-05 14:11:05 +08:00
uncleGen 6873430cb5 [SPARK-19009][DOC] Add streaming rest api doc
## What changes were proposed in this pull request?

add streaming rest api doc

related to pr #16253

cc saturday-shi srowen

## How was this patch tested?

Author: uncleGen <hustyugm@gmail.com>

Closes #16414 from uncleGen/SPARK-19009.
2017-01-04 15:14:51 -08:00
Kay Ousterhout 00074b5778 [SPARK-19062] Utils.writeByteBuffer bug fix
This commit changes Utils.writeByteBuffer so that it does not change
the position of the ByteBuffer that it writes out, and adds a unit test for
this functionality.
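
For illustration, a standalone sketch of one way to write a ByteBuffer without disturbing its position (not necessarily the exact approach taken in Utils):

```scala
import java.io.{ByteArrayOutputStream, DataOutputStream}
import java.nio.ByteBuffer

// Writing through a duplicate leaves the original buffer's position untouched,
// because duplicates share content but keep independent positions.
def writeByteBuffer(bb: ByteBuffer, out: DataOutputStream): Unit = {
  val dup = bb.duplicate()
  val bytes = new Array[Byte](dup.remaining())
  dup.get(bytes)                  // advances dup's position, not bb's
  out.write(bytes, 0, bytes.length)
}

val buf = ByteBuffer.wrap(Array[Byte](1, 2, 3, 4))
val out = new DataOutputStream(new ByteArrayOutputStream())
writeByteBuffer(buf, out)
assert(buf.position() == 0)       // original position unchanged
```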

cc mridulm

Author: Kay Ousterhout <kayousterhout@gmail.com>

Closes #16462 from kayousterhout/SPARK-19062.
2017-01-04 11:21:09 -08:00
Herman van Hovell 4262fb0d55 [SPARK-19070] Clean-up dataset actions
## What changes were proposed in this pull request?
Dataset actions currently spin off a new `DataFrame` only to track query execution. This PR simplifies this code path by using `Dataset.queryExecution` directly. This PR also merges the typed and untyped action evaluation paths.

## How was this patch tested?
Existing tests.

Author: Herman van Hovell <hvanhovell@databricks.com>

Closes #16466 from hvanhovell/SPARK-19070.
2017-01-04 23:47:58 +08:00
Niranjan Padmanabhan a1e40b1f5d
[MINOR][DOCS] Remove consecutive duplicated words/typo in Spark Repo
## What changes were proposed in this pull request?
There are many locations in the Spark repo where the same word occurs consecutively. Sometimes they are appropriately placed, but many times they are not. This PR removes the inappropriately duplicated words.

## How was this patch tested?
N/A since only docs or comments were updated.

Author: Niranjan Padmanabhan <niranjan.padmanabhan@gmail.com>

Closes #16455 from neurons/np.structure_streaming_doc.
2017-01-04 15:07:29 +00:00
Zheng RuiFeng 7a82505817
[SPARK-19054][ML] Eliminate extra pass in NB
## What changes were proposed in this pull request?
eliminate unnecessary extra pass in NB's train

## How was this patch tested?
existing tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #16453 from zhengruifeng/nb_getNC.
2017-01-04 11:54:13 +00:00
Wenchen Fan 101556d0fa [SPARK-19060][SQL] remove the supportsPartial flag in AggregateFunction
## What changes were proposed in this pull request?

Now that all aggregation functions support partial aggregation, we can remove the `supportsPartial` flag in `AggregateFunction`.

## How was this patch tested?

existing tests.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #16461 from cloud-fan/partial.
2017-01-04 12:46:30 +01:00
mingfei fe1c895e16
[SPARK-19073] LauncherState should be only set to SUBMITTED after the application is submitted
## What changes were proposed in this pull request?
LauncherState should be only set to SUBMITTED after the application is submitted.
Currently the state is set before the application is actually submitted.

## How was this patch tested?
no test is added in this patch

Author: mingfei <mingfei.smf@alipay.com>

Closes #16459 from shimingfei/fixLauncher.
2017-01-04 10:27:51 +00:00
Wenchen Fan cbd11d2357 [SPARK-19072][SQL] codegen of Literal should not output boxed value
## What changes were proposed in this pull request?

In https://github.com/apache/spark/pull/16402 we made a mistake: when a double/float is infinity, the `Literal` codegen outputs a boxed value and causes a wrong result.

This PR fixes this by special-casing infinity so that no boxed value is output.

## How was this patch tested?

new regression test

Author: Wenchen Fan <wenchen@databricks.com>

Closes #16469 from cloud-fan/literal.
2017-01-03 22:40:14 -08:00
gatorsmile b67b35f76b [SPARK-19048][SQL] Delete Partition Location when Dropping Managed Partitioned Tables in InMemoryCatalog
### What changes were proposed in this pull request?
The data in a managed table should be deleted after the table is dropped. However, if the partition location is not under the location of the partitioned table, it is not deleted as expected. Users can specify any location for a partition when adding it.

This PR is to delete partition location when dropping managed partitioned tables stored in `InMemoryCatalog`.

### How was this patch tested?
Added test cases for both HiveExternalCatalog and InMemoryCatalog

Author: gatorsmile <gatorsmile@gmail.com>

Closes #16448 from gatorsmile/unsetSerdeProp.
2017-01-03 11:43:47 -08:00
Devaraj K 89bf370e4f [SPARK-15555][MESOS] Driver with --supervise option cannot be killed in Mesos mode
## What changes were proposed in this pull request?

Do not add killed applications for retry.
## How was this patch tested?

I have verified manually in the Mesos cluster; with the changes, killed applications move to the Finished Drivers section and are not retried.

Author: Devaraj K <devaraj@apache.org>

Closes #13323 from devaraj-kavali/SPARK-15555.
2017-01-03 11:02:42 -08:00
Dongjoon Hyun 7a2b5f93bc [SPARK-18877][SQL] CSVInferSchema.inferField on DecimalType should find a common type with typeSoFar
## What changes were proposed in this pull request?

CSV type inferencing causes `IllegalArgumentException` on decimal numbers with heterogeneous precisions and scales because the current logic uses the last decimal type in a **partition**. Specifically, `inferRowType`, the **seqOp** of **aggregate**, returns the last decimal type. This PR fixes it to use `findTightestCommonType`.

**decimal.csv**
```
9.03E+12
1.19E+11
```

**BEFORE**
```scala
scala> spark.read.format("csv").option("inferSchema", true).load("decimal.csv").printSchema
root
 |-- _c0: decimal(3,-9) (nullable = true)

scala> spark.read.format("csv").option("inferSchema", true).load("decimal.csv").show
16/12/16 14:32:49 ERROR Executor: Exception in task 0.0 in stage 4.0 (TID 4)
java.lang.IllegalArgumentException: requirement failed: Decimal precision 4 exceeds max precision 3
```

**AFTER**
```scala
scala> spark.read.format("csv").option("inferSchema", true).load("decimal.csv").printSchema
root
 |-- _c0: decimal(4,-9) (nullable = true)

scala> spark.read.format("csv").option("inferSchema", true).load("decimal.csv").show
+---------+
|      _c0|
+---------+
|9.030E+12|
| 1.19E+11|
+---------+
```

## How was this patch tested?

Pass the newly add test case.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #16320 from dongjoon-hyun/SPARK-18877.
2017-01-03 23:06:50 +08:00
Liang-Chi Hsieh 52636226dc [SPARK-18932][SQL] Support partial aggregation for collect_set/collect_list
## What changes were proposed in this pull request?

Currently the collect_set/collect_list aggregation expressions don't support partial aggregation. This patch enables partial aggregation for them.

## How was this patch tested?

Jenkins tests.

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #16371 from viirya/collect-partial-support.
2017-01-03 22:11:54 +08:00
Weiqing Yang e5c307c50a
[MINOR] Add missing sc.stop() to end of examples
## What changes were proposed in this pull request?

Add `finally` clause for `sc.stop()` in the `test("register and deregister Spark listener from SparkContext")`.

## How was this patch tested?
Pass the build and unit tests.

Author: Weiqing Yang <yangweiqing001@gmail.com>

Closes #16426 from weiqingy/testIssue.
2017-01-03 09:56:42 +00:00
Zhenhua Wang ae83c21125 [SPARK-18998][SQL] Add a cbo conf to switch between default statistics and estimated statistics
## What changes were proposed in this pull request?

We add a cbo configuration to switch between default stats and estimated stats.
We also define a new statistics method `planStats` in LogicalPlan with conf as its parameter, in order to pass the cbo switch and other estimation-related configurations in the future. `planStats` is used on the caller side (i.e. in Optimizer and Strategies) to make transformation decisions based on stats.

## How was this patch tested?

Add a test case using a dummy LogicalPlan.

Author: Zhenhua Wang <wzh_zju@163.com>

Closes #16401 from wzhfy/cboSwitch.
2017-01-03 12:19:52 +08:00
gatorsmile a6cd9dbc60 [SPARK-19029][SQL] Remove databaseName from SimpleCatalogRelation
### What changes were proposed in this pull request?
Remove useless `databaseName` from `SimpleCatalogRelation`.

### How was this patch tested?
Existing test cases.

Author: gatorsmile <gatorsmile@gmail.com>

Closes #16438 from gatorsmile/removeDBFromSimpleCatalogRelation.
2017-01-03 11:55:31 +08:00
hyukjinkwon 46b2126024
[SPARK-19002][BUILD][PYTHON] Check pep8 against all Python scripts
## What changes were proposed in this pull request?

This PR proposes to check pep8 against all other Python scripts and fix the errors as below:

```bash
./dev/create-release/generate-contributors.py
./dev/create-release/releaseutils.py
./dev/create-release/translate-contributors.py
./dev/lint-python
./python/docs/epytext.py
./examples/src/main/python/mllib/decision_tree_classification_example.py
./examples/src/main/python/mllib/decision_tree_regression_example.py
./examples/src/main/python/mllib/gradient_boosting_classification_example.py
./examples/src/main/python/mllib/gradient_boosting_regression_example.py
./examples/src/main/python/mllib/linear_regression_with_sgd_example.py
./examples/src/main/python/mllib/logistic_regression_with_lbfgs_example.py
./examples/src/main/python/mllib/naive_bayes_example.py
./examples/src/main/python/mllib/random_forest_classification_example.py
./examples/src/main/python/mllib/random_forest_regression_example.py
./examples/src/main/python/mllib/svm_with_sgd_example.py
./examples/src/main/python/streaming/network_wordjoinsentiments.py
./sql/hive/src/test/resources/data/scripts/cat.py
./sql/hive/src/test/resources/data/scripts/cat_error.py
./sql/hive/src/test/resources/data/scripts/doubleescapedtab.py
./sql/hive/src/test/resources/data/scripts/dumpdata_script.py
./sql/hive/src/test/resources/data/scripts/escapedcarriagereturn.py
./sql/hive/src/test/resources/data/scripts/escapednewline.py
./sql/hive/src/test/resources/data/scripts/escapedtab.py
./sql/hive/src/test/resources/data/scripts/input20_script.py
./sql/hive/src/test/resources/data/scripts/newline.py
```

## How was this patch tested?

- `./python/docs/epytext.py`

  ```bash
  cd ./python/docs && make html
  ```

- pep8 check (Python 2.7 / Python 3.3.6)

  ```
  ./dev/lint-python
  ```

- `./dev/merge_spark_pr.py` (Python 2.7 only / Python 3.3.6 not working)

  ```bash
  python -m doctest -v ./dev/merge_spark_pr.py
  ```

- `./dev/create-release/releaseutils.py` `./dev/create-release/generate-contributors.py` `./dev/create-release/translate-contributors.py` (Python 2.7 only / Python 3.3.6 not working)

  ```bash
  python generate-contributors.py
  python translate-contributors.py
  ```

- Examples (Python 2.7 / Python 3.3.6)

  ```bash
  ./bin/spark-submit examples/src/main/python/mllib/decision_tree_classification_example.py
  ./bin/spark-submit examples/src/main/python/mllib/decision_tree_regression_example.py
  ./bin/spark-submit examples/src/main/python/mllib/gradient_boosting_classification_example.py
  ./bin/spark-submit examples/src/main/python/mllib/gradient_boosting_regression_example.py
  ./bin/spark-submit examples/src/main/python/mllib/random_forest_classification_example.py
  ./bin/spark-submit examples/src/main/python/mllib/random_forest_regression_example.py
  ```

- Examples (Python 2.7 only / Python 3.3.6 not working)
  ```
  ./bin/spark-submit examples/src/main/python/mllib/linear_regression_with_sgd_example.py
  ./bin/spark-submit examples/src/main/python/mllib/logistic_regression_with_lbfgs_example.py
  ./bin/spark-submit examples/src/main/python/mllib/naive_bayes_example.py
  ./bin/spark-submit examples/src/main/python/mllib/svm_with_sgd_example.py
  ```

- `sql/hive/src/test/resources/data/scripts/*.py` (Python 2.7 / Python 3.3.6 within suggested changes)

  Manually tested only changed ones.

- `./dev/github_jira_sync.py` (Python 2.7 only / Python 3.3.6 not working)

  Manually tested this after disabling actually adding comments and links.

And also via Jenkins tests.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #16405 from HyukjinKwon/minor-pep8.
2017-01-02 15:23:19 +00:00
hyukjinkwon f1330b1d9e
[SPARK-19022][TESTS] Fix tests dependent on OS due to different newline characters
## What changes were proposed in this pull request?

There are two tests failing on Windows due to the different newlines.

```
 - StreamingQueryProgress - prettyJson *** FAILED *** (0 milliseconds)
 "{
    "id" : "39788670-6722-48b7-a248-df6ba08722ac",
    "runId" : "422282f1-3b81-4b47-a15d-82dda7e69390",
    "name" : "myName",
    ...
  }" did not equal "{
    "id" : "39788670-6722-48b7-a248-df6ba08722ac",
    "runId" : "422282f1-3b81-4b47-a15d-82dda7e69390",
    "name" : "myName",
    ...
  }"
  ...
```

```
 - StreamingQueryStatus - prettyJson *** FAILED *** (0 milliseconds)
 "{
    "message" : "active",
    "isDataAvailable" : true,
    "isTriggerActive" : false
  }" did not equal "{
    "message" : "active",
    "isDataAvailable" : true,
    "isTriggerActive" : false
  }"
  ...
```

The reason is that `pretty` in `org.json4s.pretty` writes OS-dependent newlines, but the strings defined in the tests use `\n`. This ends up with test failures.

This PR proposes to compare these regardless of newline concerns.
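
For illustration, a standalone sketch of a newline-insensitive comparison (not the exact change in the suite):

```scala
// Normalize CRLF/CR to LF on both sides before comparing, so the assertion
// no longer depends on the OS line separator.
def normalizeNewlines(s: String): String = s.replaceAll("\r\n|\r", "\n")

val expected = "{\n  \"message\" : \"active\"\n}"
val actual   = "{\r\n  \"message\" : \"active\"\r\n}"   // e.g. what pretty printing may emit on Windows
assert(normalizeNewlines(actual) == normalizeNewlines(expected))
```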

## How was this patch tested?

Manually tested via AppVeyor.

**Before**
https://ci.appveyor.com/project/spark-test/spark/build/417-newlines-fix-before

**After**
https://ci.appveyor.com/project/spark-test/spark/build/418-newlines-fix

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #16433 from HyukjinKwon/tests-StreamingQueryStatusAndProgressSuite.
2017-01-02 15:17:02 +00:00
Liang-Chi Hsieh 0ac2f1e71f
[MINOR][DOC] Minor doc change for YARN credential providers
## What changes were proposed in this pull request?

The configuration `spark.yarn.security.tokens.{service}.enabled` is deprecated. Now we should use `spark.yarn.security.credentials.{service}.enabled`. Some places in the doc are not updated yet.

## How was this patch tested?

N/A. Just doc change.

Please review http://spark.apache.org/contributing.html before opening a pull request.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #16444 from viirya/minor-credential-provider-doc.
2017-01-02 14:41:57 +00:00
Liwei Lin 808b84e2de
[SPARK-19041][SS] Fix code snippet compilation issues in Structured Streaming Programming Guide
## What changes were proposed in this pull request?

Currently some code snippets in the programming guide just do not compile. We should fix them.

## How was this patch tested?

```
SKIP_API=1 jekyll build
```

## Screenshot from part of the change:

![snip20161231_37](https://cloud.githubusercontent.com/assets/15843379/21576864/cc52fcd8-cf7b-11e6-8bd6-f935d9ff4a6b.png)

Author: Liwei Lin <lwlin7@gmail.com>

Closes #16442 from lw-lin/ss-pro-guide-.
2017-01-02 14:40:06 +00:00
Sean Owen ba4881268e
[BUILD] Close stale PRs
Closes #12968
Closes #16215
Closes #16212
Closes #16086
Closes #15713
Closes #16413
Closes #16396

Author: Sean Owen <sowen@cloudera.com>

Closes #16447 from srowen/CloseStalePRs.
2017-01-02 14:34:02 +00:00
Shixiong Zhu 2394047370 [SPARK-19050][SS][TESTS] Fix EventTimeWatermarkSuite 'delay in months and years handled correctly'
## What changes were proposed in this pull request?

`monthsSinceEpoch` in this test is like `math.floor(num)`, so `monthDiff` has two possible values.

## How was this patch tested?

Jenkins.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #16449 from zsxwing/watermark-test-hotfix.
2017-01-01 13:25:44 -08:00
gatorsmile 35e974076d [SPARK-19028][SQL] Fixed non-thread-safe functions used in SessionCatalog
### What changes were proposed in this pull request?
Fixed non-thread-safe functions used in SessionCatalog:
- refreshTable
- lookupRelation

### How was this patch tested?
N/A

Author: gatorsmile <gatorsmile@gmail.com>

Closes #16437 from gatorsmile/addSyncToLookUpTable.
2016-12-31 19:40:28 +08:00
Cheng Lian 871f6114ac [SPARK-19016][SQL][DOC] Document scalable partition handling
## What changes were proposed in this pull request?

This PR documents the scalable partition handling feature in the body of the programming guide.

Before this PR, we only mention it in the migration guide. It's not super clear that external datasource tables require an extra `MSCK REPAIR TABLE` command to have per-partition information persisted since 2.1.
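
For example, something along these lines (the table name is made up; assumes a SparkSession named `spark`, e.g. in spark-shell):

```scala
// Made-up table name: after creating an external partitioned datasource table
// over existing files, run a one-time repair so per-partition information is
// persisted in the metastore.
spark.sql("MSCK REPAIR TABLE logs")
// equivalently:
spark.sql("ALTER TABLE logs RECOVER PARTITIONS")
```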

## How was this patch tested?

N/A.

Author: Cheng Lian <lian@databricks.com>

Closes #16424 from liancheng/scalable-partition-handling-doc.
2016-12-30 14:46:30 -08:00
Dongjoon Hyun b85e29437d [SPARK-18123][SQL] Use db column names instead of RDD column ones during JDBC Writing
## What changes were proposed in this pull request?

Apache Spark supports the following cases **by quoting RDD column names** while saving through JDBC.
- Allow reserved keyword as a column name, e.g., 'order'.
- Allow mixed-case column names like the following, e.g., `[a: int, A: int]`.

  ``` scala
  scala> val df = sql("select 1 a, 1 A")
  df: org.apache.spark.sql.DataFrame = [a: int, A: int]
  ...
  scala> df.write.mode("overwrite").format("jdbc").options(option).save()
  scala> df.write.mode("append").format("jdbc").options(option).save()
  ```

This PR aims to use **database column names** instead of RDD column ones in order to additionally support the following.
Note that this case succeeds with `MySQL`, but failed on `Postgres`/`Oracle` before.

``` scala
val df1 = sql("select 1 a")
val df2 = sql("select 1 A")
...
df1.write.mode("overwrite").format("jdbc").options(option).save()
df2.write.mode("append").format("jdbc").options(option).save()
```
## How was this patch tested?

Pass the Jenkins test with a new testcase.

Author: Dongjoon Hyun <dongjoon@apache.org>
Author: gatorsmile <gatorsmile@gmail.com>

Closes #15664 from dongjoon-hyun/SPARK-18123.
2016-12-30 10:27:14 -08:00