Commit graph

16280 commits

Author SHA1 Message Date
Sean Owen 4768d037b7 [SPARK-15386][CORE] Master doesn't compile against Java 1.7 / Process.isAlive
## What changes were proposed in this pull request?

Remove call to Process.isAlive -- Java 8 only. Introduced in https://github.com/apache/spark/pull/13042 / SPARK-15263

## How was this patch tested?

Jenkins tests

Author: Sean Owen <sowen@cloudera.com>

Closes #13174 from srowen/SPARK-15386.
2016-05-18 14:22:21 -07:00
Nick Pentreath e8b79afa02 [SPARK-14891][ML] Add schema validation for ALS
This PR adds schema validation to `ml`'s ALS and ALSModel. Currently, no schema validation was performed as `transformSchema` was never called in `ALS.fit` or `ALSModel.transform`. Furthermore, due to no schema validation, if users passed in Long (or Float etc) ids, they would be silently cast to Int with no warning or error thrown.

With this PR, ALS now supports all numeric types for `user`, `item`, and `rating` columns. The rating column is cast to `Float` and the user and item cols are cast to `Int` (as is the case currently) - however for user/item, the cast throws an error if the value is outside integer range. Behavior for rating col is unchanged (as it is not an issue).

## How was this patch tested?
New test cases in `ALSSuite`.

Author: Nick Pentreath <nickp@za.ibm.com>

Closes #12762 from MLnick/SPARK-14891-als-validate-schema.
2016-05-18 21:13:12 +02:00
Liang-Chi Hsieh 3d1e67f903 [SPARK-15342] [SQL] [PYSPARK] PySpark test for non ascii column name does not actually test with unicode column name
## What changes were proposed in this pull request?

The PySpark SQL `test_column_name_with_non_ascii` wants to test non-ascii column name. But it doesn't actually test it. We need to construct an unicode explicitly using `unicode` under Python 2.

## How was this patch tested?

Existing tests.

Author: Liang-Chi Hsieh <simonh@tw.ibm.com>

Closes #13134 from viirya/correct-non-ascii-colname-pytest.
2016-05-18 11:18:33 -07:00
Davies Liu 8fb1d1c7f3 [SPARK-15357] Cooperative spilling should check consumer memory mode
## What changes were proposed in this pull request?

Since we support forced spilling for Spillable, which only works in OnHeap mode, different from other SQL operators (could be OnHeap or OffHeap), we should considering the mode of consumer before calling trigger forced spilling.

## How was this patch tested?

Add new test.

Author: Davies Liu <davies@databricks.com>

Closes #13151 from davies/fix_mode.
2016-05-18 09:44:21 -07:00
Tejas Patil c1fd9cacba [SPARK-15263][CORE] Make shuffle service dir cleanup faster by using rm -rf
## What changes were proposed in this pull request?

Jira: https://issues.apache.org/jira/browse/SPARK-15263

The current logic for directory cleanup is slow because it does directory listing, recurses over child directories, checks for symbolic links, deletes leaf files and finally deletes the dirs when they are empty. There is back-and-forth switching from kernel space to user space while doing this. Since most of the deployment backends would be Unix systems, we could essentially just do `rm -rf` so that entire deletion logic runs in kernel space.

The current Java based impl in Spark seems to be similar to what standard libraries like guava and commons IO do (eg. http://svn.apache.org/viewvc/commons/proper/io/trunk/src/main/java/org/apache/commons/io/FileUtils.java?view=markup#l1540). However, guava removed this method in favour of shelling out to an operating system command (like in this PR). See the `Deprecated` note in older javadocs for guava for details : http://google.github.io/guava/releases/10.0.1/api/docs/com/google/common/io/Files.html#deleteRecursively(java.io.File)

Ideally, Java should be providing such APIs so that users won't have to do such things to get platform specific code. Also, its not just about speed, but also handling race conditions while doing at FS deletions is tricky. I could find this bug for Java in similar context : http://bugs.java.com/bugdatabase/view_bug.do?bug_id=7148952

## How was this patch tested?

I am relying on existing test cases to test the method. If there are suggestions about testing it, welcome to hear about it.

## Performance gains

*Input setup* : Created a nested directory structure of depth 3 and each entry having 50 sub-dirs. The input being cleaned up had total ~125k dirs.

Ran both approaches (in isolation) for 6 times to get average numbers:

Native Java cleanup  | `rm -rf` as a separate process
------------ | -------------
10.04 sec | 4.11 sec

This change made deletion 2.4 times faster for the given test input.

Author: Tejas Patil <tejasp@fb.com>

Closes #13042 from tejasapatil/delete_recursive.
2016-05-18 12:10:32 +01:00
DLucky 420b700695 [SPARK-15346][MLLIB] Reduce duplicate computation in picking initial points
mateiz srowen

I state that the contribution is my original work and that I license the work to the project under the project's open source license

There's some format problems with my last PR, with HyukjinKwon 's help I read the guidance, re-check my code and PR, then run the tests, finally re-submit the PR request here.

The related JIRA issue though marked as resolved, this change may relate to it I think.

## Proposed Change

After picking each new initial centers, it's unnecessary to compute the distances between all the points and the old ones.
Instead this change keeps the distance between all the points and their closest centers, and compare to the distance of them with the new center then update them.

## Test result

One can find an easy test way in (https://issues.apache.org/jira/browse/SPARK-6706)

I test the KMeans++ method for a small dataset with 16k points, and the whole KMeans|| with a large one with 240k points.
The data has 4096 features and I tunes K from 100 to 500.
The test environment was on my 4 machine cluster, I also tested a 3M points data on a larger cluster with 25 machines and got similar results, which I would not draw the detail curve. The result of the first two exps are shown below

### Local KMeans++ test:

Dataset:4m_ini_center
Data_size:16234
Dimension:4096

Lloyd's Iteration = 10
The y-axis is time in sec, the x-axis is tuning the K.

![image](https://cloud.githubusercontent.com/assets/10915169/15175831/d0c92b82-179a-11e6-8b68-4e165fc2fdff.png)

![local_total](https://cloud.githubusercontent.com/assets/10915169/15175957/6b21c3b0-179b-11e6-9741-66dfe4e23eb7.jpg)

### On a larger dataset

An improve show in the graph but not commit in this file: In this experiment I also have an improvement for calculation in normalization data (the distance is convert to the cosine distance). As if the data is normalized into (0,1), one improvement in the original vesion for util.MLUtils.fastSauaredDistance would have no effect (the precisionBound 2.0 * EPSILON * sumSquaredNorm / (normDiff * normDiff + EPSILON) will never less then precision in this case). Therefore I design an early terminal method when comparing two distance (used for findClosest). But I don't include this improve in this file, you may only refer to the curves without "normalize" for comparing the results.

Dataset:4k24
Data_size:243960
Dimension:4096

Normlize 	Enlarge 	Initialize 	Lloyd's_Iteration
NO    	1 	         3 	          5
YES 	        10000 	 3 	          5

Notice: the normlized data is enlarged to ensure precision

The cost time: x-for value of K, y-for time in sec
![4k24_total](https://cloud.githubusercontent.com/assets/10915169/15176635/9a54c0bc-179e-11e6-81c5-238e0c54bce2.jpg)

SE for unnormalized data between two version, to ensure the correctness
![4k24_unnorm_se](https://cloud.githubusercontent.com/assets/10915169/15176661/b85dabc8-179e-11e6-9269-fe7d2101dd48.jpg)

Here is the SE between normalized data just for reference, it's also correct.
![4k24_norm_se](https://cloud.githubusercontent.com/assets/10915169/15176742/1fbde940-179f-11e6-8290-d24b0dd4a4f7.jpg)

Author: DLucky <mouendless@gmail.com>

Closes #13133 from mouendless/patch-2.
2016-05-18 12:05:21 +01:00
Cheng Lian c4a45fd855 [SPARK-15334][SQL][HOTFIX] Fixes compilation error for Scala 2.10
## What changes were proposed in this pull request?

This PR fixes a Scala 2.10 compilation failure introduced in PR #13127.

## How was this patch tested?

Jenkins build.

Author: Cheng Lian <lian@databricks.com>

Closes #13166 from liancheng/hotfix-for-scala-2.10.
2016-05-18 18:58:24 +08:00
Dongjoon Hyun d2f81df1ba [MINOR][SQL] Remove unused pattern matching variables in Optimizers.
## What changes were proposed in this pull request?

This PR removes unused pattern matching variable in Optimizers in order to improve readability.

## How was this patch tested?

Pass the existing Jenkins tests.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #13145 from dongjoon-hyun/remove_unused_pattern_matching_variables.
2016-05-18 11:51:50 +01:00
WeichenXu 2f9047b5eb [SPARK-15322][MLLIB][CORE][SQL] update deprecate accumulator usage into accumulatorV2 in spark project
## What changes were proposed in this pull request?

I use Intellj-IDEA to search usage of deprecate SparkContext.accumulator in the whole spark project, and update the code.(except those test code for accumulator method itself)

## How was this patch tested?

Exisiting unit tests

Author: WeichenXu <WeichenXu123@outlook.com>

Closes #13112 from WeichenXu123/update_accuV2_in_mllib.
2016-05-18 11:48:46 +01:00
Davies Liu 33814f887a [SPARK-15307][SQL] speed up listing files for data source
## What changes were proposed in this pull request?

Currently, listing files is very slow if there is thousands files, especially on local file system, because:
1) FileStatus.getPermission() is very slow on local file system, which is launch a subprocess and parse the stdout.
2) Create an JobConf is very expensive (ClassUtil.findContainingJar() is slow).

This PR improve these by:
1) Use another constructor of LocatedFileStatus to avoid calling FileStatus.getPermission, the permissions are not used for data sources.
2) Only create an JobConf once within one task.

## How was this patch tested?

Manually tests on a partitioned table with 1828 partitions, decrease the time to load the table from 22 seconds to 1.6 seconds (Most of time are spent in merging schema now).

Author: Davies Liu <davies@databricks.com>

Closes #13094 from davies/listing.
2016-05-18 18:46:57 +08:00
Sean Zhong 6e02aec44b [SPARK-15334][SQL] HiveClient facade not compatible with Hive 0.12
## What changes were proposed in this pull request?

HiveClient facade is not compatible with Hive 0.12.

This PR Fixes the following compatibility issues:
1. `org.apache.spark.sql.hive.client.HiveClientImpl` use `AddPartitionDesc(db, table, ignoreIfExists)` to create partitions, however, Hive 0.12 doesn't have this constructor for `AddPartitionDesc`.
2. `HiveClientImpl` uses `PartitionDropOptions` when dropping partition, however, class `PartitionDropOptions` doesn't exist in Hive 0.12.
3. Hive 0.12 doesn't support adding permanent functions. It is not valid to call `org.apache.hadoop.hive.ql.metadata.Hive.createFunction`, `org.apache.hadoop.hive.ql.metadata.Hive.alterFunction`, and `org.apache.hadoop.hive.ql.metadata.Hive.alterFunction`
4. `org.apache.spark.sql.hive.client.VersionsSuite` doesn't have enough test coverage for different hive versions 0.12, 0.13, 0.14, 1.0.0, 1.1.0, 1.2.0.

## How was this patch tested?

Unit test.

Author: Sean Zhong <seanzhong@databricks.com>

Closes #13127 from clockfly/versionSuite.
2016-05-18 16:00:02 +08:00
Takuya Kuwahara 411c04adb5 [SPARK-14978][PYSPARK] PySpark TrainValidationSplitModel should support validationMetrics
## What changes were proposed in this pull request?

This pull request includes supporting validationMetrics for TrainValidationSplitModel with Python and test for it.

## How was this patch tested?

test in `python/pyspark/ml/tests.py`

Author: Takuya Kuwahara <taakuu19@gmail.com>

Closes #12767 from taku-k/spark-14978.
2016-05-18 08:29:47 +02:00
Yin Huai 2a5db9c140 [SPARK-14346] Fix scala-2.10 build
## What changes were proposed in this pull request?
Scala 2.10 build was broken by #13079. I am reverting the change of that line.

Author: Yin Huai <yhuai@databricks.com>

Closes #13157 from yhuai/SPARK-14346-fix-scala2.10.
2016-05-17 18:02:31 -07:00
Sean Zhong 25b315e6ca [SPARK-15171][SQL] Remove the references to deprecated method dataset.registerTempTable
## What changes were proposed in this pull request?

Update the unit test code, examples, and documents to remove calls to deprecated method `dataset.registerTempTable`.

## How was this patch tested?

This PR only changes the unit test code, examples, and comments. It should be safe.
This is a follow up of PR https://github.com/apache/spark/pull/12945 which was merged.

Author: Sean Zhong <seanzhong@databricks.com>

Closes #13098 from clockfly/spark-15171-remove-deprecation.
2016-05-18 09:01:59 +08:00
Cheng Lian b674e67c22 [SPARK-14346][SQL] Native SHOW CREATE TABLE for Hive tables/views
## What changes were proposed in this pull request?

This is a follow-up of #12781. It adds native `SHOW CREATE TABLE` support for Hive tables and views. A new field `hasUnsupportedFeatures` is added to `CatalogTable` to indicate whether all table metadata retrieved from the concrete underlying external catalog (i.e. Hive metastore in this case) can be mapped to fields in `CatalogTable`. This flag is useful when the target Hive table contains structures that can't be handled by Spark SQL, e.g., skewed columns and storage handler, etc..

## How was this patch tested?

New test cases are added in `ShowCreateTableSuite` to do round-trip tests.

Author: Cheng Lian <lian@databricks.com>

Closes #13079 from liancheng/spark-14346-show-create-table-for-hive-tables.
2016-05-17 15:56:44 -07:00
Shixiong Zhu 8e8bc9f957 [SPARK-11735][CORE][SQL] Add a check in the constructor of SQLContext/SparkSession to make sure its SparkContext is not stopped
## What changes were proposed in this pull request?

Add a check in the constructor of SQLContext/SparkSession to make sure its SparkContext is not stopped.

## How was this patch tested?

Jenkins unit tests.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #13154 from zsxwing/check-spark-context-stop.
2016-05-17 14:57:21 -07:00
Dongjoon Hyun 0f576a5748 [SPARK-15244] [PYTHON] Type of column name created with createDataFrame is not consistent.
## What changes were proposed in this pull request?

**createDataFrame** returns inconsistent types for column names.
```python
>>> from pyspark.sql.types import StructType, StructField, StringType
>>> schema = StructType([StructField(u"col", StringType())])
>>> df1 = spark.createDataFrame([("a",)], schema)
>>> df1.columns # "col" is str
['col']
>>> df2 = spark.createDataFrame([("a",)], [u"col"])
>>> df2.columns # "col" is unicode
[u'col']
```

The reason is only **StructField** has the following code.
```
if not isinstance(name, str):
    name = name.encode('utf-8')
```
This PR adds the same logic into **createDataFrame** for consistency.
```
if isinstance(schema, list):
    schema = [x.encode('utf-8') if not isinstance(x, str) else x for x in schema]
```

## How was this patch tested?

Pass the Jenkins test (with new python doctest)

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #13097 from dongjoon-hyun/SPARK-15244.
2016-05-17 13:05:07 -07:00
DB Tsai e2efe0529a [SPARK-14615][ML] Use the new ML Vector and Matrix in the ML pipeline based algorithms
## What changes were proposed in this pull request?

Once SPARK-14487 and SPARK-14549 are merged, we will migrate to use the new vector and matrix type in the new ml pipeline based apis.

## How was this patch tested?

Unit tests

Author: DB Tsai <dbt@netflix.com>
Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
Author: Xiangrui Meng <meng@databricks.com>

Closes #12627 from dbtsai/SPARK-14615-NewML.
2016-05-17 12:51:07 -07:00
Dongjoon Hyun 9f176dd391 [MINOR][DOCS] Replace remaining 'sqlContext' in ScalaDoc/JavaDoc.
## What changes were proposed in this pull request?

According to the recent change, this PR replaces all the remaining `sqlContext` usage with `spark` in ScalaDoc/JavaDoc (.scala/.java files) except `SQLContext.scala`, `SparkPlan.scala', and `DatasetHolder.scala`.

## How was this patch tested?

Manual.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #13125 from dongjoon-hyun/minor_doc_sparksession.
2016-05-17 20:50:22 +02:00
Yuhao Yang 3308a862ba [SPARK-15182][ML] Copy MLlib doc to ML: ml.feature.tf, idf
## What changes were proposed in this pull request?

We should now begin copying algorithm details from the spark.mllib guide to spark.ml as needed, rather than just linking back to the corresponding algorithms in the spark.mllib user guide.

## How was this patch tested?

manual review for doc.

Author: Yuhao Yang <hhbyyh@gmail.com>
Author: Yuhao Yang <yuhao.yang@intel.com>

Closes #12957 from hhbyyh/tfidfdoc.
2016-05-17 20:44:19 +02:00
hyukjinkwon 8d05a7a98b [SPARK-10216][SQL] Avoid creating empty files during overwriting with group by query
## What changes were proposed in this pull request?

Currently, `INSERT INTO` with `GROUP BY` query tries to make at least 200 files (default value of `spark.sql.shuffle.partition`), which results in lots of empty files.

This PR makes it avoid creating empty files during overwriting into Hive table and in internal data sources  with group by query.

This checks whether the given partition has data in it or not and creates/writes file only when it actually has data.

## How was this patch tested?

Unittests in `InsertIntoHiveTableSuite` and `HadoopFsRelationTest`.

Closes #8411

Author: hyukjinkwon <gurwls223@gmail.com>
Author: Keuntae Park <sirpkt@apache.org>

Closes #12855 from HyukjinKwon/pr/8411.
2016-05-17 11:18:51 -07:00
Wenchen Fan 20a89478e1 [SPARK-14346][SQL][FOLLOW-UP] add tests for CREAT TABLE USING with partition and bucket
## What changes were proposed in this pull request?

https://github.com/apache/spark/pull/12781 introduced PARTITIONED BY, CLUSTERED BY, and SORTED BY keywords to CREATE TABLE USING. This PR adds tests to make sure those keywords are handled correctly.

This PR also fixes a mistake that we should create non-hive-compatible table if partition or bucket info exists.

## How was this patch tested?

N/A

Author: Wenchen Fan <wenchen@databricks.com>

Closes #13144 from cloud-fan/add-test.
2016-05-17 10:12:51 -07:00
Kousuke Saruta c0c3ec3547 [SPARK-15165] [SQL] Codegen can break because toCommentSafeString is not actually safe
## What changes were proposed in this pull request?

toCommentSafeString method replaces "\u" with "\\\\u" to avoid codegen breaking.
But if the even number of "\" is put before "u", like "\\\\u", in the string literal in the query, codegen can break.

Following code causes compilation error.

```
val df = Seq(...).toDF
df.select("'\\\\\\\\u002A/'").show
```

The reason of the compilation error is because "\\\\\\\\\\\\\\\\u002A/" is translated into "*/" (the end of comment).

Due to this unsafety, arbitrary code can be injected like as follows.

```
val df = Seq(...).toDF
// Inject "System.exit(1)"
df.select("'\\\\\\\\u002A/{System.exit(1);}/*'").show
```

## How was this patch tested?

Added new test cases.

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>
Author: sarutak <sarutak@oss.nttdata.co.jp>

Closes #12939 from sarutak/SPARK-15165.
2016-05-17 10:07:01 -07:00
wm624@hotmail.com bebe5f9811 [SPARK-15318][ML][EXAMPLE] spark.ml Collaborative Filtering example does not work in spark-shell
## What changes were proposed in this pull request?

(Please fill in changes proposed in this fix)

copy & paste example in ml-collaborative-filtering.html into spark-shell, we see the following errors.
scala> case class Rating(userId: Int, movieId: Int, rating: Float, timestamp: Long)
defined class Rating

scala> object Rating {
def parseRating(str: String): Rating = { | val fields = str.split("::") | assert(fields.size == 4) | Rating(fields(0).toInt, fields(1).toInt, fields(2).toFloat, fields(3).toLong) | }
}
<console>:29: error: Rating.type does not take parameters
Rating(fields(0).toInt, fields(1).toInt, fields(2).toFloat, fields(3).toLong)
^
In standard scala repl, it has the same error.

Scala/spark-shell repl has some quirks (e.g. packages are also not well supported).

The reason of errors is that scala/spark-shell repl discards previous definitions when we define the Object with the same class name. Solution: We can rename the Object Rating.

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)

Manually test it: 1). ./bin/run-example ALSExample
2). copy & paste example in the generated document. It works fine.

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes #13110 from wangmiao1981/repl.
2016-05-17 16:51:01 +01:00
Sean Owen 932d800293 [SPARK-15333][DOCS] Reorganize building-spark.md; rationalize vs wiki
## What changes were proposed in this pull request?

See JIRA for the motivation. The changes are almost entirely movement of text and edits to sections. Minor changes to text include:

- Copying in / merging text from the "Useful Developer Tools" wiki, in areas of
  - Docker
  - R
  - Running one test
- standardizing on ./build/mvn not mvn, and likewise for ./build/sbt
- correcting some typos
- standardizing code block formatting

No text has been removed from this doc; text has been imported from the https://cwiki.apache.org/confluence/display/SPARK/Useful+Developer+Tools wiki

## How was this patch tested?

Jekyll doc build and inspection of resulting HTML in browser.

Author: Sean Owen <sowen@cloudera.com>

Closes #13124 from srowen/SPARK-15333.
2016-05-17 16:40:38 +01:00
wm624@hotmail.com 4134ff0c65 [SPARK-14434][ML] User guide doc and examples for GaussianMixture in spark.ml
## What changes were proposed in this pull request?

(Please fill in changes proposed in this fix)

Add guide doc and examples for GaussianMixture in Spark.ml in Java, Scala and Python.

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)

Manual compile and test all examples

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes #12788 from wangmiao1981/example.
2016-05-17 15:20:47 +02:00
Wenchen Fan c36ca651f9 [SPARK-15351][SQL] RowEncoder should support array as the external type for ArrayType
## What changes were proposed in this pull request?

This PR improves `RowEncoder` and `MapObjects`, to support array as the external type for `ArrayType`. The idea is straightforward, we use `Object` as the external input type for `ArrayType`, and determine its type at runtime in `MapObjects`.

## How was this patch tested?

new test in `RowEncoderSuite`

Author: Wenchen Fan <wenchen@databricks.com>

Closes #13138 from cloud-fan/map-object.
2016-05-17 17:02:52 +08:00
Sean Owen 122302cbf5 [SPARK-15290][BUILD] Move annotations, like @Since / @DeveloperApi, into spark-tags
## What changes were proposed in this pull request?

(See https://github.com/apache/spark/pull/12416 where most of this was already reviewed and committed; this is just the module structure and move part. This change does not move the annotations into test scope, which was the apparently problem last time.)

Rename `spark-test-tags` -> `spark-tags`; move common annotations like `Since` to `spark-tags`

## How was this patch tested?

Jenkins tests.

Author: Sean Owen <sowen@cloudera.com>

Closes #13074 from srowen/SPARK-15290.
2016-05-17 09:55:53 +01:00
Xiangrui Meng 8ad9f08c94 [SPARK-14906][ML] Copy linalg in PySpark to new ML package
## What changes were proposed in this pull request?

Copy the linalg (Vector/Matrix and VectorUDT/MatrixUDT) in PySpark to new ML package.

## How was this patch tested?
Existing tests.

Author: Xiangrui Meng <meng@databricks.com>
Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #13099 from viirya/move-pyspark-vector-matrix-udt4.
2016-05-17 00:08:02 -07:00
Liwei Lin 95f4fbae52 [SPARK-14942][SQL][STREAMING] Reduce delay between batch construction and execution
## Problem

Currently in `StreamExecution`, [we first run the batch, then construct the next](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/streaming/StreamExecution.scala#L165):
```scala
if (dataAvailable) runBatch()
constructNextBatch()
```

This is good when we run batches ASAP, where data would get processed in the **very next batch**:

![1](https://cloud.githubusercontent.com/assets/15843379/14779964/2786e698-0b0d-11e6-9d2c-bb41513488b2.png)

However, when we run batches at trigger like `ProcessTime("1 minute")`, data - such as _y_ below - may not get processed in the very next batch i.e. _batch 1_, but in _batch 2_:

![2](https://cloud.githubusercontent.com/assets/15843379/14779818/6f3bb064-0b0c-11e6-9f16-c1ce4897186b.png)

## What changes were proposed in this pull request?

This patch reverses the order of `constructNextBatch()` and `runBatch()`. After this patch, data would get processed in the **very next batch**, i.e. _batch 1_:

![3](https://cloud.githubusercontent.com/assets/15843379/14779816/6f36ee62-0b0c-11e6-9e53-bc8397fade18.png)

In addition, this patch alters when we do `currentBatchId += 1`: let's do that when the processing of the current batch's data is completed, so we won't bother passing `currentBatchId + 1` or  `currentBatchId - 1` to states or sinks.

## How was this patch tested?

New added test case. Also this should be covered by existing test suits, e.g. stress tests and others.

Author: Liwei Lin <lwlin7@gmail.com>

Closes #12725 from lw-lin/construct-before-run-3.
2016-05-16 12:59:55 -07:00
Sean Owen fabc8e5b12 [SPARK-12972][CORE][TEST-MAVEN][TEST-HADOOP2.2] Update org.apache.httpcomponents.httpclient, commons-io
## What changes were proposed in this pull request?

This is sort of a hot-fix for https://github.com/apache/spark/pull/13117, but, the problem is limited to Hadoop 2.2. The change is to manage `commons-io` to 2.4 for all Hadoop builds, which is only a net change for Hadoop 2.2, which was using 2.1.

## How was this patch tested?

Jenkins tests -- normal PR builder, then the `[test-hadoop2.2] [test-maven]` if successful.

Author: Sean Owen <sowen@cloudera.com>

Closes #13132 from srowen/SPARK-12972.3.
2016-05-16 16:27:04 +01:00
Yanbo Liang f116a84ef8 [SPARK-14979][ML][PYSPARK] Add examples for GeneralizedLinearRegression
## What changes were proposed in this pull request?
Add Scala/Java/Python examples for ```GeneralizedLinearRegression```.

## How was this patch tested?
They are examples and have been tested offline.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #12754 from yanboliang/spark-14979.
2016-05-16 09:55:35 +02:00
wm624@hotmail.com c1836d66bd [SPARK-15305][ML][DOC] spark.ml document Bisectiong k-means has the incorrect format
## What changes were proposed in this pull request?

(Please fill in changes proposed in this fix)
The generated document has the incorrect format for biseckmeans.

![bug](https://cloud.githubusercontent.com/assets/5033592/15233120/d910098a-185a-11e6-901d-44aeafc8a011.jpg)

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)

Fix the formatting.
![fix](https://cloud.githubusercontent.com/assets/5033592/15233136/fce2ccd0-185a-11e6-9ded-14d71da4bdab.jpg)

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes #13083 from wangmiao1981/doc.
2016-05-16 08:22:16 +02:00
Sean Zhong 4a5ee1954a [SPARK-15253][SQL] Support old table schema config key "spark.sql.sources.schema" for DESCRIBE TABLE
## What changes were proposed in this pull request?

"DESCRIBE table" is broken when table schema is stored at key "spark.sql.sources.schema".

Originally, we used spark.sql.sources.schema to store the schema of a data source table.
After SPARK-6024, we removed this flag. Although we are not using spark.sql.sources.schema any more, we need to still support it.

## How was this patch tested?

Unit test.

When using spark2.0 to load a table generated by spark 1.2.
Before change:
`DESCRIBE table` => Schema of this table is inferred at runtime,,

After change:
`DESCRIBE table` => correct output.

Author: Sean Zhong <seanzhong@databricks.com>

Closes #13073 from clockfly/spark-15253.
2016-05-16 10:41:20 +08:00
Zheng RuiFeng c7efc56c7b [MINOR] Fix Typos
## What changes were proposed in this pull request?
1,Rename matrix args in BreezeUtil to upper to match the doc
2,Fix several typos in ML and SQL

## How was this patch tested?
manual tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #13078 from zhengruifeng/fix_ann.
2016-05-15 15:59:49 +01:00
Sean Owen f5576a052d [SPARK-12972][CORE] Update org.apache.httpcomponents.httpclient
## What changes were proposed in this pull request?

(Retry of https://github.com/apache/spark/pull/13049)

- update to httpclient 4.5 / httpcore 4.4
- remove some defunct exclusions
- manage httpmime version to match
- update selenium / httpunit to support 4.5 (possible now that Jetty 9 is used)

## How was this patch tested?

Jenkins tests. Also, locally running the same test command of one Jenkins profile that failed: `mvn -Phadoop-2.6 -Pyarn -Phive -Phive-thriftserver -Pkinesis-asl ...`

Author: Sean Owen <sowen@cloudera.com>

Closes #13117 from srowen/SPARK-12972.2.
2016-05-15 15:56:46 +01:00
wm624@hotmail.com 354f8f11bd [SPARK-15096][ML] LogisticRegression MultiClassSummarizer numClasses can fail if no valid labels are found
## What changes were proposed in this pull request?

(Please fill in changes proposed in this fix)
Throw better exception when numClasses is empty and empty.max is thrown.

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
Add a new unit test, which calls histogram with empty numClasses.

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes #12969 from wangmiao1981/logisticR.
2016-05-14 09:45:56 +01:00
Nicholas Tietz 0f1f31d3a6 [SPARK-15197][DOCS] Added Scaladoc for countApprox and countByValueApprox parameters
This pull request simply adds Scaladoc documentation of the parameters for countApprox and countByValueApprox.

This is an important documentation change, as it clarifies what should be passed in for the timeout. Without units, this was previously unclear.

I did not open a JIRA ticket per my understanding of the project contribution guidelines; as they state, the description in the ticket would be essentially just what is in the PR. If I should open one, let me know and I will do so.

Author: Nicholas Tietz <nicholas.tietz@crosschx.com>

Closes #12955 from ntietz/rdd-countapprox-docs.
2016-05-14 09:44:20 +01:00
Tejas Patil 4210e2a6b7 [TRIVIAL] Add () to SparkSession's builder function
## What changes were proposed in this pull request?

Was trying out `SparkSession` for the first time and the given class doc (when copied as is) did not work over Spark shell:

```
scala> SparkSession.builder().master("local").appName("Word Count").getOrCreate()
<console>:27: error: org.apache.spark.sql.SparkSession.Builder does not take parameters
       SparkSession.builder().master("local").appName("Word Count").getOrCreate()
```

Adding () to the builder method in SparkSession.

## How was this patch tested?

```
scala> SparkSession.builder().master("local").appName("Word Count").getOrCreate()
res0: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession65c17e38

scala> SparkSession.builder.master("local").appName("Word Count").getOrCreate()
res1: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession65c17e38
```

Author: Tejas Patil <tejasp@fb.com>

Closes #13086 from tejasapatil/doc_correction.
2016-05-13 18:10:22 -07:00
hyukjinkwon 3ded5bc4db [SPARK-15267][SQL] Refactor options for JDBC and ORC data sources and change default compression for ORC
## What changes were proposed in this pull request?

Currently, Parquet, JSON and CSV data sources have a class for thier options, (`ParquetOptions`, `JSONOptions` and `CSVOptions`).

It is convenient to manage options for sources to gather options into a class. Currently, `JDBC`, `Text`, `libsvm` and `ORC` datasources do not have this class. This might be nicer if these options are in a unified format so that options can be added and

This PR refactors the options in Spark internal data sources adding new classes, `OrcOptions`, `TextOptions`, `JDBCOptions` and `LibSVMOptions`.

Also, this PR change the default compression codec for ORC from `NONE` to `SNAPPY`.

## How was this patch tested?

Existing tests should cover this for refactoring and unittests in `OrcHadoopFsRelationSuite` for changing the default compression codec for ORC.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #13048 from HyukjinKwon/SPARK-15267.
2016-05-13 09:04:37 -07:00
Sean Owen 10a8389674 Revert "[SPARK-12972][CORE] Update org.apache.httpcomponents.httpclient"
This reverts commit c74a6c3f23.
2016-05-13 13:50:26 +01:00
Sean Owen c74a6c3f23 [SPARK-12972][CORE] Update org.apache.httpcomponents.httpclient
## What changes were proposed in this pull request?

- update httpcore/httpclient to latest
- centralize version management
- remove excludes that are no longer relevant according to SBT/Maven dep graphs
- also manage httpmime to match httpclient

## How was this patch tested?

Jenkins tests, plus review of dependency graphs from SBT/Maven, and review of test-dependencies.sh  output

Author: Sean Owen <sowen@cloudera.com>

Closes #13049 from srowen/SPARK-12972.
2016-05-13 09:00:50 +01:00
Holden Karau 382dbc12bb [SPARK-15061][PYSPARK] Upgrade to Py4J 0.10.1
## What changes were proposed in this pull request?

This upgrades to Py4J 0.10.1 which reduces syscal overhead in Java gateway ( see https://github.com/bartdag/py4j/issues/201 ). Related https://issues.apache.org/jira/browse/SPARK-6728 .

## How was this patch tested?

Existing doctests & unit tests pass

Author: Holden Karau <holden@us.ibm.com>

Closes #13064 from holdenk/SPARK-15061-upgrade-to-py4j-0.10.1.
2016-05-13 08:59:18 +01:00
wm624@hotmail.com bdff299f9e [SPARK-14900][ML] spark.ml classification metrics should include accuracy
## What changes were proposed in this pull request?

(Please fill in changes proposed in this fix)
Add accuracy to MulticlassMetrics class and add corresponding code in MulticlassClassificationEvaluator.

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)
Scala Unit tests in ml.evaluation

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes #12882 from wangmiao1981/accuracy.
2016-05-13 08:29:37 +01:00
Reynold Xin e1dc853737 [SPARK-15310][SQL] Rename HiveTypeCoercion -> TypeCoercion
## What changes were proposed in this pull request?
We originally designed the type coercion rules to match Hive, but over time we have diverged. It does not make sense to call it HiveTypeCoercion anymore. This patch renames it TypeCoercion.

## How was this patch tested?
Updated unit tests to reflect the rename.

Author: Reynold Xin <rxin@databricks.com>

Closes #13091 from rxin/SPARK-15310.
2016-05-13 00:15:39 -07:00
BenFradet 31f1aebbeb [SPARK-13961][ML] spark.ml ChiSqSelector and RFormula should support other numeric types for label
## What changes were proposed in this pull request?

Made ChiSqSelector and RFormula accept all numeric types for label

## How was this patch tested?

Unit tests

Author: BenFradet <benjamin.fradet@gmail.com>

Closes #12467 from BenFradet/SPARK-13961.
2016-05-13 09:08:04 +02:00
sethah 5b849766ab [SPARK-15181][ML][PYSPARK] Python API for GLR summaries.
## What changes were proposed in this pull request?

This patch adds a python API for generalized linear regression summaries (training and test). This helps provide feature parity for Python GLMs.

## How was this patch tested?

Added a unit test to `pyspark.ml.tests`

Author: sethah <seth.hendrickson16@gmail.com>

Closes #12961 from sethah/GLR_summary.
2016-05-13 09:01:20 +02:00
Zheng RuiFeng 87d69a01f0 [MINOR][PYSPARK] update _shared_params_code_gen.py
## What changes were proposed in this pull request?

1, add arg-checkings for `tol` and `stepSize` to  keep in line with `SharedParamsCodeGen.scala`
2, fix one typo

## How was this patch tested?
local build

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #12996 from zhengruifeng/py_args_checking.
2016-05-13 08:52:06 +02:00
Holden Karau d1aadea05a [SPARK-15188] Add missing thresholds param to NaiveBayes in PySpark
## What changes were proposed in this pull request?

Add missing thresholds param to NiaveBayes

## How was this patch tested?
doctests

Author: Holden Karau <holden@us.ibm.com>

Closes #12963 from holdenk/SPARK-15188-add-missing-naive-bayes-param.
2016-05-13 08:39:59 +02:00
hyukjinkwon 51841d77d9 [SPARK-13866] [SQL] Handle decimal type in CSV inference at CSV data source.
## What changes were proposed in this pull request?

https://issues.apache.org/jira/browse/SPARK-13866

This PR adds the support to infer `DecimalType`.
Here are the rules between `IntegerType`, `LongType` and `DecimalType`.

#### Infering Types

1. `IntegerType` and then `LongType`are tried first.

  ```scala
  Int.MaxValue => IntegerType
  Long.MaxValue => LongType
  ```

2. If it fails, try `DecimalType`.

  ```scala
  (Long.MaxValue + 1) => DecimalType(20, 0)
  ```
  This does not try to infer this as `DecimalType` when scale is less than 0.

3. if it fails, try `DoubleType`
  ```scala
  0.1 => DoubleType // This is failed to be inferred as `DecimalType` because it has the scale, 1.
  ```

#### Compatible Types (Merging Types)

For merging types, this is the same with JSON data source. If `DecimalType` is not capable, then it becomes `DoubleType`

## How was this patch tested?

Unit tests were used and `./dev/run_tests` for code style test.

Author: hyukjinkwon <gurwls223@gmail.com>
Author: Hyukjin Kwon <gurwls223@gmail.com>

Closes #11724 from HyukjinKwon/SPARK-13866.
2016-05-12 22:31:14 -07:00