Commit graph

1764 commits

Author SHA1 Message Date
Burak Köse e20cd9f4ce [SPARK-14050][ML] Add multiple languages support and additional methods for Stop Words Remover
## What changes were proposed in this pull request?

This PR continues the work from #11871 with the following changes:
* load English stopwords as default
* covert stopwords to list in Python
* update some tests and doc

## How was this patch tested?

Unit tests.

Closes #11871

cc: burakkose srowen

Author: Burak Köse <burakks41@gmail.com>
Author: Xiangrui Meng <meng@databricks.com>
Author: Burak KOSE <burakks41@gmail.com>

Closes #12843 from mengxr/SPARK-14050.
2016-05-06 13:58:12 -07:00
Holden Karau 4c0d827cfc [SPARK-15106][PYSPARK][ML] Add PySpark package doc for ML component & remove "BETA"
## What changes were proposed in this pull request?

Copy the package documentation from Scala/Java to Python for ML package and remove beta tags. Not super sure if we want to keep the BETA tag but since we are making it the default it seems like probably the time to remove it (happy to put it back in if we want to keep it BETA).

## How was this patch tested?

Python documentation built locally as HTML and text and verified output.

Author: Holden Karau <holden@us.ibm.com>

Closes #12883 from holdenk/SPARK-15106-add-pyspark-package-doc-for-ml.
2016-05-05 10:52:25 +01:00
Davies Liu 4283741956 [MINOR] remove dead code 2016-05-04 21:30:13 -07:00
Andrew Or fa79d346e1 [SPARK-14896][SQL] Deprecate HiveContext in python
## What changes were proposed in this pull request?

See title.

## How was this patch tested?

PySpark tests.

Author: Andrew Or <andrew@databricks.com>

Closes #12917 from andrewor14/deprecate-hive-context-python.
2016-05-04 17:39:30 -07:00
Dongjoon Hyun cdce4e62a5 [SPARK-15031][EXAMPLE] Use SparkSession in Scala/Python/Java example.
## What changes were proposed in this pull request?

This PR aims to update Scala/Python/Java examples by replacing `SQLContext` with newly added `SparkSession`.

- Use **SparkSession Builder Pattern** in 154(Scala 55, Java 52, Python 47) files.
- Add `getConf` in Python SparkContext class: `python/pyspark/context.py`
- Replace **SQLContext Singleton Pattern** with **SparkSession Singleton Pattern**:
  - `SqlNetworkWordCount.scala`
  - `JavaSqlNetworkWordCount.java`
  - `sql_network_wordcount.py`

Now, `SQLContexts` are used only in R examples and the following two Python examples. The python examples are untouched in this PR since it already fails some unknown issue.
- `simple_params_example.py`
- `aft_survival_regression.py`

## How was this patch tested?

Manual.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #12809 from dongjoon-hyun/SPARK-15031.
2016-05-04 14:31:36 -07:00
Reynold Xin 6ae9fc00ed [SPARK-15126][SQL] RuntimeConfig.set should return Unit
## What changes were proposed in this pull request?
Currently we return RuntimeConfig itself to facilitate chaining. However, it makes the output in interactive environments (e.g. notebooks, scala repl) weird because it'd show the response of calling set as a RuntimeConfig itself.

## How was this patch tested?
Updated unit tests.

Author: Reynold Xin <rxin@databricks.com>

Closes #12902 from rxin/SPARK-15126.
2016-05-04 14:26:05 -07:00
Dongjoon Hyun 0903a185c7 [SPARK-15084][PYTHON][SQL] Use builder pattern to create SparkSession in PySpark.
## What changes were proposed in this pull request?

This is a python port of corresponding Scala builder pattern code. `sql.py` is modified as a target example case.

## How was this patch tested?

Manual.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #12860 from dongjoon-hyun/SPARK-15084.
2016-05-03 18:05:40 -07:00
François Garillot 439e361010 [SPARK-9819][STREAMING][DOCUMENTATION] Clarify doc for invReduceFunc in incremental versions of reduceByWindow
- that reduceFunc and invReduceFunc should be associative
- that the intermediate result in iterated applications of inverseReduceFunc
  is its first argument

Author: François Garillot <francois@garillot.net>

Closes #8103 from huitseeker/issue/invReduceFuncDoc.
2016-05-03 11:42:47 -07:00
Tathagata Das 4ad492c403 [SPARK-14716][SQL] Added support for partitioning in FileStreamSink
# What changes were proposed in this pull request?

Support partitioning in the file stream sink. This is implemented using a new, but simpler code path for writing parquet files - both unpartitioned and partitioned. This new code path does not use Output Committers, as we will eventually write the file names to the metadata log for "committing" them.

This patch duplicates < 100 LOC from the WriterContainer. But its far simpler that WriterContainer as it does not involve output committing. In addition, it introduces the new APIs in FileFormat and OutputWriterFactory in an attempt to simplify the APIs (not have Job in the `FileFormat` API, not have bucket and other stuff in the `OutputWriterFactory.newInstance()` ).

# Tests
- New unit tests to test the FileStreamSinkWriter for partitioned and unpartitioned files
- New unit test to partially test the FileStreamSink for partitioned files (does not test recovery of partition column data, as that requires change in the StreamFileCatalog, future PR).
- Updated FileStressSuite to test number of records read from partitioned output files.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #12409 from tdas/streaming-partitioned-parquet.
2016-05-03 10:58:26 -07:00
Yanbo Liang d26f7cb012 [SPARK-14971][ML][PYSPARK] PySpark ML Params setter code clean up
## What changes were proposed in this pull request?
PySpark ML Params setter code clean up.
For examples,
```setInputCol``` can be simplified from
```
self._set(inputCol=value)
return self
```
to:
```
return self._set(inputCol=value)
```
This is a pretty big sweeps, and we cleaned wherever possible.
## How was this patch tested?
Exist unit tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #12749 from yanboliang/spark-14971.
2016-05-03 16:46:13 +02:00
hyukjinkwon d37c7f7f04 [SPARK-15050][SQL] Put CSV and JSON options as Python csv and json function parameters
## What changes were proposed in this pull request?

https://issues.apache.org/jira/browse/SPARK-15050

This PR adds function parameters for Python API for reading and writing `csv()`.

## How was this patch tested?

This was tested by `./dev/run_tests`.

Author: hyukjinkwon <gurwls223@gmail.com>
Author: Hyukjin Kwon <gurwls223@gmail.com>

Closes #12834 from HyukjinKwon/SPARK-15050.
2016-05-02 17:50:40 -07:00
hyukjinkwon a832cef112 [SPARK-13425][SQL] Documentation for CSV datasource options
## What changes were proposed in this pull request?

This PR adds the explanation and documentation for CSV options for reading and writing.

## How was this patch tested?

Style tests with `./dev/run_tests` for documentation style.

Author: hyukjinkwon <gurwls223@gmail.com>
Author: Hyukjin Kwon <gurwls223@gmail.com>

Closes #12817 from HyukjinKwon/SPARK-13425.
2016-05-01 19:05:20 -07:00
Xusen Yin a6428292f7 [SPARK-14931][ML][PYTHON] Mismatched default values between pipelines in Spark and PySpark - update
## What changes were proposed in this pull request?

This PR is an update for [https://github.com/apache/spark/pull/12738] which:
* Adds a generic unit test for JavaParams wrappers in pyspark.ml for checking default Param values vs. the defaults in the Scala side
* Various fixes for bugs found
  * This includes changing classes taking weightCol to treat unset and empty String Param values the same way.

Defaults changed:
* Scala
 * LogisticRegression: weightCol defaults to not set (instead of empty string)
 * StringIndexer: labels default to not set (instead of empty array)
 * GeneralizedLinearRegression:
   * maxIter always defaults to 25 (simpler than defaulting to 25 for a particular solver)
   * weightCol defaults to not set (instead of empty string)
 * LinearRegression: weightCol defaults to not set (instead of empty string)
* Python
 * MultilayerPerceptron: layers default to not set (instead of [1,1])
 * ChiSqSelector: numTopFeatures defaults to 50 (instead of not set)

## How was this patch tested?

Generic unit test.  Manually tested that unit test by changing defaults and verifying that broke the test.

Author: Joseph K. Bradley <joseph@databricks.com>
Author: yinxusen <yinxusen@gmail.com>

Closes #12816 from jkbradley/yinxusen-SPARK-14931.
2016-05-01 12:29:01 -07:00
Herman van Hovell e5fb78baf9 [SPARK-14952][CORE][ML] Remove methods that were deprecated in 1.6.0
#### What changes were proposed in this pull request?

This PR removes three methods the were deprecated in 1.6.0:
- `PortableDataStream.close()`
- `LinearRegression.weights`
- `LogisticRegression.weights`

The rationale for doing this is that the impact is small and that Spark 2.0 is a major release.

#### How was this patch tested?
Compilation succeded.

Author: Herman van Hovell <hvanhovell@questtec.nl>

Closes #12732 from hvanhovell/SPARK-14952.
2016-04-30 16:06:20 +01:00
Junyang 1192fe4cd2 [SPARK-13289][MLLIB] Fix infinite distances between word vectors in Word2VecModel
## What changes were proposed in this pull request?

This PR fixes the bug that generates infinite distances between word vectors. For example,

Before this PR, we have
```
val synonyms = model.findSynonyms("who", 40)
```
will give the following results:
```
to Infinity
and Infinity
that Infinity
with Infinity
```
With this PR, the distance between words is a value between 0 and 1, as follows:
```
scala> model.findSynonyms("who", 10)
res0: Array[(String, Double)] = Array((Harvard-educated,0.5253688097000122), (ex-SAS,0.5213794708251953), (McMutrie,0.5187736749649048), (fellow,0.5166833400726318), (businessman,0.5145374536514282), (American-born,0.5127736330032349), (British-born,0.5062344074249268), (gray-bearded,0.5047978162765503), (American-educated,0.5035858750343323), (mentored,0.49849334359169006))

scala> model.findSynonyms("king", 10)
res1: Array[(String, Double)] = Array((queen,0.6787897944450378), (prince,0.6786158084869385), (monarch,0.659771203994751), (emperor,0.6490438580513), (goddess,0.643266499042511), (dynasty,0.635733425617218), (sultan,0.6166239380836487), (pharaoh,0.6150713562965393), (birthplace,0.6143025159835815), (empress,0.6109727025032043))

scala> model.findSynonyms("queen", 10)
res2: Array[(String, Double)] = Array((princess,0.7670737504959106), (godmother,0.6982434988021851), (raven-haired,0.6877717971801758), (swan,0.684934139251709), (hunky,0.6816608309745789), (Titania,0.6808111071586609), (heroine,0.6794036030769348), (king,0.6787897944450378), (diva,0.67848801612854), (lip-synching,0.6731793284416199))
```

### There are two places changed in this PR:
- Normalize the word vector to avoid overflow when calculating inner product between word vectors. This also simplifies the distance calculation, since the word vectors only need to be normalized once.
- Scale the learning rate by number of iteration, to be consistent with Google Word2Vec implementation

## How was this patch tested?

Use word2vec to train text corpus, and run model.findSynonyms() to get the distances between word vectors.

Author: Junyang <fly.shenjy@gmail.com>
Author: flyskyfly <fly.shenjy@gmail.com>

Closes #11812 from flyjy/TVec.
2016-04-30 10:16:35 +01:00
Xiangrui Meng 7fbe1bb24d [SPARK-14412][.2][ML] rename *RDDStorageLevel to *StorageLevel in ml.ALS
## What changes were proposed in this pull request?

As discussed in #12660, this PR renames
* intermediateRDDStorageLevel -> intermediateStorageLevel
* finalRDDStorageLevel -> finalStorageLevel

The argument name in `ALS.train` will be addressed in SPARK-15027.

## How was this patch tested?

Existing unit tests.

Author: Xiangrui Meng <meng@databricks.com>

Closes #12803 from mengxr/SPARK-14412.
2016-04-30 00:41:28 -07:00
Nick Pentreath 90fa2c6e7f [SPARK-14412][ML][PYSPARK] Add StorageLevel params to ALS
`mllib` `ALS` supports `setIntermediateRDDStorageLevel` and `setFinalRDDStorageLevel`. This PR adds these as Params in `ml` `ALS`. They are put in group **expertParam** since few users will need them.

## How was this patch tested?

New test cases in `ALSSuite` and `tests.py`.

cc yanboliang jkbradley sethah rishabhbhardwaj

Author: Nick Pentreath <nickp@za.ibm.com>

Closes #12660 from MLnick/SPARK-14412-als-storage-params.
2016-04-29 22:01:41 -07:00
Joseph K. Bradley 09da43d514 [SPARK-13786][ML][PYTHON] Removed save/load for python tuning
## What changes were proposed in this pull request?

Per discussion on [https://github.com/apache/spark/pull/12604], this removes ML persistence for Python tuning (TrainValidationSplit, CrossValidator, and their Models) since they do not handle nesting easily.  This support should be re-designed and added in the next release.

## How was this patch tested?

Removed unit test elements saving and loading the tuning algorithms, but kept tests to save and load their bestModel fields.

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #12782 from jkbradley/remove-python-tuning-saveload.
2016-04-29 20:51:24 -07:00
Andrew Or 66773eb8a5 [SPARK-15012][SQL] Simplify configuration API further
## What changes were proposed in this pull request?

1. Remove all the `spark.setConf` etc. Just expose `spark.conf`
2. Make `spark.conf` take in things set in the core `SparkConf` as well, otherwise users may get confused

This was done for both the Python and Scala APIs.

## How was this patch tested?
`SQLConfSuite`, python tests.

This one fixes the failed tests in #12787

Closes #12787

Author: Andrew Or <andrew@databricks.com>
Author: Yin Huai <yhuai@databricks.com>

Closes #12798 from yhuai/conf-api.
2016-04-29 20:46:07 -07:00
Andrew Or d33e3d572e [SPARK-14988][PYTHON] SparkSession API follow-ups
## What changes were proposed in this pull request?

Addresses comments in #12765.

## How was this patch tested?

Python tests.

Author: Andrew Or <andrew@databricks.com>

Closes #12784 from andrewor14/python-followup.
2016-04-29 16:41:13 -07:00
Jeff Zhang 775772de36 [SPARK-11940][PYSPARK][ML] Python API for ml.clustering.LDA PR2
## What changes were proposed in this pull request?

pyspark.ml API for LDA
* LDA, LDAModel, LocalLDAModel, DistributedLDAModel
* includes persistence

This replaces [https://github.com/apache/spark/pull/10242]

## How was this patch tested?

* doc test for LDA, including Param setters
* unit test for persistence

Author: Joseph K. Bradley <joseph@databricks.com>
Author: Jeff Zhang <zjffdu@apache.org>

Closes #12723 from jkbradley/zjffdu-SPARK-11940.
2016-04-29 10:42:52 -07:00
Andrew Or a7d0fedc94 [SPARK-14988][PYTHON] SparkSession catalog and conf API
## What changes were proposed in this pull request?

The `catalog` and `conf` APIs were exposed in `SparkSession` in #12713 and #12669. This patch adds those to the python API.

## How was this patch tested?

Python tests.

Author: Andrew Or <andrew@databricks.com>

Closes #12765 from andrewor14/python-spark-session-more.
2016-04-29 09:34:10 -07:00
Zheng RuiFeng cabd54d931 [SPARK-14829][MLLIB] Deprecate GLM APIs using SGD
## What changes were proposed in this pull request?
According to the [SPARK-14829](https://issues.apache.org/jira/browse/SPARK-14829), deprecate API of LogisticRegression and LinearRegression using SGD

## How was this patch tested?
manual tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #12596 from zhengruifeng/deprecate_sgd.
2016-04-28 22:44:14 -07:00
Burak Yavuz 78c8aaf849 [SPARK-14555] Second cut of Python API for Structured Streaming
## What changes were proposed in this pull request?

This PR adds Python APIs for:
 - `ContinuousQueryManager`
 - `ContinuousQueryException`

The `ContinuousQueryException` is a very basic wrapper, it doesn't provide the functionality that the Scala side provides, but it follows the same pattern for `AnalysisException`.

For `ContinuousQueryManager`, all APIs are provided except for registering listeners.

This PR also attempts to fix test flakiness by stopping all active streams just before tests.

## How was this patch tested?

Python Doc tests and unit tests

Author: Burak Yavuz <brkyvz@gmail.com>

Closes #12673 from brkyvz/pyspark-cqm.
2016-04-28 15:22:28 -07:00
Kai Jiang d584a2b8ac [SPARK-12810][PYSPARK] PySpark CrossValidatorModel should support avgMetrics
## What changes were proposed in this pull request?
support avgMetrics in CrossValidatorModel with Python
## How was this patch tested?
Doctest and `test_save_load` in `pyspark/ml/test.py`
[JIRA](https://issues.apache.org/jira/browse/SPARK-12810)

Author: Kai Jiang <jiangkai@gmail.com>

Closes #12464 from vectorijk/spark-12810.
2016-04-28 14:19:11 -07:00
Andrew Or 89addd40ab [SPARK-14945][PYTHON] SparkSession Python API
## What changes were proposed in this pull request?

```
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 2.0.0-SNAPSHOT
      /_/

Using Python version 2.7.5 (default, Mar  9 2014 22:15:05)
SparkSession available as 'spark'.
>>> spark
<pyspark.sql.session.SparkSession object at 0x101f3bfd0>
>>> spark.sql("SHOW TABLES").show()
...
+---------+-----------+
|tableName|isTemporary|
+---------+-----------+
|      src|      false|
+---------+-----------+

>>> spark.range(1, 10, 2).show()
+---+
| id|
+---+
|  1|
|  3|
|  5|
|  7|
|  9|
+---+
```
**Note**: This API is NOT complete in its current state. In particular, for now I left out the `conf` and `catalog` APIs, which were added later in Scala. These will be added later before 2.0.

## How was this patch tested?

Python tests.

Author: Andrew Or <andrew@databricks.com>

Closes #12746 from andrewor14/python-spark-session.
2016-04-28 10:55:48 -07:00
Yanbo Liang 4672e9838b [SPARK-14899][ML][PYSPARK] Remove spark.ml HashingTF hashingAlg option
## What changes were proposed in this pull request?
Since [SPARK-10574](https://issues.apache.org/jira/browse/SPARK-10574) breaks behavior of ```HashingTF```, we should try to enforce good practice by removing the "native" hashAlgorithm option in spark.ml and pyspark.ml. We can leave spark.mllib and pyspark.mllib alone.

## How was this patch tested?
Unit tests.

cc jkbradley

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #12702 from yanboliang/spark-14899.
2016-04-27 14:08:26 -07:00
Mike Dusenberry 607f50341c [SPARK-9656][MLLIB][PYTHON] Add missing methods to PySpark's Distributed Linear Algebra Classes
This PR adds the remaining group of methods to PySpark's distributed linear algebra classes as follows:

* `RowMatrix` <sup>**[1]**</sup>
  1. `computeGramianMatrix`
  2. `computeCovariance`
  3. `computeColumnSummaryStatistics`
  4. `columnSimilarities`
  5. `tallSkinnyQR` <sup>**[2]**</sup>
* `IndexedRowMatrix` <sup>**[3]**</sup>
  1. `computeGramianMatrix`
* `CoordinateMatrix`
  1. `transpose`
* `BlockMatrix`
  1. `validate`
  2. `cache`
  3. `persist`
  4. `transpose`

**[1]**: Note: `multiply`, `computeSVD`, and `computePrincipalComponents` are already part of PR #7963 for SPARK-6227.
**[2]**: Implementing `tallSkinnyQR` uncovered a bug with our PySpark `RowMatrix` constructor.  As discussed on the dev list [here](http://apache-spark-developers-list.1001551.n3.nabble.com/K-Means-And-Class-Tags-td10038.html), there appears to be an issue with type erasure with RDDs coming from Java, and by extension from PySpark.  Although we are attempting to construct a `RowMatrix` from an `RDD[Vector]` in [PythonMLlibAPI](https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala#L1115), the `Vector` type is erased, resulting in an `RDD[Object]`.  Thus, when calling Scala's `tallSkinnyQR` from PySpark, we get a Java `ClassCastException` in which an `Object` cannot be cast to a Spark `Vector`.  As noted in the aforementioned dev list thread, this issue was also encountered with `DecisionTrees`, and the fix involved an explicit `retag` of the RDD with a `Vector` type.  Thus, this PR currently contains that fix applied to the `createRowMatrix` helper function in `PythonMLlibAPI`.  `IndexedRowMatrix` and `CoordinateMatrix` do not appear to have this issue likely due to their related helper functions in `PythonMLlibAPI` creating the RDDs explicitly from DataFrames with pattern matching, thus preserving the types.  However, this fix may be out of scope for this single PR, and it may be better suited in a separate JIRA/PR.  Therefore, I have marked this PR as WIP and am open to discussion.
**[3]**: Note: `multiply` and `computeSVD` are already part of PR #7963 for SPARK-6227.

Author: Mike Dusenberry <mwdusenb@us.ibm.com>

Closes #9441 from dusenberrymw/SPARK-9656_Add_Missing_Methods_to_PySpark_Distributed_Linear_Algebra.
2016-04-27 19:48:05 +02:00
Joseph K. Bradley bd2c9a6d48 [SPARK-14732][ML] spark.ml GaussianMixture should use MultivariateGaussian in mllib-local
## What changes were proposed in this pull request?

Before, spark.ml GaussianMixtureModel used the spark.mllib MultivariateGaussian in its public API.  This was added after 1.6, so we can modify this API without breaking APIs.

This PR copies MultivariateGaussian to mllib-local in spark.ml, with a few changes:
* Renamed fields to match numpy, scipy: mu => mean, sigma => cov

This PR then uses the spark.ml MultivariateGaussian in the spark.ml GaussianMixtureModel, which involves:
* Modifying the constructor
* Adding a computeProbabilities method

Also:
* Added EPSILON to mllib-local for use in MultivariateGaussian

## How was this patch tested?

Existing unit tests

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #12593 from jkbradley/sparkml-gmm-fix.
2016-04-26 16:53:16 -07:00
Joseph K. Bradley 89f082de0e [SPARK-14903][SPARK-14071][ML][PYTHON] Revert : MLWritable.write property
## What changes were proposed in this pull request?

SPARK-14071 changed MLWritable.write to be a property.  This reverts that change since there was not a good way to make MLReadable.read appear to be a property.

## How was this patch tested?

existing unit tests

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #12671 from jkbradley/revert-MLWritable-write-py.
2016-04-26 12:00:57 -07:00
Yanbo Liang 302a186869 [SPARK-11559][MLLIB] Make runs no effect in mllib.KMeans
## What changes were proposed in this pull request?
We deprecated  ```runs``` of mllib.KMeans in Spark 1.6 (SPARK-11358). In 2.0, we will make it no effect (with warning messages). We did not remove ```setRuns/getRuns``` for better binary compatibility.
This PR change `runs` which are appeared at the public API. Usage inside of ```KMeans.runAlgorithm()``` will be resolved at #10806.

## How was this patch tested?
Existing unit tests.

cc jkbradley

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #12608 from yanboliang/spark-11559.
2016-04-26 11:55:21 -07:00
Andrew Or 3c5e65c339 [SPARK-14721][SQL] Remove HiveContext (part 2)
## What changes were proposed in this pull request?

This removes the class `HiveContext` itself along with all code usages associated with it. The bulk of the work was already done in #12485. This is mainly just code cleanup and actually removing the class.

Note: A couple of things will break after this patch. These will be fixed separately.
- the python HiveContext
- all the documentation / comments referencing HiveContext
- there will be no more HiveContext in the REPL (fixed by #12589)

## How was this patch tested?

No change in functionality.

Author: Andrew Or <andrew@databricks.com>

Closes #12585 from andrewor14/delete-hive-context.
2016-04-25 13:23:05 -07:00
Yanbo Liang 425f691646 [SPARK-10574][ML][MLLIB] HashingTF supports MurmurHash3
## What changes were proposed in this pull request?
As the discussion at [SPARK-10574](https://issues.apache.org/jira/browse/SPARK-10574), ```HashingTF``` should support MurmurHash3 and make it as the default hash algorithm. We should also expose set/get API for ```hashAlgorithm```, then users can choose the hash method.

Note: The problem that ```mllib.feature.HashingTF``` behaves differently between Scala/Java and Python will be resolved in the followup work.

## How was this patch tested?
unit tests.

cc jkbradley MLnick

Author: Yanbo Liang <ybliang8@gmail.com>
Author: Joseph K. Bradley <joseph@databricks.com>

Closes #12498 from yanboliang/spark-10574.
2016-04-25 12:08:43 -07:00
Joseph K. Bradley c7758ba384 [MINOR][ML][PYTHON][DOC] Remove use of JavaMLWriter/Reader in public Python API docs
## What changes were proposed in this pull request?

Removed instances of JavaMLWriter, JavaMLReader appearing in public Python API docs

## How was this patch tested?

n/a

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #12542 from jkbradley/javamlwriter-doc.
2016-04-25 11:02:32 -07:00
wm624@hotmail.com b50e2eca93 [SPARK-14433][PYSPARK][ML] PySpark ml GaussianMixture
## What changes were proposed in this pull request?

Add Python API in ML for GaussianMixture

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)

Add doctest and test cases are the same as mllib Python tests
./dev/lint-python
PEP8 checks passed.
rm -rf _build/*
pydoc checks passed.

./python/run-tests --python-executables=python2.7 --modules=pyspark-ml
Running PySpark tests. Output is in /Users/mwang/spark_ws_0904/python/unit-tests.log
Will test against the following Python executables: ['python2.7']
Will test the following Python modules: ['pyspark-ml']
Finished test(python2.7): pyspark.ml.evaluation (18s)
Finished test(python2.7): pyspark.ml.clustering (40s)
Finished test(python2.7): pyspark.ml.classification (49s)
Finished test(python2.7): pyspark.ml.recommendation (44s)
Finished test(python2.7): pyspark.ml.feature (64s)
Finished test(python2.7): pyspark.ml.regression (45s)
Finished test(python2.7): pyspark.ml.tuning (30s)
Finished test(python2.7): pyspark.ml.tests (56s)
Tests passed in 106 seconds

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes #12402 from wangmiao1981/gmm.
2016-04-25 10:48:15 -07:00
Jason Lee bfda099913 [SPARK-14768][ML][PYSPARK] removed expectedType from Param __init__()
## What changes were proposed in this pull request?
Removed expectedType arg from PySpark Param __init__, as suggested by the JIRA.

## How was this patch tested?
Manually looked through all places that use Param. Compiled and ran all ML PySpark test cases before and after the fix.

Author: Jason Lee <cjlee@us.ibm.com>

Closes #12581 from jasoncl/SPARK-14768.
2016-04-25 15:32:11 +02:00
mathieu longtin 902c15c5e6 Support single argument version of sqlContext.getConf
## What changes were proposed in this pull request?

In Python, sqlContext.getConf didn't allow getting the system default (getConf with one parameter).

Now the following are supported:
```
sqlContext.getConf(confName)  # System default if not locally set, this is new
sqlContext.getConf(confName, myDefault)  # myDefault if not locally set, old behavior
```

I also added doctests to this function. The original behavior does not change.

## How was this patch tested?

Manually, but doctests were added.

Author: mathieu longtin <mathieu.longtin@nuance.com>

Closes #12488 from mathieulongtin/pyfixgetconf3.
2016-04-23 22:38:36 -07:00
Liang-Chi Hsieh 056883e070 [SPARK-13266] [SQL] None read/writer options were not transalated to "null"
## What changes were proposed in this pull request?

In Python, the `option` and `options` method of `DataFrameReader` and `DataFrameWriter` were sending the string "None" instead of `null` when passed `None`, therefore making it impossible to send an actual `null`. This fixes that problem.

This is based on #11305 from mathieulongtin.

## How was this patch tested?

Added test to readwriter.py.

Author: Liang-Chi Hsieh <simonh@tw.ibm.com>
Author: mathieu longtin <mathieu.longtin@nuance.com>

Closes #12494 from viirya/py-df-none-option.
2016-04-22 09:19:36 -07:00
Arash Parsa 2b8906c437 [SPARK-14739][PYSPARK] Fix Vectors parser bugs
## What changes were proposed in this pull request?

The PySpark deserialization has a bug that shows while deserializing all zero sparse vectors. This fix filters out empty string tokens before casting, hence properly stringified SparseVectors successfully get parsed.

## How was this patch tested?

Standard unit-tests similar to other methods.

Author: Arash Parsa <arash@ip-192-168-50-106.ec2.internal>
Author: Arash Parsa <arashpa@gmail.com>
Author: Vishnu Prasad <vishnu667@gmail.com>
Author: Vishnu Prasad S <vishnu667@gmail.com>

Closes #12516 from arashpa/SPARK-14739.
2016-04-21 11:29:24 +01:00
Sheamus K. Parkes e7791c4f69 [SPARK-13842] [PYSPARK] pyspark.sql.types.StructType accessor enhancements
## What changes were proposed in this pull request?

Expand the possible ways to interact with the contents of a `pyspark.sql.types.StructType` instance.
  - Iterating a `StructType` will iterate its fields
    - `[field.name for field in my_structtype]`
  - Indexing with a string will return a field by name
    - `my_structtype['my_field_name']`
  - Indexing with an integer will return a field by position
    - `my_structtype[0]`
  - Indexing with a slice will return a new `StructType` with just the chosen fields:
    - `my_structtype[1:3]`
  - The length is the number of fields (should also provide "truthiness" for free)
    - `len(my_structtype) == 2`

## How was this patch tested?

Extended the unit test coverage in the accompanying `tests.py`.

Author: Sheamus K. Parkes <shea.parkes@milliman.com>

Closes #12251 from skparkes/pyspark-structtype-enhance.
2016-04-20 13:45:14 -07:00
Yanbo Liang 296c384aff [MINOR][ML][PYSPARK] Fix omissive params which should use TypeConverter
## What changes were proposed in this pull request?
#11663 adds type conversion functionality for parameters in Pyspark. This PR find out the omissive ```Param``` that did not pass corresponding ```TypeConverter``` argument and fix them. After this PR, all params in pyspark/ml/ used ```TypeConverter```.

## How was this patch tested?
Existing tests.

cc jkbradley sethah

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #12529 from yanboliang/typeConverter.
2016-04-20 13:02:37 -07:00
Yanbo Liang 08f84d7a9a [MINOR][ML][PYSPARK] Fix omissive param setters which should use _set method
## What changes were proposed in this pull request?
#11939 make Python param setters use the `_set` method. This PR fix omissive ones.

## How was this patch tested?
Existing tests.

cc jkbradley sethah

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #12531 from yanboliang/setters-omissive.
2016-04-20 20:06:27 +02:00
Burak Yavuz 80bf48f437 [SPARK-14555] First cut of Python API for Structured Streaming
## What changes were proposed in this pull request?

This patch provides a first cut of python APIs for structured streaming. This PR provides the new classes:
 - ContinuousQuery
 - Trigger
 - ProcessingTime
in pyspark under `pyspark.sql.streaming`.

In addition, it contains the new methods added under:
 -  `DataFrameWriter`
     a) `startStream`
     b) `trigger`
     c) `queryName`

 -  `DataFrameReader`
     a) `stream`

 - `DataFrame`
    a) `isStreaming`

This PR doesn't contain all methods exposed for `ContinuousQuery`, for example:
 - `exception`
 - `sourceStatuses`
 - `sinkStatus`

They may be added in a follow up.

This PR also contains some very minor doc fixes in the Scala side.

## How was this patch tested?

Python doc tests

TODO:
 - [ ] verify Python docs look good

Author: Burak Yavuz <brkyvz@gmail.com>
Author: Burak Yavuz <burak@databricks.com>

Closes #12320 from brkyvz/stream-python.
2016-04-20 10:32:01 -07:00
Dongjoon Hyun 14869ae64e [SPARK-14639] [PYTHON] [R] Add bround function in Python/R.
## What changes were proposed in this pull request?

This issue aims to expose Scala `bround` function in Python/R API.
`bround` function is implemented in SPARK-14614 by extending current `round` function.
We used the following semantics from Hive.
```java
public static double bround(double input, int scale) {
    if (Double.isNaN(input) || Double.isInfinite(input)) {
      return input;
    }
    return BigDecimal.valueOf(input).setScale(scale, RoundingMode.HALF_EVEN).doubleValue();
}
```

After this PR, `pyspark` and `sparkR` also support `bround` function.

**PySpark**
```python
>>> from pyspark.sql.functions import bround
>>> sqlContext.createDataFrame([(2.5,)], ['a']).select(bround('a', 0).alias('r')).collect()
[Row(r=2.0)]
```

**SparkR**
```r
> df = createDataFrame(sqlContext, data.frame(x = c(2.5, 3.5)))
> head(collect(select(df, bround(df$x, 0))))
  bround(x, 0)
1            2
2            4
```

## How was this patch tested?

Pass the Jenkins tests (including new testcases).

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #12509 from dongjoon-hyun/SPARK-14639.
2016-04-19 22:28:11 -07:00
felixcheung 3664142350 [SPARK-14717] [PYTHON] Scala, Python APIs for Dataset.unpersist differ in default blocking value
## What changes were proposed in this pull request?

Change unpersist blocking parameter default value to match Scala

## How was this patch tested?

unit tests, manual tests

jkbradley davies

Author: felixcheung <felixcheung_m@hotmail.com>

Closes #12507 from felixcheung/pyunpersist.
2016-04-19 17:29:28 -07:00
Joseph K. Bradley d29e429eeb [SPARK-14714][ML][PYTHON] Fixed issues with non-kwarg typeConverter arg for Param constructor
## What changes were proposed in this pull request?

PySpark Param constructors need to pass the TypeConverter argument by name, partly to make sure it is not mistaken for the expectedType arg and partly because we will remove the expectedType arg in 2.1. In several places, this is not being done correctly.

This PR changes all usages in pyspark/ml/ to keyword args.

## How was this patch tested?

Existing unit tests.  I will not test type conversion for every Param unless we really think it necessary.

Also, if you start the PySpark shell and import classes (e.g., pyspark.ml.feature.StandardScaler), then you no longer get this warning:
```
/Users/josephkb/spark/python/pyspark/ml/param/__init__.py:58: UserWarning: expectedType is deprecated and will be removed in 2.1. Use typeConverter instead, as a keyword argument.
  "Use typeConverter instead, as a keyword argument.")
```
That warning came from the typeConverter argument being passes as the expectedType arg by mistake.

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #12480 from jkbradley/typeconverter-fix.
2016-04-18 17:15:12 -07:00
Xusen Yin f31a62d1b2 [SPARK-14440][PYSPARK] Remove pipeline specific reader and writer
## What changes were proposed in this pull request?

https://issues.apache.org/jira/browse/SPARK-14440

Remove

* PipelineMLWriter
* PipelineMLReader
* PipelineModelMLWriter
* PipelineModelMLReader

and modify comments.

## How was this patch tested?

test with unit test.

Author: Xusen Yin <yinxusen@gmail.com>

Closes #12216 from yinxusen/SPARK-14440.
2016-04-18 13:31:48 -07:00
Jason Lee 3d66a2ce9b [SPARK-14564][ML][MLLIB][PYSPARK] Python Word2Vec missing setWindowSize method
## What changes were proposed in this pull request?
Added windowSize getter/setter to ML/MLlib

## How was this patch tested?
Added test cases in tests.py under both ML and MLlib

Author: Jason Lee <cjlee@us.ibm.com>

Closes #12428 from jasoncl/SPARK-14564.
2016-04-18 12:47:14 -07:00
Xusen Yin b64482f49f [SPARK-14306][ML][PYSPARK] PySpark ml.classification OneVsRest support export/import
## What changes were proposed in this pull request?

https://issues.apache.org/jira/browse/SPARK-14306

Add PySpark OneVsRest save/load supports.

## How was this patch tested?

Test with Python unit test.

Author: Xusen Yin <yinxusen@gmail.com>

Closes #12439 from yinxusen/SPARK-14306-0415.
2016-04-18 11:52:29 -07:00
Joseph K. Bradley 36da5e3234 [SPARK-14605][ML][PYTHON] Changed Python to use unicode UIDs for spark.ml Identifiable
## What changes were proposed in this pull request?

Python spark.ml Identifiable classes use UIDs of type str, but they should use unicode (in Python 2.x) to match Java. This could be a problem if someone created a class in Java with odd unicode characters, saved it, and loaded it in Python.

This PR: Use unicode everywhere in Python.

## How was this patch tested?

Updated persistence unit test to check uid type

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #12368 from jkbradley/python-uid-unicode.
2016-04-16 11:23:28 -07:00
Xusen Yin 90b46e014a [SPARK-7861][ML] PySpark OneVsRest
## What changes were proposed in this pull request?

https://issues.apache.org/jira/browse/SPARK-7861

Add PySpark OneVsRest. I implement it with Python since it's a meta-pipeline.

## How was this patch tested?

Test with doctest.

Author: Xusen Yin <yinxusen@gmail.com>

Closes #12124 from yinxusen/SPARK-14306-7861.
2016-04-15 12:58:38 -07:00
sethah 129f2f455d [SPARK-14104][PYSPARK][ML] All Python param setters should use the _set method
## What changes were proposed in this pull request?

Param setters in python previously accessed the _paramMap directly to update values. The `_set` method now implements type checking, so it should be used to update all parameters. This PR eliminates all direct accesses to `_paramMap` besides the one in the `_set` method to ensure type checking happens.

Additional changes:
* [SPARK-13068](https://github.com/apache/spark/pull/11663) missed adding type converters in evaluation.py so those are done here
* An incorrect `toBoolean` type converter was used for StringIndexer `handleInvalid` param in previous PR. This is fixed here.

## How was this patch tested?

Existing unit tests verify that parameters are still set properly. No new functionality is actually added in this PR.

Author: sethah <seth.hendrickson16@gmail.com>

Closes #11939 from sethah/SPARK-14104.
2016-04-15 12:14:41 -07:00
Joseph K. Bradley d6ae7d4637 [SPARK-14665][ML][PYTHON] Fixed bug with StopWordsRemover default stopwords
## What changes were proposed in this pull request?

The default stopwords were a Java object.  They are no longer.

## How was this patch tested?

Unit test which failed before the fix

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #12422 from jkbradley/pyspark-stopwords.
2016-04-15 11:50:21 -07:00
Yanbo Liang b9613239d3 [SPARK-14374][ML][PYSPARK] PySpark ml GBTClassifier, Regressor support export/import
## What changes were proposed in this pull request?
PySpark ml GBTClassifier, Regressor support export/import.

## How was this patch tested?
Doc test.

cc jkbradley

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #12383 from yanboliang/spark-14374.
2016-04-14 21:36:03 -07:00
Yong Tang bc748b7b8f [SPARK-14238][ML][MLLIB][PYSPARK] Add binary toggle Param to PySpark HashingTF in ML & MLlib
## What changes were proposed in this pull request?

This fix tries to add binary toggle Param to PySpark HashingTF in ML & MLlib. If this toggle is set, then all non-zero counts will be set to 1.

Note: This fix (SPARK-14238) is extended from SPARK-13963 where Scala implementation was done.

## How was this patch tested?

This fix adds two tests to cover the code changes. One for HashingTF in PySpark's ML and one for HashingTF in PySpark's MLLib.

Author: Yong Tang <yong.tang.github@outlook.com>

Closes #12079 from yongtang/SPARK-14238.
2016-04-14 21:53:32 +02:00
Bryan Cutler c5172f8205 [SPARK-13967][PYSPARK][ML] Added binary Param to Python CountVectorizer
Added binary toggle param to CountVectorizer feature transformer in PySpark.

Created a unit test for using CountVectorizer with the binary toggle on.

Author: Bryan Cutler <cutlerb@gmail.com>

Closes #12308 from BryanCutler/binary-param-python-CountVectorizer-SPARK-13967.
2016-04-14 20:47:31 +02:00
Holden Karau 478af2f455 [SPARK-14573][PYSPARK][BUILD] Fix PyDoc Makefile & highlighting issues
## What changes were proposed in this pull request?

The PyDoc Makefile used "=" rather than "?=" for setting env variables so it overwrote the user values. This ignored the environment variables we set for linting allowing warnings through. This PR also fixes the warnings that had been introduced.

## How was this patch tested?

manual local export & make

Author: Holden Karau <holden@us.ibm.com>

Closes #12336 from holdenk/SPARK-14573-fix-pydoc-makefile.
2016-04-14 09:42:15 +01:00
Bryan Cutler fc3cd2f509 [SPARK-14472][PYSPARK][ML] Cleanup ML JavaWrapper and related class hierarchy
Currently, JavaWrapper is only a wrapper class for pipeline classes that have Params and JavaCallable is a separate mixin that provides methods to make Java calls.  This change simplifies the class structure and to define the Java wrapper in a plain base class along with methods to make Java calls.  Also, renames Java wrapper classes to better reflect their purpose.

Ran existing Python ml tests and generated documentation to test this change.

Author: Bryan Cutler <cutlerb@gmail.com>

Closes #12304 from BryanCutler/pyspark-cleanup-JavaWrapper-SPARK-14472.
2016-04-13 14:08:57 -07:00
Liwei Lin 23f93f559c [SPARK-13992][CORE][PYSPARK][FOLLOWUP] Update OFF_HEAP semantics for Java api and Python api
## What changes were proposed in this pull request?

- updated `OFF_HEAP` semantics for `StorageLevels.java`
- updated `OFF_HEAP` semantics for `storagelevel.py`

## How was this patch tested?

no need to test

Author: Liwei Lin <lwlin7@gmail.com>

Closes #12126 from lw-lin/storagelevel.py.
2016-04-12 23:06:55 -07:00
Kai Jiang 7f024c4744 [SPARK-13597][PYSPARK][ML] Python API for GeneralizedLinearRegression
## What changes were proposed in this pull request?

Python API for GeneralizedLinearRegression
JIRA: https://issues.apache.org/jira/browse/SPARK-13597

## How was this patch tested?

The patch is tested with Python doctest.

Author: Kai Jiang <jiangkai@gmail.com>

Closes #11468 from vectorijk/spark-13597.
2016-04-12 11:29:12 -07:00
Holden Karau 00288ea2a4 [SPARK-13687][PYTHON] Cleanup PySpark parallelize temporary files
## What changes were proposed in this pull request?

Eagerly cleanup PySpark's temporary parallelize cleanup files rather than waiting for shut down.

## How was this patch tested?

Unit tests

Author: Holden Karau <holden@us.ibm.com>

Closes #12233 from holdenk/SPARK-13687-cleanup-pyspark-temporary-files.
2016-04-10 02:34:54 +01:00
Joseph K. Bradley d7af736b2c [SPARK-14498][ML][PYTHON][SQL] Many cleanups to ML and ML-related docs
## What changes were proposed in this pull request?

Cleanups to documentation.  No changes to code.
* GBT docs: Move Scala doc for private object GradientBoostedTrees to public docs for GBTClassifier,Regressor
* GLM regParam: needs doc saying it is for L2 only
* TrainValidationSplitModel: add .. versionadded:: 2.0.0
* Rename “_transformer_params_from_java” to “_transfer_params_from_java”
* LogReg Summary classes: “probability” col should not say “calibrated”
* LR summaries: coefficientStandardErrors —> document that intercept stderr comes last.  Same for t,p-values
* approxCountDistinct: Document meaning of “rsd" argument.
* LDA: note which params are for online LDA only

## How was this patch tested?

Doc build

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #12266 from jkbradley/ml-doc-cleanups.
2016-04-08 20:15:44 -07:00
wm624@hotmail.com e0ad75f2b5 [SPARK-12569][PYSPARK][ML] DecisionTreeRegressor: provide variance of prediction: Python API
## What changes were proposed in this pull request?

A new column VarianceCol has been added to DecisionTreeRegressor in ML scala code.

This patch adds the corresponding Python API, HasVarianceCol, to class DecisionTreeRegressor.

## How was this patch tested?
./dev/lint-python
PEP8 checks passed.
rm -rf _build/*
pydoc checks passed.

./python/run-tests --python-executables=python2.7 --modules=pyspark-ml
Running PySpark tests. Output is in /Users/mwang/spark_ws_0904/python/unit-tests.log
Will test against the following Python executables: ['python2.7']
Will test the following Python modules: ['pyspark-ml']
Finished test(python2.7): pyspark.ml.evaluation (12s)
Finished test(python2.7): pyspark.ml.clustering (18s)
Finished test(python2.7): pyspark.ml.classification (30s)
Finished test(python2.7): pyspark.ml.recommendation (28s)
Finished test(python2.7): pyspark.ml.feature (43s)
Finished test(python2.7): pyspark.ml.regression (31s)
Finished test(python2.7): pyspark.ml.tuning (19s)
Finished test(python2.7): pyspark.ml.tests (34s)

(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes #12116 from wangmiao1981/fix_api.
2016-04-08 10:47:05 -07:00
Kai Jiang e5d8d6e09c [SPARK-14373][PYSPARK] PySpark RandomForestClassifier, Regressor support export/import
## What changes were proposed in this pull request?
supporting `RandomForest{Classifier, Regressor}` save/load for Python API.
[JIRA](https://issues.apache.org/jira/browse/SPARK-14373)
## How was this patch tested?
doctest

Author: Kai Jiang <jiangkai@gmail.com>

Closes #12238 from vectorijk/spark-14373.
2016-04-08 10:39:12 -07:00
Bryan Cutler 9c6556c5f8 [SPARK-13430][PYSPARK][ML] Python API for training summaries of linear and logistic regression
## What changes were proposed in this pull request?

Adding Python API for training summaries of LogisticRegression and LinearRegression in PySpark ML.

## How was this patch tested?
Added unit tests to exercise the api calls for the summary classes.  Also, manually verified values are expected and match those from Scala directly.

Author: Bryan Cutler <cutlerb@gmail.com>

Closes #11621 from BryanCutler/pyspark-ml-summary-SPARK-13430.
2016-04-06 12:07:47 -07:00
Xusen Yin db0b06c6ea [SPARK-13786][ML][PYSPARK] Add save/load for pyspark.ml.tuning
## What changes were proposed in this pull request?

https://issues.apache.org/jira/browse/SPARK-13786

Add save/load for Python CrossValidator/Model and TrainValidationSplit/Model.

## How was this patch tested?

Test with Python doctest.

Author: Xusen Yin <yinxusen@gmail.com>

Closes #12020 from yinxusen/SPARK-13786.
2016-04-06 11:24:11 -07:00
Davies Liu 90ca184486 [SPARK-14418][PYSPARK] fix unpersist of Broadcast in Python
## What changes were proposed in this pull request?

Currently, Broaccast.unpersist() will remove the file of broadcast, which should be the behavior of destroy().

This PR added destroy() for Broadcast in Python, to match the sematics in Scala.

## How was this patch tested?

Added regression tests.

Author: Davies Liu <davies@databricks.com>

Closes #12189 from davies/py_unpersist.
2016-04-06 10:46:34 -07:00
Burak Yavuz 9ee5c25717 [SPARK-14353] Dataset Time Window window API for Python, and SQL
## What changes were proposed in this pull request?

The `window` function was added to Dataset with [this PR](https://github.com/apache/spark/pull/12008).
This PR adds the Python, and SQL, API for this function.

With this PR, SQL, Java, and Scala will share the same APIs as in users can use:
 - `window(timeColumn, windowDuration)`
 - `window(timeColumn, windowDuration, slideDuration)`
 - `window(timeColumn, windowDuration, slideDuration, startTime)`

In Python, users can access all APIs above, but in addition they can do
 - In Python:
   `window(timeColumn, windowDuration, startTime=...)`

that is, they can provide the startTime without providing the `slideDuration`. In this case, we will generate tumbling windows.

## How was this patch tested?

Unit tests + manual tests

Author: Burak Yavuz <brkyvz@gmail.com>

Closes #12136 from brkyvz/python-windows.
2016-04-05 13:18:39 -07:00
Yong Tang 7db56244fa [SPARK-14368][PYSPARK] Support python.spark.worker.memory with upper-case unit.
## What changes were proposed in this pull request?

This fix tries to address the issue in PySpark where `spark.python.worker.memory`
could only be configured with a lower case unit (`k`, `m`, `g`, `t`). This fix
allows the upper case unit (`K`, `M`, `G`, `T`) to be used as well. This is to
conform to the JVM memory string as is specified in the documentation .

## How was this patch tested?

This fix adds additional test to cover the changes.

Author: Yong Tang <yong.tang.github@outlook.com>

Closes #12163 from yongtang/SPARK-14368.
2016-04-05 12:19:20 +09:00
Marcelo Vanzin 24d7d2e453 [SPARK-13579][BUILD] Stop building the main Spark assembly.
This change modifies the "assembly/" module to just copy needed
dependencies to its build directory, and modifies the packaging
script to pick those up (and remove duplicate jars packages in the
examples module).

I also made some minor adjustments to dependencies to remove some
test jars from the final packaging, and remove jars that conflict with each
other when packaged separately (e.g. servlet api).

Also note that this change restores guava in applications' classpaths, even
though it's still shaded inside Spark. This is now needed for the Hadoop
libraries that are packaged with Spark, which now are not processed by
the shade plugin.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #11796 from vanzin/SPARK-13579.
2016-04-04 16:52:22 -07:00
Davies Liu cc70f17416 [SPARK-14334] [SQL] add toLocalIterator for Dataset/DataFrame
## What changes were proposed in this pull request?

RDD.toLocalIterator() could be used to fetch one partition at a time to reduce the memory usage. Right now, for Dataset/Dataframe we have to use df.rdd.toLocalIterator, which is super slow also requires lots of memory (because of the Java serializer or even Kyro serializer).

This PR introduce an optimized toLocalIterator for Dataset/DataFrame, which is much faster and requires much less memory. For a partition with 5 millions rows, `df.rdd.toIterator` took about 100 seconds, but df.toIterator took less than 7 seconds. For 10 millions row, rdd.toIterator will crash (not enough memory) with 4G heap, but df.toLocalIterator could finished in 12 seconds.

The JDBC server has been updated to use DataFrame.toIterator.

## How was this patch tested?

Existing tests.

Author: Davies Liu <davies@databricks.com>

Closes #12114 from davies/local_iterator.
2016-04-04 13:31:44 -07:00
Davies Liu 5743c6476d [SPARK-12981] [SQL] extract Pyhton UDF in physical plan
## What changes were proposed in this pull request?

Currently we extract Python UDFs into a special logical plan EvaluatePython in analyzer, But EvaluatePython is not part of catalyst, many rules have no knowledge of it , which will break many things (for example, filter push down or column pruning).

We should treat Python UDFs as normal expressions, until we want to evaluate in physical plan, we could extract them in end of optimizer, or physical plan.

This PR extract Python UDFs in physical plan.

Closes #10935

## How was this patch tested?

Added regression tests.

Author: Davies Liu <davies@databricks.com>

Closes #12127 from davies/py_udf.
2016-04-04 10:56:26 -07:00
hyukjinkwon 2262a93358 [SPARK-14231] [SQL] JSON data source infers floating-point values as a double when they do not fit in a decimal
## What changes were proposed in this pull request?

https://issues.apache.org/jira/browse/SPARK-14231

Currently, JSON data source supports to infer `DecimalType` for big numbers and `floatAsBigDecimal` option which reads floating-point values as `DecimalType`.

But there are few restrictions in Spark `DecimalType` below:

1. The precision cannot be bigger than 38.
2. scale cannot be bigger than precision.

Currently, both restrictions are not being handled.

This PR handles the cases by inferring them as `DoubleType`. Also, the option name was changed from `floatAsBigDecimal` to `prefersDecimal` as suggested [here](https://issues.apache.org/jira/browse/SPARK-14231?focusedCommentId=15215579&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15215579).

So, the codes below:

```scala
def doubleRecords: RDD[String] =
  sqlContext.sparkContext.parallelize(
    s"""{"a": 1${"0" * 38}, "b": 0.01}""" ::
    s"""{"a": 2${"0" * 38}, "b": 0.02}""" :: Nil)

val jsonDF = sqlContext.read
  .option("prefersDecimal", "true")
  .json(doubleRecords)
jsonDF.printSchema()
```

produces below:

- **Before**

```scala
org.apache.spark.sql.AnalysisException: Decimal scale (2) cannot be greater than precision (1).;
	at org.apache.spark.sql.types.DecimalType.<init>(DecimalType.scala:44)
	at org.apache.spark.sql.execution.datasources.json.InferSchema$.org$apache$spark$sql$execution$datasources$json$InferSchema$$inferField(InferSchema.scala:144)
	at org.apache.spark.sql.execution.datasources.json.InferSchema$.org$apache$spark$sql$execution$datasources$json$InferSchema$$inferField(InferSchema.scala:108)
	at
...
```

- **After**

```scala
root
 |-- a: double (nullable = true)
 |-- b: double (nullable = true)
```

## How was this patch tested?

Unit tests were used and `./dev/run_tests` for coding style tests.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #12030 from HyukjinKwon/SPARK-14231.
2016-04-02 23:12:04 -07:00
Liwei Lin 03d130f973 [SPARK-14342][CORE][DOCS][TESTS] Remove straggler references to Tachyon
## What changes were proposed in this pull request?

Straggler references to Tachyon were removed:
- for docs, `tachyon` has been generalized as `off-heap memory`;
- for Mesos test suits, the key-value `tachyon:true`/`tachyon:false` has been changed to `os:centos`/`os:ubuntu`, since `os` is an example constrain used by the [Mesos official docs](http://mesos.apache.org/documentation/attributes-resources/).

## How was this patch tested?

Existing test suites.

Author: Liwei Lin <lwlin7@gmail.com>

Closes #12129 from lw-lin/tachyon-cleanup.
2016-04-02 17:55:46 -07:00
Yanbo Liang 381358fbe9 [SPARK-14305][ML][PYSPARK] PySpark ml.clustering BisectingKMeans support export/import
## What changes were proposed in this pull request?
PySpark ml.clustering BisectingKMeans support export/import
## How was this patch tested?
doc test.

cc jkbradley

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #12112 from yanboliang/spark-14305.
2016-04-01 12:53:39 -07:00
Alexander Ulanov 26867ebc67 [SPARK-11262][ML] Unit test for gradient, loss layers, memory management for multilayer perceptron
1.Implement LossFunction trait and implement squared error and cross entropy
loss with it
2.Implement unit test for gradient and loss
3.Implement InPlace trait and in-place layer evaluation
4.Refactor interface for ActivationFunction
5.Update of Layer and LayerModel interfaces
6.Fix random weights assignment
7.Implement memory allocation by MLP model instead of individual layers

These features decreased the memory usage and increased flexibility of
internal API.

Author: Alexander Ulanov <nashb@yandex.ru>
Author: avulanov <avulanov@gmail.com>

Closes #9229 from avulanov/mlp-refactoring.
2016-03-31 23:48:36 -07:00
Davies Liu f0afafdc5d [SPARK-14267] [SQL] [PYSPARK] execute multiple Python UDFs within single batch
## What changes were proposed in this pull request?

This PR support multiple Python UDFs within single batch, also improve the performance.

```python
>>> from pyspark.sql.types import IntegerType
>>> sqlContext.registerFunction("double", lambda x: x * 2, IntegerType())
>>> sqlContext.registerFunction("add", lambda x, y: x + y, IntegerType())
>>> sqlContext.sql("SELECT double(add(1, 2)), add(double(2), 1)").explain(True)
== Parsed Logical Plan ==
'Project [unresolvedalias('double('add(1, 2)), None),unresolvedalias('add('double(2), 1), None)]
+- OneRowRelation$

== Analyzed Logical Plan ==
double(add(1, 2)): int, add(double(2), 1): int
Project [double(add(1, 2))#14,add(double(2), 1)#15]
+- Project [double(add(1, 2))#14,add(double(2), 1)#15]
   +- Project [pythonUDF0#16 AS double(add(1, 2))#14,pythonUDF0#18 AS add(double(2), 1)#15]
      +- EvaluatePython [add(pythonUDF1#17, 1)], [pythonUDF0#18]
         +- EvaluatePython [double(add(1, 2)),double(2)], [pythonUDF0#16,pythonUDF1#17]
            +- OneRowRelation$

== Optimized Logical Plan ==
Project [pythonUDF0#16 AS double(add(1, 2))#14,pythonUDF0#18 AS add(double(2), 1)#15]
+- EvaluatePython [add(pythonUDF1#17, 1)], [pythonUDF0#18]
   +- EvaluatePython [double(add(1, 2)),double(2)], [pythonUDF0#16,pythonUDF1#17]
      +- OneRowRelation$

== Physical Plan ==
WholeStageCodegen
:  +- Project [pythonUDF0#16 AS double(add(1, 2))#14,pythonUDF0#18 AS add(double(2), 1)#15]
:     +- INPUT
+- !BatchPythonEvaluation [add(pythonUDF1#17, 1)], [pythonUDF0#16,pythonUDF1#17,pythonUDF0#18]
   +- !BatchPythonEvaluation [double(add(1, 2)),double(2)], [pythonUDF0#16,pythonUDF1#17]
      +- Scan OneRowRelation[]
```

## How was this patch tested?

Added new tests.

Using the following script to benchmark 1, 2 and 3 udfs,
```
df = sqlContext.range(1, 1 << 23, 1, 4)
double = F.udf(lambda x: x * 2, LongType())
print df.select(double(df.id)).count()
print df.select(double(df.id), double(df.id + 1)).count()
print df.select(double(df.id), double(df.id + 1), double(df.id + 2)).count()
```
Here is the results:

N | Before | After  | speed up
---- |------------ | -------------|------
1 | 22 s | 7 s |  3.1X
2 | 38 s | 13 s | 2.9X
3 | 58 s | 16 s | 3.6X

This benchmark ran locally with 4 CPUs. For 3 UDFs, it launched 12 Python before before this patch, 4 process after this patch. After this patch, it will use less memory for multiple UDFs than before (less buffering).

Author: Davies Liu <davies@databricks.com>

Closes #12057 from davies/multi_udfs.
2016-03-31 16:40:20 -07:00
sethah b11887c086 [SPARK-14264][PYSPARK][ML] Add feature importance for GBTs in pyspark
## What changes were proposed in this pull request?

Feature importances are exposed in the python API for GBTs.

Other changes:
* Update the random forest feature importance documentation to not repeat decision tree docstring and instead place a reference to it.

## How was this patch tested?

Python doc tests were updated to validate GBT feature importance.

Author: sethah <seth.hendrickson16@gmail.com>

Closes #12056 from sethah/Pyspark_GBT_feature_importance.
2016-03-31 13:00:10 -07:00
Herman van Hovell a9b93e0739 [SPARK-14211][SQL] Remove ANTLR3 based parser
### What changes were proposed in this pull request?

This PR removes the ANTLR3 based parser, and moves the new ANTLR4 based parser into the `org.apache.spark.sql.catalyst.parser package`.

### How was this patch tested?

Existing unit tests.

cc rxin andrewor14 yhuai

Author: Herman van Hovell <hvanhovell@questtec.nl>

Closes #12071 from hvanhovell/SPARK-14211.
2016-03-31 09:25:09 -07:00
Yanbo Liang f301df37cb [SPARK-14152][ML][PYSPARK] MultilayerPerceptronClassifier supports save/load for Python API
## What changes were proposed in this pull request?
```MultilayerPerceptronClassifier``` supports save/load for Python API.

## How was this patch tested?
doctest.

cc mengxr jkbradley yinxusen

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #11952 from yanboliang/spark-14152.
2016-03-30 15:47:01 -07:00
Davies Liu a7a93a116d [SPARK-14215] [SQL] [PYSPARK] Support chained Python UDFs
## What changes were proposed in this pull request?

This PR brings the support for chained Python UDFs, for example

```sql
select udf1(udf2(a))
select udf1(udf2(a) + 3)
select udf1(udf2(a) + udf3(b))
```

Also directly chained unary Python UDFs are put in single batch of Python UDFs, others may require multiple batches.

For example,
```python
>>> sqlContext.sql("select double(double(1))").explain()
== Physical Plan ==
WholeStageCodegen
:  +- Project [pythonUDF#10 AS double(double(1))#9]
:     +- INPUT
+- !BatchPythonEvaluation double(double(1)), [pythonUDF#10]
   +- Scan OneRowRelation[]
>>> sqlContext.sql("select double(double(1) + double(2))").explain()
== Physical Plan ==
WholeStageCodegen
:  +- Project [pythonUDF#19 AS double((double(1) + double(2)))#16]
:     +- INPUT
+- !BatchPythonEvaluation double((pythonUDF#17 + pythonUDF#18)), [pythonUDF#17,pythonUDF#18,pythonUDF#19]
   +- !BatchPythonEvaluation double(2), [pythonUDF#17,pythonUDF#18]
      +- !BatchPythonEvaluation double(1), [pythonUDF#17]
         +- Scan OneRowRelation[]
```

TODO: will support multiple unrelated Python UDFs in one batch (another PR).

## How was this patch tested?

Added new unit tests for chained UDFs.

Author: Davies Liu <davies@databricks.com>

Closes #12014 from davies/py_udfs.
2016-03-29 15:06:29 -07:00
wm624@hotmail.com 63b200e8d4 [SPARK-14071][PYSPARK][ML] Change MLWritable.write to be a property
Add property to MLWritable.write method, so we can use .write instead of .write()

Add a new test to ml/test.py to check whether the write is a property.
./python/run-tests --python-executables=python2.7 --modules=pyspark-ml

Will test against the following Python executables: ['python2.7']
Will test the following Python modules: ['pyspark-ml']
Finished test(python2.7): pyspark.ml.evaluation (11s)
Finished test(python2.7): pyspark.ml.clustering (16s)
Finished test(python2.7): pyspark.ml.classification (24s)
Finished test(python2.7): pyspark.ml.recommendation (24s)
Finished test(python2.7): pyspark.ml.feature (39s)
Finished test(python2.7): pyspark.ml.regression (26s)
Finished test(python2.7): pyspark.ml.tuning (15s)
Finished test(python2.7): pyspark.ml.tests (30s)
Tests passed in 55 seconds

Author: wm624@hotmail.com <wm624@hotmail.com>

Closes #11945 from wangmiao1981/fix_property.
2016-03-28 22:33:25 -07:00
zero323 39f743a623 [SPARK-14202] [PYTHON] Use generator expression instead of list comp in python_full_outer_jo…
## What changes were proposed in this pull request?

This PR replaces list comprehension in python_full_outer_join.dispatch with a generator expression.

## How was this patch tested?

PySpark-Core, PySpark-MLlib test suites against Python 2.7, 3.5.

Author: zero323 <matthew.szymkiewicz@gmail.com>

Closes #11998 from zero323/pyspark-join-generator-expr.
2016-03-28 14:51:36 -07:00
Herman van Hovell 600c0b69ca [SPARK-13713][SQL] Migrate parser from ANTLR3 to ANTLR4
### What changes were proposed in this pull request?
The current ANTLR3 parser is quite complex to maintain and suffers from code blow-ups. This PR introduces a new parser that is based on ANTLR4.

This parser is based on the [Presto's SQL parser](https://github.com/facebook/presto/blob/master/presto-parser/src/main/antlr4/com/facebook/presto/sql/parser/SqlBase.g4). The current implementation can parse and create Catalyst and SQL plans. Large parts of the HiveQl DDL and some of the DML functionality is currently missing, the plan is to add this in follow-up PRs.

This PR is a work in progress, and work needs to be done in the following area's:

- [x] Error handling should be improved.
- [x] Documentation should be improved.
- [x] Multi-Insert needs to be tested.
- [ ] Naming and package locations.

### How was this patch tested?

Catalyst and SQL unit tests.

Author: Herman van Hovell <hvanhovell@questtec.nl>

Closes #11557 from hvanhovell/ngParser.
2016-03-28 12:31:12 -07:00
Shixiong Zhu d23ad7c1c9 [SPARK-13874][DOC] Remove docs of streaming-akka, streaming-zeromq, streaming-mqtt and streaming-twitter
## What changes were proposed in this pull request?

This PR removes all docs about the old streaming-akka, streaming-zeromq, streaming-mqtt and streaming-twitter projects since I have already copied them to https://github.com/spark-packages

Also remove mqtt_wordcount.py that I forgot to remove previously.

## How was this patch tested?

Jenkins PR Build.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #11824 from zsxwing/remove-doc.
2016-03-26 01:47:27 -07:00
Shixiong Zhu 24587ce433 [SPARK-14073][STREAMING][TEST-MAVEN] Move flume back to Spark
## What changes were proposed in this pull request?

This PR moves flume back to Spark as per the discussion in the dev mail-list.

## How was this patch tested?

Existing Jenkins tests.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #11895 from zsxwing/move-flume-back.
2016-03-25 17:37:16 -07:00
Wenchen Fan 43b15e01c4 [SPARK-14061][SQL] implement CreateMap
## What changes were proposed in this pull request?

As we have `CreateArray` and `CreateStruct`, we should also have `CreateMap`.  This PR adds the `CreateMap` expression, and the DataFrame API, and python API.

## How was this patch tested?

various new tests.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #11879 from cloud-fan/create_map.
2016-03-25 09:50:06 -07:00
Andrew Or 20ddf5fddf [SPARK-14014][SQL] Integrate session catalog (attempt #2)
## What changes were proposed in this pull request?

This reopens #11836, which was merged but promptly reverted because it introduced flaky Hive tests.

## How was this patch tested?

See `CatalogTestCases`, `SessionCatalogSuite` and `HiveContextSuite`.

Author: Andrew Or <andrew@databricks.com>

Closes #11938 from andrewor14/session-catalog-again.
2016-03-24 22:59:35 -07:00
Reynold Xin 3619fec1ec [SPARK-14142][SQL] Replace internal use of unionAll with union
## What changes were proposed in this pull request?
unionAll has been deprecated in SPARK-14088.

## How was this patch tested?
Should be covered by all existing tests.

Author: Reynold Xin <rxin@databricks.com>

Closes #11946 from rxin/SPARK-14142.
2016-03-24 22:34:55 -07:00
GayathriMurali 0874ff3aad [SPARK-13949][ML][PYTHON] PySpark ml DecisionTreeClassifier, Regressor support export/import
## What changes were proposed in this pull request?

Added MLReadable and MLWritable to Decision Tree Classifier and Regressor. Added doctests.

## How was this patch tested?

Python Unit tests. Tests added to check persistence in DecisionTreeClassifier and DecisionTreeRegressor.

Author: GayathriMurali <gayathri.m.softie@gmail.com>

Closes #11892 from GayathriMurali/SPARK-13949.
2016-03-24 19:20:49 -07:00
sethah 585097716c [SPARK-14107][PYSPARK][ML] Add seed as named argument to GBTs in pyspark
## What changes were proposed in this pull request?

GBTs in pyspark previously had seed parameters, but they could not be passed as keyword arguments through the class constructor. This patch adds seed as a keyword argument and also sets default value.

## How was this patch tested?

Doc tests were updated to pass a random seed through the GBTClassifier and GBTRegressor constructors.

Author: sethah <seth.hendrickson16@gmail.com>

Closes #11944 from sethah/SPARK-14107.
2016-03-24 19:14:24 -07:00
Andrew Or c44d140cae Revert "[SPARK-14014][SQL] Replace existing catalog with SessionCatalog"
This reverts commit 5dfc01976b.
2016-03-23 22:21:15 -07:00
Joseph K. Bradley cf823bead1 [SPARK-12183][ML][MLLIB] Remove mllib tree implementation, and wrap spark.ml one
Primary change:
* Removed spark.mllib.tree.DecisionTree implementation of tree and forest learning.
* spark.mllib now calls the spark.ml implementation.
* Moved unit tests (of tree learning internals) from spark.mllib to spark.ml as needed.

ml.tree.DecisionTreeModel
* Added toOld and made ```private[spark]```, implemented for Classifier and Regressor in subclasses.  These methods now use OldInformationGainStats.invalidInformationGainStats for LeafNodes in order to mimic the spark.mllib implementation.

ml.tree.Node
* Added ```private[tree] def deepCopy```, used by unit tests

Copied developer comments from spark.mllib implementation to spark.ml one.

Moving unit tests
* Tree learning internals were tested by spark.mllib.tree.DecisionTreeSuite, or spark.mllib.tree.RandomForestSuite.
* Those tests were all moved to spark.ml.tree.impl.RandomForestSuite.  The order in the file + the test names are the same, so you should be able to compare them by opening them in 2 windows side-by-side.
* I made minimal changes to each test to allow it to run.  Each test makes the same checks as before, except for a few removed assertions which were checking irrelevant values.
* No new unit tests were added.
* mllib.tree.DecisionTreeSuite: I removed some checks of splits and bins which were not relevant to the unit tests they were in.  Those same split calculations were already being tested in other unit tests, for each dataset type.

**Changes of behavior** (to be noted in SPARK-13448 once this PR is merged)
* spark.ml.tree.impl.RandomForest: Rather than throwing an error when maxMemoryInMB is set to too small a value (to split any node), we now allow 1 node to be split, even if its memory requirements exceed maxMemoryInMB.  This involved removing the maxMemoryPerNode check in RandomForest.run, as well as modifying selectNodesToSplit().  Once this PR is merged, I will note the change of behavior on SPARK-13448.
* spark.mllib.tree.DecisionTree: When a tree only has one node (root = leaf node), the "stats" field will now be empty, rather than being set to InformationGainStats.invalidInformationGainStats.  This does not remove information from the tree, and it will save a bit of storage.

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #11855 from jkbradley/remove-mllib-tree-impl.
2016-03-23 21:16:00 -07:00
Andrew Or 5dfc01976b [SPARK-14014][SQL] Replace existing catalog with SessionCatalog
## What changes were proposed in this pull request?

`SessionCatalog`, introduced in #11750, is a catalog that keeps track of temporary functions and tables, and delegates metastore operations to `ExternalCatalog`. This functionality overlaps a lot with the existing `analysis.Catalog`.

As of this commit, `SessionCatalog` and `ExternalCatalog` will no longer be dead code. There are still things that need to be done after this patch, namely:
- SPARK-14013: Properly implement temporary functions in `SessionCatalog`
- SPARK-13879: Decide which DDL/DML commands to support natively in Spark
- SPARK-?????: Implement the ones we do want to support through `SessionCatalog`.
- SPARK-?????: Merge SQL/HiveContext

## How was this patch tested?

This is largely a refactoring task so there are no new tests introduced. The particularly relevant tests are `SessionCatalogSuite` and `ExternalCatalogSuite`.

Author: Andrew Or <andrew@databricks.com>
Author: Yin Huai <yhuai@databricks.com>

Closes #11836 from andrewor14/use-session-catalog.
2016-03-23 13:34:22 -07:00
sethah 30bdb5cbd9 [SPARK-13068][PYSPARK][ML] Type conversion for Pyspark params
## What changes were proposed in this pull request?

This patch adds type conversion functionality for parameters in Pyspark. A `typeConverter` field is added to the constructor of `Param` class. This argument is a function which converts values passed to this param to the appropriate type if possible. This is beneficial so that the params can fail at set time if they are given inappropriate values, but even more so because coherent error messages are now provided when Py4J cannot cast the python type to the appropriate Java type.

This patch also adds a `TypeConverters` class with factory methods for common type conversions. Most of the changes involve adding these factory type converters to existing params. The previous solution to this issue, `expectedType`, is deprecated and can be removed in 2.1.0 as discussed on the Jira.

## How was this patch tested?

Unit tests were added in python/pyspark/ml/tests.py to test parameter type conversion. These tests check that values that should be convertible are converted correctly, and that the appropriate errors are thrown when invalid values are provided.

Author: sethah <seth.hendrickson16@gmail.com>

Closes #11663 from sethah/SPARK-13068-tc.
2016-03-23 11:20:44 -07:00
Reynold Xin 926a93e54b [SPARK-14088][SQL] Some Dataset API touch-up
## What changes were proposed in this pull request?
1. Deprecated unionAll. It is pretty confusing to have both "union" and "unionAll" when the two do the same thing in Spark but are different in SQL.
2. Rename reduce in KeyValueGroupedDataset to reduceGroups so it is more consistent with rest of the functions in KeyValueGroupedDataset. Also makes it more obvious what "reduce" and "reduceGroups" mean. Previously it was confusing because it could be reducing a Dataset, or just reducing groups.
3. Added a "name" function, which is more natural to name columns than "as" for non-SQL users.
4. Remove "subtract" function since it is just an alias for "except".

## How was this patch tested?
All changes should be covered by existing tests. Also added couple test cases to cover "name".

Author: Reynold Xin <rxin@databricks.com>

Closes #11908 from rxin/SPARK-14088.
2016-03-22 23:43:09 -07:00
Joseph K. Bradley 7e3423b9c0 [SPARK-13951][ML][PYTHON] Nested Pipeline persistence
Adds support for saving and loading nested ML Pipelines from Python.  Pipeline and PipelineModel do not extend JavaWrapper, but they are able to utilize the JavaMLWriter, JavaMLReader implementations.

Also:
* Separates out interfaces from Java wrapper implementations for MLWritable, MLReadable, MLWriter, MLReader.
* Moves methods _stages_java2py, _stages_py2java into Pipeline, PipelineModel as _transfer_stage_from_java, _transfer_stage_to_java

Added new unit test for nested Pipelines.  Abstracted validity check into a helper method for the 2 unit tests.

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #11866 from jkbradley/nested-pipeline-io.
Closes #11835
2016-03-22 12:11:37 -07:00
hyukjinkwon 4e09a0d5ea [SPARK-13953][SQL] Specifying the field name for corrupted record via option at JSON datasource
## What changes were proposed in this pull request?

https://issues.apache.org/jira/browse/SPARK-13953

Currently, JSON data source creates a new field in `PERMISSIVE` mode for storing malformed string.
This field can be renamed via `spark.sql.columnNameOfCorruptRecord` option but it is a global configuration.

This PR make that option can be applied per read and can be specified via `option()`. This will overwrites `spark.sql.columnNameOfCorruptRecord` if it is set.

## How was this patch tested?

Unit tests were used and `./dev/run_tests` for coding style tests.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #11881 from HyukjinKwon/SPARK-13953.
2016-03-22 20:30:48 +08:00
zero323 8193a266b5 [SPARK-14058][PYTHON] Incorrect docstring in Window.order
## What changes were proposed in this pull request?

Replaces current docstring ("Creates a :class:`WindowSpec` with the partitioning defined.") with "Creates a :class:`WindowSpec` with the ordering defined."

## How was this patch tested?

PySpark unit tests (no regression introduced). No changes to the code.

Author: zero323 <matthew.szymkiewicz@gmail.com>

Closes #11877 from zero323/order-by-description.
2016-03-21 23:52:33 -07:00
hyukjinkwon e474088144 [SPARK-13764][SQL] Parse modes in JSON data source
## What changes were proposed in this pull request?

Currently, there is no way to control the behaviour when fails to parse corrupt records in JSON data source .

This PR adds the support for parse modes just like CSV data source. There are three modes below:

- `PERMISSIVE` :  When it fails to parse, this sets `null` to to field. This is a default mode when it has been this mode.
- `DROPMALFORMED`: When it fails to parse, this drops the whole record.
- `FAILFAST`: When it fails to parse, it just throws an exception.

This PR also make JSON data source share the `ParseModes` in CSV data source.

## How was this patch tested?

Unit tests were used and `./dev/run_tests` for code style tests.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #11756 from HyukjinKwon/SPARK-13764.
2016-03-21 15:42:35 +08:00
Xusen Yin 454a00df2a [SPARK-13993][PYSPARK] Add pyspark Rformula/RforumlaModel save/load
## What changes were proposed in this pull request?

https://issues.apache.org/jira/browse/SPARK-13993

## How was this patch tested?

doctest

Author: Xusen Yin <yinxusen@gmail.com>

Closes #11807 from yinxusen/SPARK-13993.
2016-03-20 15:34:34 -07:00
Bryan Cutler 828213d4ca [SPARK-13937][PYSPARK][ML] Change JavaWrapper _java_obj from static to member variable
## What changes were proposed in this pull request?
In PySpark wrapper.py JavaWrapper change _java_obj from an unused static variable to a member variable that is consistent with usage in derived classes.

## How was this patch tested?
Ran python tests for ML and MLlib.

Author: Bryan Cutler <cutlerb@gmail.com>

Closes #11767 from BryanCutler/JavaWrapper-static-_java_obj-SPARK-13937.
2016-03-17 10:16:51 -07:00
GayathriMurali 27e1f38851 [SPARK-13034] PySpark ml.classification support export/import
## What changes were proposed in this pull request?

Add export/import for all estimators and transformers(which have Scala implementation) under pyspark/ml/classification.py.

## How was this patch tested?

./python/run-tests
./dev/lint-python
Unit tests added to check persistence in Logistic Regression

Author: GayathriMurali <gayathri.m.softie@gmail.com>

Closes #11707 from GayathriMurali/SPARK-13034.
2016-03-16 14:21:42 -07:00
Xusen Yin ae6c677c8a [SPARK-13038][PYSPARK] Add load/save to pipeline
## What changes were proposed in this pull request?

JIRA issue: https://issues.apache.org/jira/browse/SPARK-13038

1. Add load/save to PySpark Pipeline and PipelineModel

2. Add `_transfer_stage_to_java()` and `_transfer_stage_from_java()` for `JavaWrapper`.

## How was this patch tested?

Test with doctest.

Author: Xusen Yin <yinxusen@gmail.com>

Closes #11683 from yinxusen/SPARK-13038-only.
2016-03-16 13:49:40 -07:00
Reynold Xin 8e0b030606 [SPARK-10380][SQL] Fix confusing documentation examples for astype/drop_duplicates.
## What changes were proposed in this pull request?
We have seen users getting confused by the documentation for astype and drop_duplicates, because the examples in them do not use these functions (but do uses their aliases). This patch simply removes all examples for these functions, and say that they are aliases.

## How was this patch tested?
Existing PySpark unit tests.

Closes #11543.

Author: Reynold Xin <rxin@databricks.com>

Closes #11698 from rxin/SPARK-10380.
2016-03-14 19:25:49 -07:00
Shixiong Zhu 06dec37455 [SPARK-13843][STREAMING] Remove streaming-flume, streaming-mqtt, streaming-zeromq, streaming-akka, streaming-twitter to Spark packages
## What changes were proposed in this pull request?

Currently there are a few sub-projects, each for integrating with different external sources for Streaming.  Now that we have better ability to include external libraries (spark packages) and with Spark 2.0 coming up, we can move the following projects out of Spark to https://github.com/spark-packages

- streaming-flume
- streaming-akka
- streaming-mqtt
- streaming-zeromq
- streaming-twitter

They are just some ancillary packages and considering the overhead of maintenance, running tests and PR failures, it's better to maintain them out of Spark. In addition, these projects can have their different release cycles and we can release them faster.

I have already copied these projects to https://github.com/spark-packages

## How was this patch tested?

Jenkins tests

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #11672 from zsxwing/remove-external-pkg.
2016-03-14 16:56:04 -07:00
Josh Rosen 07cb323e7a [SPARK-13848][SPARK-5185] Update to Py4J 0.9.2 in order to fix classloading issue
This patch upgrades Py4J from 0.9.1 to 0.9.2 in order to include a patch which modifies Py4J to use the current thread's ContextClassLoader when performing reflection / class loading. This is necessary in order to fix [SPARK-5185](https://issues.apache.org/jira/browse/SPARK-5185), a longstanding issue affecting the use of `--jars` and `--packages` in PySpark.

In order to demonstrate that the fix works, I removed the workarounds which were added as part of [SPARK-6027](https://issues.apache.org/jira/browse/SPARK-6027) / #4779 and other patches.

Py4J diff: https://github.com/bartdag/py4j/compare/0.9.1...0.9.2

/cc zsxwing tdas davies brkyvz

Author: Josh Rosen <joshrosen@databricks.com>

Closes #11687 from JoshRosen/py4j-0.9.2.
2016-03-14 12:22:02 -07:00
Davies Liu ba8c86d06f [SPARK-13671] [SPARK-13311] [SQL] Use different physical plans for RDD and data sources
## What changes were proposed in this pull request?

This PR split the PhysicalRDD into two classes, PhysicalRDD and PhysicalScan. PhysicalRDD is used for DataFrames that is created from existing RDD. PhysicalScan is used for DataFrame that is created from data sources. This enable use to apply different optimization on both of them.

Also fix the problem for sameResult() on two DataSourceScan.

Also fix the equality check to toString for `In`. It's better to use Seq there, but we can't break this public API (sad).

## How was this patch tested?

Existing tests. Manually tested with TPCDS query Q59 and Q64, all those duplicated exchanges can be re-used now, also saw there are 40+% performance improvement (saving half of the scan).

Author: Davies Liu <davies@databricks.com>

Closes #11514 from davies/existing_rdd.
2016-03-12 00:48:36 -08:00
Josh Rosen 073bf9d4d9 [SPARK-13807] De-duplicate Python*Helper instantiation code in PySpark streaming
This patch de-duplicates code in PySpark streaming which loads the `Python*Helper` classes. I also changed a few `raise e` statements to simply `raise` in order to preserve the full exception stacktrace when re-throwing.

Here's a link to the whitespace-change-free diff: https://github.com/apache/spark/compare/master...JoshRosen:pyspark-reflection-deduplication?w=0

Author: Josh Rosen <joshrosen@databricks.com>

Closes #11641 from JoshRosen/pyspark-reflection-deduplication.
2016-03-11 11:18:51 -08:00
sethah 234f781ae1 [SPARK-13787][ML][PYSPARK] Pyspark feature importances for decision tree and random forest
## What changes were proposed in this pull request?

This patch adds a `featureImportance` property to the Pyspark API for `DecisionTreeRegressionModel`, `DecisionTreeClassificationModel`, `RandomForestRegressionModel` and `RandomForestClassificationModel`.

## How was this patch tested?

Python doc tests for the affected classes were updated to check feature importances.

Author: sethah <seth.hendrickson16@gmail.com>

Closes #11622 from sethah/SPARK-13787.
2016-03-11 09:54:23 +02:00
Zheng RuiFeng d18276cb1d [SPARK-13672][ML] Add python examples of BisectingKMeans in ML and MLLIB
JIRA: https://issues.apache.org/jira/browse/SPARK-13672

## What changes were proposed in this pull request?

add two python examples of BisectingKMeans for ml and mllib

## How was this patch tested?

manual tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #11515 from zhengruifeng/mllib_bkm_pe.
2016-03-11 09:21:12 +02:00
Cheng Lian 1d542785b9 [SPARK-13244][SQL] Migrates DataFrame to Dataset
## What changes were proposed in this pull request?

This PR unifies DataFrame and Dataset by migrating existing DataFrame operations to Dataset and make `DataFrame` a type alias of `Dataset[Row]`.

Most Scala code changes are source compatible, but Java API is broken as Java knows nothing about Scala type alias (mostly replacing `DataFrame` with `Dataset<Row>`).

There are several noticeable API changes related to those returning arrays:

1.  `collect`/`take`

    -   Old APIs in class `DataFrame`:

        ```scala
        def collect(): Array[Row]
        def take(n: Int): Array[Row]
        ```

    -   New APIs in class `Dataset[T]`:

        ```scala
        def collect(): Array[T]
        def take(n: Int): Array[T]

        def collectRows(): Array[Row]
        def takeRows(n: Int): Array[Row]
        ```

    Two specialized methods `collectRows` and `takeRows` are added because Java doesn't support returning generic arrays. Thus, for example, `DataFrame.collect(): Array[T]` actually returns `Object` instead of `Array<T>` from Java side.

    Normally, Java users may fall back to `collectAsList` and `takeAsList`.  The two new specialized versions are added to avoid performance regression in ML related code (but maybe I'm wrong and they are not necessary here).

1.  `randomSplit`

    -   Old APIs in class `DataFrame`:

        ```scala
        def randomSplit(weights: Array[Double], seed: Long): Array[DataFrame]
        def randomSplit(weights: Array[Double]): Array[DataFrame]
        ```

    -   New APIs in class `Dataset[T]`:

        ```scala
        def randomSplit(weights: Array[Double], seed: Long): Array[Dataset[T]]
        def randomSplit(weights: Array[Double]): Array[Dataset[T]]
        ```

    Similar problem as above, but hasn't been addressed for Java API yet.  We can probably add `randomSplitAsList` to fix this one.

1.  `groupBy`

    Some original `DataFrame.groupBy` methods have conflicting signature with original `Dataset.groupBy` methods.  To distinguish these two, typed `Dataset.groupBy` methods are renamed to `groupByKey`.

Other noticeable changes:

1.  Dataset always do eager analysis now

    We used to support disabling DataFrame eager analysis to help reporting partially analyzed malformed logical plan on analysis failure.  However, Dataset encoders requires eager analysi during Dataset construction.  To preserve the error reporting feature, `AnalysisException` now takes an extra `Option[LogicalPlan]` argument to hold the partially analyzed plan, so that we can check the plan tree when reporting test failures.  This plan is passed by `QueryExecution.assertAnalyzed`.

## How was this patch tested?

Existing tests do the work.

## TODO

- [ ] Fix all tests
- [ ] Re-enable MiMA check
- [ ] Update ScalaDoc (`since`, `group`, and example code)

Author: Cheng Lian <lian@databricks.com>
Author: Yin Huai <yhuai@databricks.com>
Author: Wenchen Fan <wenchen@databricks.com>
Author: Cheng Lian <liancheng@users.noreply.github.com>

Closes #11443 from liancheng/ds-to-df.
2016-03-10 17:00:17 -08:00
Tristan Reid 5f7dbdba6f [MINOR] Fix typo in 'hypot' docstring
Minor typo:  docstring for pyspark.sql.functions: hypot has extra characters

N/A

Author: Tristan Reid <treid@netflix.com>

Closes #11616 from tristanreid/master.
2016-03-09 18:05:03 -08:00
Sean Owen 256704c771 [SPARK-13595][BUILD] Move docker, extras modules into external
## What changes were proposed in this pull request?

Move `docker` dirs out of top level into `external/`; move `extras/*` into `external/`

## How was this patch tested?

This is tested with Jenkins tests.

Author: Sean Owen <sowen@cloudera.com>

Closes #11523 from srowen/SPARK-13595.
2016-03-09 18:27:44 +00:00
Bryan Cutler d8813fa043 [SPARK-13625][PYSPARK][ML] Added a check to see if an attribute is a property when getting param list
## What changes were proposed in this pull request?

Added a check in pyspark.ml.param.Param.params() to see if an attribute is a property (decorated with `property`) before checking if it is a `Param` instance.  This prevents the property from being invoked to 'get' this attribute, which could possibly cause an error.

## How was this patch tested?

Added a test case with a class has a property that will raise an error when invoked and then call`Param.params` to verify that the property is not invoked, but still able to find another property in the class.  Also ran pyspark-ml test before fix that will trigger an error, and again after the fix to verify that the error was resolved and the method was working properly.

Author: Bryan Cutler <cutlerb@gmail.com>

Closes #11476 from BryanCutler/pyspark-ml-property-attr-SPARK-13625.
2016-03-08 17:34:25 -08:00
Wenchen Fan d57daf1f77 [SPARK-13593] [SQL] improve the createDataFrame to accept data type string and verify the data
## What changes were proposed in this pull request?

This PR improves the `createDataFrame` method to make it also accept datatype string, then users can convert python RDD to DataFrame easily, for example, `df = rdd.toDF("a: int, b: string")`.
It also supports flat schema so users can convert an RDD of int to DataFrame directly, we will automatically wrap int to row for users.
If schema is given, now we checks if the real data matches the given schema, and throw error if it doesn't.

## How was this patch tested?

new tests in `test.py` and doc test in `types.py`

Author: Wenchen Fan <wenchen@databricks.com>

Closes #11444 from cloud-fan/pyrdd.
2016-03-08 14:00:03 -08:00
Wenchen Fan d5ce61722f [SPARK-13740][SQL] add null check for _verify_type in types.py
## What changes were proposed in this pull request?

This PR adds null check in `_verify_type` according to the nullability information.

## How was this patch tested?

new doc tests

Author: Wenchen Fan <wenchen@databricks.com>

Closes #11574 from cloud-fan/py-null-check.
2016-03-08 13:46:17 -08:00
Sean Owen 0eea12a3d9 [SPARK-13596][BUILD] Move misc top-level build files into appropriate subdirs
## What changes were proposed in this pull request?

Move many top-level files in dev/ or other appropriate directory. In particular, put `make-distribution.sh` in `dev` and update docs accordingly. Remove deprecated `sbt/sbt`.

I was (so far) unable to figure out how to move `tox.ini`. `scalastyle-config.xml` should be movable but edits to the project `.sbt` files didn't work; config file location is updatable for compile but not test scope.

## How was this patch tested?

`./dev/run-tests` to verify RAT and checkstyle work. Jenkins tests for the rest.

Author: Sean Owen <sowen@cloudera.com>

Closes #11522 from srowen/SPARK-13596.
2016-03-07 14:48:02 -08:00
Dongjoon Hyun e72914f37d [SPARK-12243][BUILD][PYTHON] PySpark tests are slow in Jenkins.
## What changes were proposed in this pull request?

In the Jenkins pull request builder, PySpark tests take around [962 seconds ](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/52530/console) of end-to-end time to run, despite the fact that we run four Python test suites in parallel. According to the log, the basic reason is that the long running test starts at the end due to FIFO queue. We first try to reduce the test time by just starting some long running tests first with simple priority queue.

```
========================================================================
Running PySpark tests
========================================================================
...
Finished test(python3.4): pyspark.streaming.tests (213s)
Finished test(pypy): pyspark.sql.tests (92s)
Finished test(pypy): pyspark.streaming.tests (280s)
Tests passed in 962 seconds
```

## How was this patch tested?

Manual check.
Check 'Running PySpark tests' part of the Jenkins log.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #11551 from dongjoon-hyun/SPARK-12243.
2016-03-07 12:06:46 -08:00
Shixiong Zhu ee913e6e2d [SPARK-13697] [PYSPARK] Fix the missing module name of TransformFunctionSerializer.loads
## What changes were proposed in this pull request?

Set the function's module name to `__main__` if it's missing in `TransformFunctionSerializer.loads`.

## How was this patch tested?

Manually test in the shell.

Before this patch:
```
>>> from pyspark.streaming import StreamingContext
>>> from pyspark.streaming.util import TransformFunction
>>> ssc = StreamingContext(sc, 1)
>>> func = TransformFunction(sc, lambda x: x, sc.serializer)
>>> func.rdd_wrapper(lambda x: x)
TransformFunction(<function <lambda> at 0x106ac8b18>)
>>> bytes = bytearray(ssc._transformerSerializer.serializer.dumps((func.func, func.rdd_wrap_func, func.deserializers)))
>>> func2 = ssc._transformerSerializer.loads(bytes)
>>> print(func2.func.__module__)
None
>>> print(func2.rdd_wrap_func.__module__)
None
>>>
```
After this patch:
```
>>> from pyspark.streaming import StreamingContext
>>> from pyspark.streaming.util import TransformFunction
>>> ssc = StreamingContext(sc, 1)
>>> func = TransformFunction(sc, lambda x: x, sc.serializer)
>>> func.rdd_wrapper(lambda x: x)
TransformFunction(<function <lambda> at 0x108bf1b90>)
>>> bytes = bytearray(ssc._transformerSerializer.serializer.dumps((func.func, func.rdd_wrap_func, func.deserializers)))
>>> func2 = ssc._transformerSerializer.loads(bytes)
>>> print(func2.func.__module__)
__main__
>>> print(func2.rdd_wrap_func.__module__)
__main__
>>>
```

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #11535 from zsxwing/loads-module.
2016-03-06 08:57:01 -08:00
gatorsmile adce5ee721 [SPARK-12720][SQL] SQL Generation Support for Cube, Rollup, and Grouping Sets
#### What changes were proposed in this pull request?

This PR is for supporting SQL generation for cube, rollup and grouping sets.

For example, a query using rollup:
```SQL
SELECT count(*) as cnt, key % 5, grouping_id() FROM t1 GROUP BY key % 5 WITH ROLLUP
```
Original logical plan:
```
  Aggregate [(key#17L % cast(5 as bigint))#47L,grouping__id#46],
            [(count(1),mode=Complete,isDistinct=false) AS cnt#43L,
             (key#17L % cast(5 as bigint))#47L AS _c1#45L,
             grouping__id#46 AS _c2#44]
  +- Expand [List(key#17L, value#18, (key#17L % cast(5 as bigint))#47L, 0),
             List(key#17L, value#18, null, 1)],
            [key#17L,value#18,(key#17L % cast(5 as bigint))#47L,grouping__id#46]
     +- Project [key#17L,
                 value#18,
                 (key#17L % cast(5 as bigint)) AS (key#17L % cast(5 as bigint))#47L]
        +- Subquery t1
           +- Relation[key#17L,value#18] ParquetRelation
```
Converted SQL:
```SQL
  SELECT count( 1) AS `cnt`,
         (`t1`.`key` % CAST(5 AS BIGINT)),
         grouping_id() AS `_c2`
  FROM `default`.`t1`
  GROUP BY (`t1`.`key` % CAST(5 AS BIGINT))
  GROUPING SETS (((`t1`.`key` % CAST(5 AS BIGINT))), ())
```

#### How was the this patch tested?

Added eight test cases in `LogicalPlanToSQLSuite`.

Author: gatorsmile <gatorsmile@gmail.com>
Author: xiaoli <lixiao1983@gmail.com>
Author: Xiao Li <xiaoli@Xiaos-MacBook-Pro.local>

Closes #11283 from gatorsmile/groupingSetsToSQL.
2016-03-05 19:25:03 +08:00
Xusen Yin 83302c3bff [SPARK-13036][SPARK-13318][SPARK-13319] Add save/load for feature.py
Add save/load for feature.py. Meanwhile, add save/load for `ElementwiseProduct` in Scala side and fix a bug of missing `setDefault` in `VectorSlicer` and `StopWordsRemover`.

In this PR I ignore the `RFormula` and `RFormulaModel` because its Scala implementation is pending in https://github.com/apache/spark/pull/9884. I'll add them in this PR if https://github.com/apache/spark/pull/9884 gets merged first. Or add a follow-up JIRA for `RFormula`.

Author: Xusen Yin <yinxusen@gmail.com>

Closes #11203 from yinxusen/SPARK-13036.
2016-03-04 08:32:24 -08:00
Dongjoon Hyun c8f25459ed [SPARK-13676] Fix mismatched default values for regParam in LogisticRegression
## What changes were proposed in this pull request?

The default value of regularization parameter for `LogisticRegression` algorithm is different in Scala and Python. We should provide the same value.

**Scala**
```
scala> new org.apache.spark.ml.classification.LogisticRegression().getRegParam
res0: Double = 0.0
```

**Python**
```
>>> from pyspark.ml.classification import LogisticRegression
>>> LogisticRegression().getRegParam()
0.1
```

## How was this patch tested?
manual. Check the following in `pyspark`.
```
>>> from pyspark.ml.classification import LogisticRegression
>>> LogisticRegression().getRegParam()
0.0
```

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #11519 from dongjoon-hyun/SPARK-13676.
2016-03-04 08:25:41 -08:00
Wenchen Fan 15d57f9c23 [SPARK-13647] [SQL] also check if numeric value is within allowed range in _verify_type
## What changes were proposed in this pull request?

This PR makes the `_verify_type` in `types.py` more strict, also check if numeric value is within allowed range.

## How was this patch tested?

newly added doc test.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #11492 from cloud-fan/py-verify.
2016-03-03 20:16:37 -08:00
Dongjoon Hyun 941b270b70 [MINOR] Fix typos in comments and testcase name of code
## What changes were proposed in this pull request?

This PR fixes typos in comments and testcase name of code.

## How was this patch tested?

manual.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #11481 from dongjoon-hyun/minor_fix_typos_in_code.
2016-03-03 22:42:12 +00:00
hyukjinkwon cf95d728c6 [SPARK-13543][SQL] Support for specifying compression codec for Parquet/ORC via option()
## What changes were proposed in this pull request?

This PR adds the support to specify compression codecs for both ORC and Parquet.

## How was this patch tested?

unittests within IDE and code style tests with `dev/run_tests`.

Author: hyukjinkwon <gurwls223@gmail.com>

Closes #11464 from HyukjinKwon/SPARK-13543.
2016-03-03 10:30:55 -08:00
JeremyNixon 511d4929c8 [SPARK-12877][ML] Add train-validation-split to pyspark
## What changes were proposed in this pull request?
The changes proposed were to add train-validation-split to pyspark.ml.tuning.

## How was the this patch tested?
This patch was tested through unit tests located in pyspark/ml/test.py.

This is my original work and I license it to Spark.

Author: JeremyNixon <jnixon2@gmail.com>

Closes #11335 from JeremyNixon/tvs_pyspark.
2016-03-03 09:50:05 -08:00
Wenchen Fan 4dd24811d9 [SPARK-13594][SQL] remove typed operations(e.g. map, flatMap) from python DataFrame
## What changes were proposed in this pull request?

Remove `map`, `flatMap`, `mapPartitions` from python DataFrame, to prepare for Dataset API in the future.

## How was this patch tested?

existing tests

Author: Wenchen Fan <wenchen@databricks.com>

Closes #11445 from cloud-fan/python-clean.
2016-03-02 15:26:34 -08:00
Joseph K. Bradley 9495c40f22 [SPARK-13008][ML][PYTHON] Put one alg per line in pyspark.ml all lists
This is to fix a long-time annoyance: Whenever we add a new algorithm to pyspark.ml, we have to add it to the ```__all__``` list at the top.  Since we keep it alphabetized, it often creates a lot more changes than needed.  It is also easy to add the Estimator and forget the Model.  I'm going to switch it to have one algorithm per line.

This also alphabetizes a few out-of-place classes in pyspark.ml.feature.  No changes have been made to the moved classes.

CC: thunterdb

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #10927 from jkbradley/ml-python-all-list.
2016-03-01 21:26:47 -08:00
hyukjinkwon 02aa499dfb [SPARK-13509][SPARK-13507][SQL] Support for writing CSV with a single function call
https://issues.apache.org/jira/browse/SPARK-13507
https://issues.apache.org/jira/browse/SPARK-13509

## What changes were proposed in this pull request?
This PR adds the support to write CSV data directly by a single call to the given path.

Several unitests were added for each functionality.
## How was this patch tested?

This was tested with unittests and with `dev/run_tests` for coding style

Author: hyukjinkwon <gurwls223@gmail.com>
Author: Hyukjin Kwon <gurwls223@gmail.com>

Closes #11389 from HyukjinKwon/SPARK-13507-13509.
2016-02-29 09:44:29 -08:00
vijaykiran 236e3c8fbc [SPARK-12633][PYSPARK] [DOC] PySpark regression parameter desc to consistent format
Part of task for [SPARK-11219](https://issues.apache.org/jira/browse/SPARK-11219) to make PySpark MLlib parameter description formatting consistent. This is for the regression module.  Also, updated 2 params in classification to read as `Supported values:` to be consistent.

closes #10600

Author: vijaykiran <mail@vijaykiran.com>
Author: Bryan Cutler <cutlerb@gmail.com>

Closes #11404 from BryanCutler/param-desc-consistent-regression-SPARK-12633.
2016-02-29 15:52:41 +02:00
Yanbo Liang d81a71357e [SPARK-13545][MLLIB][PYSPARK] Make MLlib LogisticRegressionWithLBFGS's default parameters consistent in Scala and Python
## What changes were proposed in this pull request?
* The default value of ```regParam``` of PySpark MLlib ```LogisticRegressionWithLBFGS``` should be consistent with Scala which is ```0.0```. (This is also consistent with ML ```LogisticRegression```.)
* BTW, if we use a known updater(L1 or L2) for binary classification, ```LogisticRegressionWithLBFGS``` will call the ML implementation. We should update the API doc to clarifying ```numCorrections``` will have no effect if we fall into that route.
* Make a pass for all parameters of ```LogisticRegressionWithLBFGS```, others are set properly.

cc mengxr dbtsai
## How was this patch tested?
No new tests, it should pass all current tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #11424 from yanboliang/spark-13545.
2016-02-29 00:55:51 -08:00
zlpmichelle 1e5fcdf96c [SPARK-13505][ML] add python api for MaxAbsScaler
## What changes were proposed in this pull request?
After SPARK-13028, we should add Python API for MaxAbsScaler.

## How was this patch tested?
unit test

Author: zlpmichelle <zlpmichelle@gmail.com>

Closes #11393 from zlpmichelle/master.
2016-02-26 14:37:44 -08:00
Bryan Cutler b33261f913 [SPARK-12634][PYSPARK][DOC] PySpark tree parameter desc to consistent format
Part of task for [SPARK-11219](https://issues.apache.org/jira/browse/SPARK-11219) to make PySpark MLlib parameter description formatting consistent.  This is for the tree module.

closes #10601

Author: Bryan Cutler <cutlerb@gmail.com>
Author: vijaykiran <mail@vijaykiran.com>

Closes #11353 from BryanCutler/param-desc-consistent-tree-SPARK-12634.
2016-02-26 08:30:32 -08:00
Tommy YU f3be369ef7 [SPARK-13033] [ML] [PYSPARK] Add import/export for ml.regression
Add export/import for all estimators and transformers(which have Scala implementation) under pyspark/ml/regression.py.

yanboliang Please help to review.
For doctest, I though it's enough to add one since it's common usage. But I can add to all if we want it.

Author: Tommy YU <tummyyu@163.com>

Closes #11000 from Wenpei/spark-13033-ml.regression-exprot-import and squashes the following commits:

3646b36 [Tommy YU] address review comments
9cddc98 [Tommy YU] change base on review and pr 11197
cc61d9d [Tommy YU] remove default parameter set
19535d4 [Tommy YU] add export/import to regression
44a9dc2 [Tommy YU] add import/export for ml.regression
2016-02-25 21:09:02 -08:00
Yu ISHIKAWA 35316cb0b7 [SPARK-13292] [ML] [PYTHON] QuantileDiscretizer should take random seed in PySpark
## What changes were proposed in this pull request?
QuantileDiscretizer in Python should also specify a random seed.

## How was this patch tested?
unit tests

Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #11362 from yu-iskw/SPARK-13292 and squashes the following commits:

02ffa76 [Yu ISHIKAWA] [SPARK-13292][ML][PYTHON] QuantileDiscretizer should take random seed in PySpark
2016-02-25 13:29:10 -08:00
Kai Jiang 4d2864b2d7 [SPARK-7106][MLLIB][PYSPARK] Support model save/load in Python's FPGrowth
## What changes were proposed in this pull request?

Python API supports mode save/load in FPGrowth
JIRA: [https://issues.apache.org/jira/browse/SPARK-7106](https://issues.apache.org/jira/browse/SPARK-7106)
## How was the this patch tested?

The patch is tested with Python doctest.

Author: Kai Jiang <jiangkai@gmail.com>

Closes #11321 from vectorijk/spark-7106.
2016-02-24 23:22:14 -08:00
Joseph K. Bradley 13ce10e954 [SPARK-13479][SQL][PYTHON] Added Python API for approxQuantile
## What changes were proposed in this pull request?

* Scala DataFrameStatFunctions: Added version of approxQuantile taking a List instead of an Array, for Python compatbility
* Python DataFrame and DataFrameStatFunctions: Added approxQuantile

## How was this patch tested?

* unit test in sql/tests.py

Documentation was copied from the existing approxQuantile exactly.

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #11356 from jkbradley/approx-quantile-python.
2016-02-24 23:15:36 -08:00
Nong Li 5a7af9e7ac [SPARK-13250] [SQL] Update PhysicallRDD to convert to UnsafeRow if using the vectorized scanner.
Some parts of the engine rely on UnsafeRow which the vectorized parquet scanner does not want
to produce. This add a conversion in Physical RDD. In the case where codegen is used (and the
scan is the start of the pipeline), there is no requirement to use UnsafeRow. This patch adds
update PhysicallRDD to support codegen, which eliminates the need for the UnsafeRow conversion
in all cases.

The result of these changes for TPCDS-Q19 at the 10gb sf reduces the query time from 9.5 seconds
to 6.5 seconds.

Author: Nong Li <nong@databricks.com>

Closes #11141 from nongli/spark-13250.
2016-02-24 17:16:45 -08:00
Wenchen Fan a60f91284c [SPARK-13467] [PYSPARK] abstract python function to simplify pyspark code
## What changes were proposed in this pull request?

When we pass a Python function to JVM side, we also need to send its context, e.g. `envVars`, `pythonIncludes`, `pythonExec`, etc. However, it's annoying to pass around so many parameters at many places. This PR abstract python function along with its context, to simplify some pyspark code and make the logic more clear.

## How was the this patch tested?

by existing unit tests.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #11342 from cloud-fan/python-clean.
2016-02-24 12:44:54 -08:00
Davies Liu c481bdf512 [SPARK-13329] [SQL] considering output for statistics of logical plan
The current implementation of statistics of UnaryNode does not considering output (for example, Project may product much less columns than it's child), we should considering it to have a better guess.

We usually only join with few columns from a parquet table, the size of projected plan could be much smaller than the original parquet files. Having a better guess of size help we choose between broadcast join or sort merge join.

After this PR, I saw a few queries choose broadcast join other than sort merge join without turning spark.sql.autoBroadcastJoinThreshold for every query, ended up with about 6-8X improvements on end-to-end time.

We use `defaultSize` of DataType to estimate the size of a column, currently For DecimalType/StringType/BinaryType and UDT, we are over-estimate too much (4096 Bytes), so this PR change them to some more reasonable values. Here are the new defaultSize for them:

DecimalType:  8 or 16 bytes, based on the precision
StringType:  20 bytes
BinaryType: 100 bytes
UDF: default size of SQL type

These numbers are not perfect (hard to have a perfect number for them), but should be better than 4096.

Author: Davies Liu <davies@databricks.com>

Closes #11210 from davies/statics.
2016-02-23 12:55:44 -08:00
Yanbo Liang 72427c3e11 [SPARK-13429][MLLIB] Unify Logistic Regression convergence tolerance of ML & MLlib
## What changes were proposed in this pull request?
In order to provide better and consistent result, let's change the default value of MLlib ```LogisticRegressionWithLBFGS convergenceTol``` from ```1E-4``` to ```1E-6``` which will be equal to ML ```LogisticRegression```.
cc dbtsai
## How was the this patch tested?
unit tests

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #11299 from yanboliang/spark-13429.
2016-02-22 23:37:09 -08:00
Bryan Cutler e298ac91e3 [SPARK-12632][PYSPARK][DOC] PySpark fpm and als parameter desc to consistent format
Part of task for [SPARK-11219](https://issues.apache.org/jira/browse/SPARK-11219) to make PySpark MLlib parameter description formatting consistent.  This is for the fpm and recommendation modules.

Closes #10602
Closes #10897

Author: Bryan Cutler <cutlerb@gmail.com>
Author: somideshmukh <somilde@us.ibm.com>

Closes #11186 from BryanCutler/param-desc-consistent-fpmrecc-SPARK-12632.
2016-02-22 12:48:37 +02:00
Dongjoon Hyun 024482bf51 [MINOR][DOCS] Fix all typos in markdown files of doc and similar patterns in other comments
## What changes were proposed in this pull request?

This PR tries to fix all typos in all markdown files under `docs` module,
and fixes similar typos in other comments, too.

## How was the this patch tested?

manual tests.

Author: Dongjoon Hyun <dongjoon@apache.org>

Closes #11300 from dongjoon-hyun/minor_fix_typos.
2016-02-22 09:52:07 +00:00
Yong Gang Cao ef1047fca7 [SPARK-12153][SPARK-7617][MLLIB] add support of arbitrary length sentence and other tuning for Word2Vec
add support of arbitrary length sentence by using the nature representation of sentences in the input.

add new similarity functions and add normalization option for distances in synonym finding
add new accessor for internal structure(the vocabulary and wordindex) for convenience

need instructions about how to set value for the Since annotation for newly added public functions. 1.5.3?

jira link: https://issues.apache.org/jira/browse/SPARK-12153

Author: Yong Gang Cao <ygcao@amazon.com>
Author: Yong-Gang Cao <ygcao@users.noreply.github.com>

Closes #10152 from ygcao/improvementForSentenceBoundary.
2016-02-22 09:47:36 +00:00
Franklyn D'souza 0f90f4e6ac [SPARK-13410][SQL] Support unionAll for DataFrames with UDT columns.
## What changes were proposed in this pull request?

This PR adds equality operators to UDT classes so that they can be correctly tested for dataType equality during union operations.

This was previously causing `"AnalysisException: u"unresolved operator 'Union;""` when trying to unionAll two dataframes with UDT columns as below.

```
from pyspark.sql.tests import PythonOnlyPoint, PythonOnlyUDT
from pyspark.sql import types

schema = types.StructType([types.StructField("point", PythonOnlyUDT(), True)])

a = sqlCtx.createDataFrame([[PythonOnlyPoint(1.0, 2.0)]], schema)
b = sqlCtx.createDataFrame([[PythonOnlyPoint(3.0, 4.0)]], schema)

c = a.unionAll(b)
```

## How was the this patch tested?

Tested using two unit tests in sql/test.py and the DataFrameSuite.

Additional information here : https://issues.apache.org/jira/browse/SPARK-13410

Author: Franklyn D'souza <franklynd@gmail.com>

Closes #11279 from damnMeddlingKid/udt-union-all.
2016-02-21 16:58:17 -08:00
Cheng Lian d9efe63ecd [SPARK-12799] Simplify various string output for expressions
This PR introduces several major changes:

1. Replacing `Expression.prettyString` with `Expression.sql`

   The `prettyString` method is mostly an internal, developer faced facility for debugging purposes, and shouldn't be exposed to users.

1. Using SQL-like representation as column names for selected fields that are not named expression (back-ticks and double quotes should be removed)

   Before, we were using `prettyString` as column names when possible, and sometimes the result column names can be weird.  Here are several examples:

   Expression         | `prettyString` | `sql`      | Note
   ------------------ | -------------- | ---------- | ---------------
   `a && b`           | `a && b`       | `a AND b`  |
   `a.getField("f")`  | `a[f]`         | `a.f`      | `a` is a struct

1. Adding trait `NonSQLExpression` extending from `Expression` for expressions that don't have a SQL representation (e.g. Scala UDF/UDAF and Java/Scala object expressions used for encoders)

   `NonSQLExpression.sql` may return an arbitrary user facing string representation of the expression.

Author: Cheng Lian <lian@databricks.com>

Closes #10757 from liancheng/spark-12799.simplify-expression-string-methods.
2016-02-21 22:53:15 +08:00
Holden Karau 9ca79c1ece [SPARK-13302][PYSPARK][TESTS] Move the temp file creation and cleanup outside of the doctests
Some of the new doctests in ml/clustering.py have a lot of setup code, move the setup code to the general test init to keep the doctest more example-style looking.
In part this is a follow up to https://github.com/apache/spark/pull/10999
Note that the same pattern is followed in regression & recommendation - might as well clean up all three at the same time.

Author: Holden Karau <holden@us.ibm.com>

Closes #11197 from holdenk/SPARK-13302-cleanup-doctests-in-ml-clustering.
2016-02-20 09:07:19 +00:00
Reynold Xin 6624a588c1 Revert "[SPARK-12567] [SQL] Add aes_{encrypt,decrypt} UDFs"
This reverts commit 4f9a664818.
2016-02-19 22:44:20 -08:00
Kai Jiang 4f9a664818 [SPARK-12567] [SQL] Add aes_{encrypt,decrypt} UDFs
Author: Kai Jiang <jiangkai@gmail.com>

Closes #10527 from vectorijk/spark-12567.
2016-02-19 22:28:47 -08:00
Sean Owen fb7e21797e [SPARK-13339][DOCS] Clarify commutative / associative operator requirements for reduce, fold
Clarify that reduce functions need to be commutative, and fold functions do not

See https://github.com/apache/spark/pull/11091

Author: Sean Owen <sowen@cloudera.com>

Closes #11217 from srowen/SPARK-13339.
2016-02-19 10:26:38 +00:00
Miles Yucht 827ed1c067 Correct SparseVector.parse documentation
There's a small typo in the SparseVector.parse docstring (which says that it returns a DenseVector rather than a SparseVector), which seems to be incorrect.

Author: Miles Yucht <miles@databricks.com>

Closes #11213 from mgyucht/fix-sparsevector-docs.
2016-02-16 13:01:21 +00:00
Reynold Xin 354d4c24be [SPARK-13296][SQL] Move UserDefinedFunction into sql.expressions.
This pull request has the following changes:

1. Moved UserDefinedFunction into expressions package. This is more consistent with how we structure the packages for window functions and UDAFs.

2. Moved UserDefinedPythonFunction into execution.python package, so we don't have a random private class in the top level sql package.

3. Move everything in execution/python.scala into the newly created execution.python package.

Most of the diffs are just straight copy-paste.

Author: Reynold Xin <rxin@databricks.com>

Closes #11181 from rxin/SPARK-13296.
2016-02-13 21:06:31 -08:00
Liang-Chi Hsieh e3441e3f68 [SPARK-12363][MLLIB] Remove setRun and fix PowerIterationClustering failed test
JIRA: https://issues.apache.org/jira/browse/SPARK-12363

This issue is pointed by yanboliang. When `setRuns` is removed from PowerIterationClustering, one of the tests will be failed. I found that some `dstAttr`s of the normalized graph are not correct values but 0.0. By setting `TripletFields.All` in `mapTriplets` it can work.

Author: Liang-Chi Hsieh <viirya@gmail.com>
Author: Xiangrui Meng <meng@databricks.com>

Closes #10539 from viirya/fix-poweriter.
2016-02-13 15:56:20 -08:00
vijaykiran 42d656814f [SPARK-12630][PYSPARK] [DOC] PySpark classification parameter desc to consistent format
Part of task for [SPARK-11219](https://issues.apache.org/jira/browse/SPARK-11219) to make PySpark MLlib parameter description formatting consistent. This is for the classification module.

Author: vijaykiran <mail@vijaykiran.com>
Author: Bryan Cutler <cutlerb@gmail.com>

Closes #11183 from BryanCutler/pyspark-consistent-param-classification-SPARK-12630.
2016-02-12 14:24:24 -08:00
Yanbo Liang 90de6b2fae [SPARK-12962] [SQL] [PySpark] PySpark support covar_samp and covar_pop
PySpark support ```covar_samp``` and ```covar_pop```.

cc rxin davies marmbrus

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #10876 from yanboliang/spark-12962.
2016-02-12 12:43:13 -08:00
Holden Karau 64515e5fbf [SPARK-13154][PYTHON] Add linting for pydocs
We should have lint rules using sphinx to automatically catch the pydoc issues that are sometimes introduced.

Right now ./dev/lint-python will skip building the docs if sphinx isn't present - but it might make sense to fail hard - just a matter of if we want to insist all PySpark developers have sphinx present.

Author: Holden Karau <holden@us.ibm.com>

Closes #11109 from holdenk/SPARK-13154-add-pydoc-lint-for-docs.
2016-02-12 02:13:06 -08:00
Yanbo Liang a183dda6ab [SPARK-12974][ML][PYSPARK] Add Python API for spark.ml bisecting k-means
Add Python API for spark.ml bisecting k-means.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #10889 from yanboliang/spark-12974.
2016-02-12 01:45:45 -08:00
Tommy YU d3e2e20299 [SPARK-13153][PYSPARK] ML persistence failed when handle no default value parameter
Fix this defect by check default value exist or not.

yanboliang Please help to review.

Author: Tommy YU <tummyyu@163.com>

Closes #11043 from Wenpei/spark-13153-handle-param-withnodefaultvalue.
2016-02-11 18:38:49 -08:00
sethah b354673886 [SPARK-13047][PYSPARK][ML] Pyspark Params.hasParam should not throw an error
Pyspark Params class has a method `hasParam(paramName)` which returns `True` if the class has a parameter by that name, but throws an `AttributeError` otherwise. There is not currently a way of getting a Boolean to indicate if a class has a parameter. With Spark 2.0 we could modify the existing behavior of `hasParam` or add an additional method with this functionality.

In Python:
```python
from pyspark.ml.classification import NaiveBayes
nb = NaiveBayes()
print nb.hasParam("smoothing")
print nb.hasParam("notAParam")
```
produces:
> True
> AttributeError: 'NaiveBayes' object has no attribute 'notAParam'

However, in Scala:
```scala
import org.apache.spark.ml.classification.NaiveBayes
val nb  = new NaiveBayes()
nb.hasParam("smoothing")
nb.hasParam("notAParam")
```
produces:
> true
> false

cc holdenk

Author: sethah <seth.hendrickson16@gmail.com>

Closes #10962 from sethah/SPARK-13047.
2016-02-11 16:42:44 -08:00
Yanbo Liang 30e0095566 [SPARK-13035][ML][PYSPARK] PySpark ml.clustering support export/import
PySpark ml.clustering support export/import.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #10999 from yanboliang/spark-13035.
2016-02-11 15:55:40 -08:00
Yanbo Liang 2426eb3e16 [MINOR][ML][PYSPARK] Cleanup test cases of clustering.py
Test cases should be removed from annotation of ```setXXX``` function, otherwise it will be parts of [Python API docs](https://spark.apache.org/docs/latest/api/python/pyspark.ml.html#pyspark.ml.clustering.KMeans.setInitMode).
cc mengxr jkbradley

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #10975 from yanboliang/clustering-cleanup.
2016-02-11 15:53:45 -08:00
Kai Jiang c8f667d7c1 [SPARK-13037][ML][PYSPARK] PySpark ml.recommendation support export/import
PySpark ml.recommendation support export/import.

Author: Kai Jiang <jiangkai@gmail.com>

Closes #11044 from vectorijk/spark-13037.
2016-02-11 15:50:33 -08:00
Davies Liu b5761d150b [SPARK-12706] [SQL] grouping() and grouping_id()
Grouping() returns a column is aggregated or not, grouping_id() returns the aggregation levels.

grouping()/grouping_id() could be used with window function, but does not work in having/sort clause, will be fixed by another PR.

The GROUPING__ID/grouping_id() in Hive is wrong (according to docs), we also did it wrongly, this PR change that to match the behavior in most databases (also the docs of Hive).

Author: Davies Liu <davies@databricks.com>

Closes #10677 from davies/grouping.
2016-02-10 20:13:38 -08:00
Nam Pham edf4a0e62e [SPARK-12986][DOC] Fix pydoc warnings in mllib/regression.py
I have fixed the warnings by running "make html" under "python/docs/". They are caused by not having blank lines around indented paragraphs.

Author: Nam Pham <phamducnam@gmail.com>

Closes #11025 from nampham2/SPARK-12986.
2016-02-08 11:06:41 -08:00
Tommy YU 81da3bee66 [SPARK-5865][API DOC] Add doc warnings for methods that return local data structures
rxin srowen
I work out note message for rdd.take function, please help to review.

If it's fine, I can apply to all other function later.

Author: Tommy YU <tummyyu@163.com>

Closes #10874 from Wenpei/spark-5865-add-warning-for-localdatastructure.
2016-02-06 17:29:09 +00:00
Shixiong Zhu 335f10edad [SPARK-7997][CORE] Add rpcEnv.awaitTermination() back to SparkEnv
`rpcEnv.awaitTermination()` was not added in #10854 because some Streaming Python tests hung forever.

This patch fixed the hung issue and added rpcEnv.awaitTermination() back to SparkEnv.

Previously, Streaming Kafka Python tests shutdowns the zookeeper server before stopping StreamingContext. Then when stopping StreamingContext, KafkaReceiver may be hung due to https://issues.apache.org/jira/browse/KAFKA-601, hence, some thread of RpcEnv's Dispatcher cannot exit and rpcEnv.awaitTermination is hung.The patch just changed the shutdown order to fix it.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #11031 from zsxwing/awaitTermination.
2016-02-02 21:13:54 -08:00
Bryan Cutler cba1d6b659 [SPARK-12631][PYSPARK][DOC] PySpark clustering parameter desc to consistent format
Part of task for [SPARK-11219](https://issues.apache.org/jira/browse/SPARK-11219) to make PySpark MLlib parameter description formatting consistent.  This is for the clustering module.

Author: Bryan Cutler <cutlerb@gmail.com>

Closes #10610 from BryanCutler/param-desc-consistent-cluster-SPARK-12631.
2016-02-02 10:50:22 -08:00
Herman van Hovell 5a8b978fab [SPARK-13049] Add First/last with ignore nulls to functions.scala
This PR adds the ability to specify the ```ignoreNulls``` option to the functions dsl, e.g:
```df.select($"id", last($"value", ignoreNulls = true).over(Window.partitionBy($"id").orderBy($"other"))```

This PR is some where between a bug fix (see the JIRA) and a new feature. I am not sure if we should backport to 1.6.

cc yhuai

Author: Herman van Hovell <hvanhovell@questtec.nl>

Closes #10957 from hvanhovell/SPARK-13049.
2016-01-31 13:56:13 -08:00
Yanbo Liang e51b6eaa9e [SPARK-13032][ML][PYSPARK] PySpark support model export/import and take LinearRegression as example
* Implement ```MLWriter/MLWritable/MLReader/MLReadable``` for PySpark.
* Making ```LinearRegression``` to support ```save/load``` as example. After this merged, the work for other transformers/estimators will be easy, then we can list and distribute the tasks to the community.

cc mengxr jkbradley

Author: Yanbo Liang <ybliang8@gmail.com>
Author: Joseph K. Bradley <joseph@databricks.com>

Closes #10469 from yanboliang/spark-11939.
2016-01-29 09:22:24 -08:00
Brandon Bradley 3a40c0e575 [SPARK-12749][SQL] add json option to parse floating-point types as DecimalType
I tried to add this via `USE_BIG_DECIMAL_FOR_FLOATS` option from Jackson with no success.

Added test for non-complex types. Should I add a test for complex types?

Author: Brandon Bradley <bradleytastic@gmail.com>

Closes #10936 from blbradley/spark-12749.
2016-01-28 15:25:57 -08:00
Jason Lee edd473751b [SPARK-10847][SQL][PYSPARK] Pyspark - DataFrame - Optional Metadata with None triggers cryptic failure
The error message is now changed from "Do not support type class scala.Tuple2." to "Do not support type class org.json4s.JsonAST$JNull$" to be more informative about what is not supported. Also, StructType metadata now handles JNull correctly, i.e., {'a': None}. test_metadata_null is added to tests.py to show the fix works.

Author: Jason Lee <cjlee@us.ibm.com>

Closes #8969 from jasoncl/SPARK-10847.
2016-01-27 09:55:10 -08:00
Xusen Yin 4db255c7aa [SPARK-12780] Inconsistency returning value of ML python models' properties
https://issues.apache.org/jira/browse/SPARK-12780

Author: Xusen Yin <yinxusen@gmail.com>

Closes #10724 from yinxusen/SPARK-12780.
2016-01-26 21:16:56 -08:00
Holden Karau eb917291ca [SPARK-10509][PYSPARK] Reduce excessive param boiler plate code
The current python ml params require cut-and-pasting the param setup and description between the class & ```__init__``` methods. Remove this possible case of errors & simplify use of custom params by adding a ```_copy_new_parent``` method to param so as to avoid cut and pasting (and cut and pasting at different indentation levels urgh).

Author: Holden Karau <holden@us.ibm.com>

Closes #10216 from holdenk/SPARK-10509-excessive-param-boiler-plate-code.
2016-01-26 15:53:48 -08:00
Jeff Zhang 19fdb21afb [SPARK-12993][PYSPARK] Remove usage of ADD_FILES in pyspark
environment variable ADD_FILES is created for adding python files on spark context to be distributed to executors (SPARK-865), this is deprecated now. User are encouraged to use --py-files for adding python files.

Author: Jeff Zhang <zjffdu@apache.org>

Closes #10913 from zjffdu/SPARK-12993.
2016-01-26 14:58:39 -08:00
Xusen Yin 8beab68152 [SPARK-11923][ML] Python API for ml.feature.ChiSqSelector
https://issues.apache.org/jira/browse/SPARK-11923

Author: Xusen Yin <yinxusen@gmail.com>

Closes #10186 from yinxusen/SPARK-11923.
2016-01-26 11:56:46 -08:00
Xiangrui Meng 27c910f7f2 [SPARK-10086][MLLIB][STREAMING][PYSPARK] ignore StreamingKMeans test in PySpark for now
I saw several failures from recent PR builds, e.g., https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/50015/consoleFull. This PR marks the test as ignored and we will fix the flakyness in SPARK-10086.

gliptak Do you know why the test failure didn't show up in the Jenkins "Test Result"?

cc: jkbradley

Author: Xiangrui Meng <meng@databricks.com>

Closes #10909 from mengxr/SPARK-10086.
2016-01-25 22:53:34 -08:00
Holden Karau b66afdeb52 [SPARK-11922][PYSPARK][ML] Python api for ml.feature.quantile discretizer
Add Python API for ml.feature.QuantileDiscretizer.

One open question: Do we want to do this stuff to re-use the java model, create a new model, or use a different wrapper around the java model.
cc brkyvz & mengxr

Author: Holden Karau <holden@us.ibm.com>

Closes #10085 from holdenk/SPARK-11937-SPARK-11922-Python-API-for-ml.feature.QuantileDiscretizer.
2016-01-25 22:38:31 -08:00
Yanbo Liang dcae355c64 [SPARK-12905][ML][PYSPARK] PCAModel return eigenvalues for PySpark
```PCAModel```  can output ```explainedVariance``` at Python side.

cc mengxr srowen

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #10830 from yanboliang/spark-12905.
2016-01-25 13:54:21 -08:00
Cheng Lian 3327fd2817 [SPARK-12624][PYSPARK] Checks row length when converting Java arrays to Python rows
When actual row length doesn't conform to specified schema field length, we should give a better error message instead of throwing an unintuitive `ArrayOutOfBoundsException`.

Author: Cheng Lian <lian@databricks.com>

Closes #10886 from liancheng/spark-12624.
2016-01-24 19:40:34 -08:00
Jeff Zhang e789b1d2c1 [SPARK-12120][PYSPARK] Improve exception message when failing to init…
…ialize HiveContext in PySpark

davies Mind to review ?

This is the error message after this PR

```
15/12/03 16:59:53 WARN ObjectStore: Failed to get database default, returning NoSuchObjectException
/Users/jzhang/github/spark/python/pyspark/sql/context.py:689: UserWarning: You must build Spark with Hive. Export 'SPARK_HIVE=true' and run build/sbt assembly
  warnings.warn("You must build Spark with Hive. "
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/jzhang/github/spark/python/pyspark/sql/context.py", line 663, in read
    return DataFrameReader(self)
  File "/Users/jzhang/github/spark/python/pyspark/sql/readwriter.py", line 56, in __init__
    self._jreader = sqlContext._ssql_ctx.read()
  File "/Users/jzhang/github/spark/python/pyspark/sql/context.py", line 692, in _ssql_ctx
    raise e
py4j.protocol.Py4JJavaError: An error occurred while calling None.org.apache.spark.sql.hive.HiveContext.
: java.lang.RuntimeException: java.net.ConnectException: Call From jzhangMBPr.local/127.0.0.1 to 0.0.0.0:9000 failed on connection exception: java.net.ConnectException: Connection refused; For more details see:  http://wiki.apache.org/hadoop/ConnectionRefused
	at org.apache.hadoop.hive.ql.session.SessionState.start(SessionState.java:522)
	at org.apache.spark.sql.hive.client.ClientWrapper.<init>(ClientWrapper.scala:194)
	at org.apache.spark.sql.hive.client.IsolatedClientLoader.createClient(IsolatedClientLoader.scala:238)
	at org.apache.spark.sql.hive.HiveContext.executionHive$lzycompute(HiveContext.scala:218)
	at org.apache.spark.sql.hive.HiveContext.executionHive(HiveContext.scala:208)
	at org.apache.spark.sql.hive.HiveContext.functionRegistry$lzycompute(HiveContext.scala:462)
	at org.apache.spark.sql.hive.HiveContext.functionRegistry(HiveContext.scala:461)
	at org.apache.spark.sql.UDFRegistration.<init>(UDFRegistration.scala:40)
	at org.apache.spark.sql.SQLContext.<init>(SQLContext.scala:330)
	at org.apache.spark.sql.hive.HiveContext.<init>(HiveContext.scala:90)
	at org.apache.spark.sql.hive.HiveContext.<init>(HiveContext.scala:101)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
	at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:57)
	at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
	at java.lang.reflect.Constructor.newInstance(Constructor.java:526)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:234)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:381)
	at py4j.Gateway.invoke(Gateway.java:214)
	at py4j.commands.ConstructorCommand.invokeConstructor(ConstructorCommand.java:79)
	at py4j.commands.ConstructorCommand.execute(ConstructorCommand.java:68)
	at py4j.GatewayConnection.run(GatewayConnection.java:209)
	at java.lang.Thread.run(Thread.java:745)
```

Author: Jeff Zhang <zjffdu@apache.org>

Closes #10126 from zjffdu/SPARK-12120.
2016-01-24 12:29:26 -08:00
Gábor Lipták 9bb35c5b59 [SPARK-11295][PYSPARK] Add packages to JUnit output for Python tests
This is #9263 from gliptak (improving grouping/display of test case results) with a small fix of bisecting k-means unit test.

Author: Gábor Lipták <gliptak@gmail.com>
Author: Xiangrui Meng <meng@databricks.com>

Closes #10850 from mengxr/SPARK-11295.
2016-01-20 11:11:10 -08:00
Xiangrui Meng beda901422 Revert "[SPARK-11295] Add packages to JUnit output for Python tests"
This reverts commit c6f971b4ae.
2016-01-19 16:51:17 -08:00
BenFradet f6f7ca9d2e [SPARK-9716][ML] BinaryClassificationEvaluator should accept Double prediction column
This PR aims to allow the prediction column of `BinaryClassificationEvaluator` to be of double type.

Author: BenFradet <benjamin.fradet@gmail.com>

Closes #10472 from BenFradet/SPARK-9716.
2016-01-19 14:59:20 -08:00
Gábor Lipták c6f971b4ae [SPARK-11295] Add packages to JUnit output for Python tests
SPARK-11295 Add packages to JUnit output for Python tests

This improves grouping/display of test case results.

Author: Gábor Lipták <gliptak@gmail.com>

Closes #9263 from gliptak/SPARK-11295.
2016-01-19 14:06:53 -08:00
Holden Karau 0ddba6d88f [SPARK-11944][PYSPARK][MLLIB] python mllib.clustering.bisecting k means
From the coverage issues for 1.6 : Add Python API for mllib.clustering.BisectingKMeans.

Author: Holden Karau <holden@us.ibm.com>

Closes #10150 from holdenk/SPARK-11937-python-api-coverage-SPARK-11944-python-mllib.clustering.BisectingKMeans.
2016-01-19 10:15:54 -08:00
Sean Owen d8c4b00a23 [SPARK-7683][PYSPARK] Confusing behavior of fold function of RDD in pyspark
Fix order of arguments that Pyspark RDD.fold passes to its op -  should be (acc, obj) like other implementations.

Obviously, this is a potentially breaking change, so can only happen for 2.x

CC davies

Author: Sean Owen <sowen@cloudera.com>

Closes #10771 from srowen/SPARK-7683.
2016-01-19 09:34:49 +00:00
Yanbo Liang 5f843781e3 [SPARK-11925][ML][PYSPARK] Add PySpark missing methods for ml.feature during Spark 1.6 QA
Add PySpark missing methods and params for ml.feature:
* ```RegexTokenizer``` should support setting ```toLowercase```.
* ```MinMaxScalerModel``` should support output ```originalMin``` and ```originalMax```.
* ```PCAModel``` should support output ```pc```.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #9908 from yanboliang/spark-11925.
2016-01-15 15:54:19 -08:00
Herman van Hovell 7cd7f22025 [SPARK-12575][SQL] Grammar parity with existing SQL parser
In this PR the new CatalystQl parser stack reaches grammar parity with the old Parser-Combinator based SQL Parser. This PR also replaces all uses of the old Parser, and removes it from the code base.

Although the existing Hive and SQL parser dialects were mostly the same, some kinks had to be worked out:
- The SQL Parser allowed syntax like ```APPROXIMATE(0.01) COUNT(DISTINCT a)```. In order to make this work we needed to hardcode approximate operators in the parser, or we would have to create an approximate expression. ```APPROXIMATE_COUNT_DISTINCT(a, 0.01)``` would also do the job and is much easier to maintain. So, this PR **removes** this keyword.
- The old SQL Parser supports ```LIMIT``` clauses in nested queries. This is **not supported** anymore. See https://github.com/apache/spark/pull/10689 for the rationale for this.
- Hive has a charset name char set literal combination it supports, for instance the following expression ```_ISO-8859-1 0x4341464562616265``` would yield this string: ```CAFEbabe```. Hive will only allow charset names to start with an underscore. This is quite annoying in spark because as soon as you use a tuple names will start with an underscore. In this PR we **remove** this feature from the parser. It would be quite easy to implement such a feature as an Expression later on.
- Hive and the SQL Parser treat decimal literals differently. Hive will turn any decimal into a ```Double``` whereas the SQL Parser would convert a non-scientific decimal into a ```BigDecimal```, and would turn a scientific decimal into a Double. We follow Hive's behavior here. The new parser supports a big decimal literal, for instance: ```81923801.42BD```, which can be used when a big decimal is needed.

cc rxin viirya marmbrus yhuai cloud-fan

Author: Herman van Hovell <hvanhovell@questtec.nl>

Closes #10745 from hvanhovell/SPARK-12575-2.
2016-01-15 15:19:10 -08:00
Wenchen Fan 962e9bcf94 [SPARK-12756][SQL] use hash expression in Exchange
This PR makes bucketing and exchange share one common hash algorithm, so that we can guarantee the data distribution is same between shuffle and bucketed data source, which enables us to only shuffle one side when join a bucketed table and a normal one.

This PR also fixes the tests that are broken by the new hash behaviour in shuffle.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #10703 from cloud-fan/use-hash-expr-in-shuffle.
2016-01-13 22:43:28 -08:00
Reynold Xin cbbcd8e425 [SPARK-12791][SQL] Simplify CaseWhen by breaking "branches" into "conditions" and "values"
This pull request rewrites CaseWhen expression to break the single, monolithic "branches" field into a sequence of tuples (Seq[(condition, value)]) and an explicit optional elseValue field.

Prior to this pull request, each even position in "branches" represents the condition for each branch, and each odd position represents the value for each branch. The use of them have been pretty confusing with a lot sliding windows or grouped(2) calls.

Author: Reynold Xin <rxin@databricks.com>

Closes #10734 from rxin/simplify-case.
2016-01-13 12:44:35 -08:00
Wenchen Fan c2ea79f96a [SPARK-12642][SQL] improve the hash expression to be decoupled from unsafe row
https://issues.apache.org/jira/browse/SPARK-12642

Author: Wenchen Fan <wenchen@databricks.com>

Closes #10694 from cloud-fan/hash-expr.
2016-01-13 12:29:02 -08:00
Erik Selin e4e0b3f7b2 [SPARK-12268][PYSPARK] Make pyspark shell pythonstartup work under python3
This replaces the `execfile` used for running custom python shell scripts
with explicit open, compile and exec (as recommended by 2to3). The reason
for this change is to make the pythonstartup option compatible with python3.

Author: Erik Selin <erik.selin@gmail.com>

Closes #10255 from tyro89/pythonstartup-python3.
2016-01-13 12:21:45 -08:00
Shixiong Zhu 4f60651cbe [SPARK-12652][PYSPARK] Upgrade Py4J to 0.9.1
- [x] Upgrade Py4J to 0.9.1
- [x] SPARK-12657: Revert SPARK-12617
- [x] SPARK-12658: Revert SPARK-12511
  - Still keep the change that only reading checkpoint once. This is a manual change and worth to take a look carefully. bfd4b5c040
- [x] Verify no leak any more after reverting our workarounds

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #10692 from zsxwing/py4j-0.9.1.
2016-01-12 14:27:05 -08:00
Yanbo Liang ee4ee02b86 [SPARK-12603][MLLIB] PySpark MLlib GaussianMixtureModel should support single instance predict/predictSoft
PySpark MLlib ```GaussianMixtureModel``` should support single instance ```predict/predictSoft``` just like Scala do.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #10552 from yanboliang/spark-12603.
2016-01-11 14:43:25 -08:00
Sean Owen b9c8353378 [SPARK-12618][CORE][STREAMING][SQL] Clean up build warnings: 2.0.0 edition
Fix most build warnings: mostly deprecated API usages. I'll annotate some of the changes below. CC rxin who is leading the charge to remove the deprecated APIs.

Author: Sean Owen <sowen@cloudera.com>

Closes #10570 from srowen/SPARK-12618.
2016-01-08 17:47:44 +00:00
zero323 592f64985d [SPARK-12006][ML][PYTHON] Fix GMM failure if initialModel is not None
If initial model passed to GMM is not empty it causes net.razorvine.pickle.PickleException. It can be fixed by converting initialModel.weights to list.

Author: zero323 <matthew.szymkiewicz@gmail.com>

Closes #10644 from zero323/SPARK-12006.
2016-01-07 10:32:56 -08:00
Yin Huai e5cde7ab11 Revert "[SPARK-12006][ML][PYTHON] Fix GMM failure if initialModel is not None"
This reverts commit fcd013cf70.

Author: Yin Huai <yhuai@databricks.com>

Closes #10632 from yhuai/pythonStyle.
2016-01-06 22:03:31 -08:00
Shixiong Zhu 1e6648d62f [SPARK-12617][PYSPARK] Move Py4jCallbackConnectionCleaner to Streaming
Move Py4jCallbackConnectionCleaner to Streaming because the callback server starts only in StreamingContext.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #10621 from zsxwing/SPARK-12617-2.
2016-01-06 12:03:01 -08:00
zero323 fcd013cf70 [SPARK-12006][ML][PYTHON] Fix GMM failure if initialModel is not None
If initial model passed to GMM is not empty it causes `net.razorvine.pickle.PickleException`. It can be fixed by converting `initialModel.weights` to `list`.

Author: zero323 <matthew.szymkiewicz@gmail.com>

Closes #9986 from zero323/SPARK-12006.
2016-01-06 11:58:33 -08:00
Yanbo Liang 3aa3488225 [SPARK-11815][ML][PYSPARK] PySpark DecisionTreeClassifier & DecisionTreeRegressor should support setSeed
PySpark ```DecisionTreeClassifier``` & ```DecisionTreeRegressor``` should support ```setSeed``` like what we do at Scala side.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #9807 from yanboliang/spark-11815.
2016-01-06 10:52:25 -08:00
Yanbo Liang 95eb651633 [SPARK-11945][ML][PYSPARK] Add computeCost to KMeansModel for PySpark spark.ml
Add ```computeCost``` to ```KMeansModel``` as evaluator for PySpark spark.ml.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #9931 from yanboliang/SPARK-11945.
2016-01-06 10:50:02 -08:00
Joshi 007da1a9dc [SPARK-11531][ML] SparseVector error Msg
PySpark SparseVector should have "Found duplicate indices" error message

Author: Joshi <rekhajoshm@gmail.com>
Author: Rekha Joshi <rekhajoshm@gmail.com>

Closes #9525 from rekhajoshm/SPARK-11531.
2016-01-06 10:48:14 -08:00
Holden Karau 3b29004d24 [SPARK-7675][ML][PYSPARK] sparkml params type conversion
From JIRA:
Currently, PySpark wrappers for spark.ml Scala classes are brittle when accepting Param types. E.g., Normalizer's "p" param cannot be set to "2" (an integer); it must be set to "2.0" (a float). Fixing this is not trivial since there does not appear to be a natural place to insert the conversion before Python wrappers call Java's Params setter method.

A possible fix will be to include a method "_checkType" to PySpark's Param class which checks the type, prints an error if needed, and converts types when relevant (e.g., int to float, or scipy matrix to array). The Java wrapper method which copies params to Scala can call this method when available.

This fix instead checks the types at set time since I think failing sooner is better, but I can switch it around to check at copy time if that would be better. So far this only converts int to float and other conversions (like scipymatrix to array) are left for the future.

Author: Holden Karau <holden@us.ibm.com>

Closes #9581 from holdenk/SPARK-7675-PySpark-sparkml-Params-type-conversion.
2016-01-06 10:43:03 -08:00
Kai Jiang 1537e55604 [SPARK-12041][ML][PYSPARK] Add columnSimilarities to IndexedRowMatrix
Add `columnSimilarities` to IndexedRowMatrix for PySpark spark.mllib.linalg.

Author: Kai Jiang <jiangkai@gmail.com>

Closes #10158 from vectorijk/spark-12041.
2016-01-05 15:33:27 -08:00
Shixiong Zhu 6cfe341ee8 [SPARK-12511] [PYSPARK] [STREAMING] Make sure PythonDStream.registerSerializer is called only once
There is an issue that Py4J's PythonProxyHandler.finalize blocks forever. (https://github.com/bartdag/py4j/pull/184)

Py4j will create a PythonProxyHandler in Java for "transformer_serializer" when calling "registerSerializer". If we call "registerSerializer" twice, the second PythonProxyHandler will override the first one, then the first one will be GCed and trigger "PythonProxyHandler.finalize". To avoid that, we should not call"registerSerializer" more than once, so that "PythonProxyHandler" in Java side won't be GCed.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #10514 from zsxwing/SPARK-12511.
2016-01-05 13:48:47 -08:00
Shixiong Zhu 047a31bb10 [SPARK-12617] [PYSPARK] Clean up the leak sockets of Py4J
This patch added Py4jCallbackConnectionCleaner to clean the leak sockets of Py4J every 30 seconds. This is a workaround before Py4J fixes the leak issue https://github.com/bartdag/py4j/issues/187

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #10579 from zsxwing/SPARK-12617.
2016-01-05 13:10:46 -08:00
Wenchen Fan 76768337be [SPARK-12480][FOLLOW-UP] use a single column vararg for hash
address comments in #10435

This makes the API easier to use if user programmatically generate the call to hash, and they will get analysis exception if the arguments of hash is empty.

Author: Wenchen Fan <wenchen@databricks.com>

Closes #10588 from cloud-fan/hash.
2016-01-05 10:23:36 -08:00
Reynold Xin 77ab49b857 [SPARK-12600][SQL] Remove deprecated methods in Spark SQL
Author: Reynold Xin <rxin@databricks.com>

Closes #10559 from rxin/remove-deprecated-sql.
2016-01-04 18:02:38 -08:00
Holden Karau 13dab9c386 [SPARK-12611][SQL][PYSPARK][TESTS] Fix test_infer_schema_to_local
Previously (when the PR was first created) not specifying b= explicitly was fine (and treated as default null) - instead be explicit about b being None in the test.

Author: Holden Karau <holden@us.ibm.com>

Closes #10564 from holdenk/SPARK-12611-fix-test-infer-schema-local.
2016-01-03 17:04:35 -08:00
Cazen b8410ff9ce [SPARK-12537][SQL] Add option to accept quoting of all character backslash quoting mechanism
We can provides the option to choose JSON parser can be enabled to accept quoting of all character or not.

Author: Cazen <Cazen@korea.com>
Author: Cazen Lee <cazen.lee@samsung.com>
Author: Cazen Lee <Cazen@korea.com>
Author: cazen.lee <cazen.lee@samsung.com>

Closes #10497 from Cazen/master.
2016-01-03 17:01:19 -08:00
Holden Karau d1ca634db4 [SPARK-12300] [SQL] [PYSPARK] fix schema inferance on local collections
Current schema inference for local python collections halts as soon as there are no NullTypes. This is different than when we specify a sampling ratio of 1.0 on a distributed collection. This could result in incomplete schema information.

Author: Holden Karau <holden@us.ibm.com>

Closes #10275 from holdenk/SPARK-12300-fix-schmea-inferance-on-local-collections.
2015-12-30 11:14:47 -08:00
jerryshao 8d49400921 [SPARK-12353][STREAMING][PYSPARK] Fix countByValue inconsistent output in Python API
The semantics of Python countByValue is different from Scala API, it is more like countDistinctValue, so here change to make it consistent with Scala/Java API.

Author: jerryshao <sshao@hortonworks.com>

Closes #10350 from jerryshao/SPARK-12353.
2015-12-28 10:43:23 +00:00
gatorsmile 9ab296ecdc [SPARK-12520] [PYSPARK] Correct Descriptions and Add Use Cases in Equi-Join
After reading the JIRA https://issues.apache.org/jira/browse/SPARK-12520, I double checked the code.

For example, users can do the Equi-Join like
  ```df.join(df2, 'name', 'outer').select('name', 'height').collect()```
- There exists a bug in 1.5 and 1.4. The code just ignores the third parameter (join type) users pass. However, the join type we called is `Inner`, even if the user-specified type is the other type (e.g., `Outer`).
- After a PR: https://github.com/apache/spark/pull/8600, the 1.6 does not have such an issue, but the description has not been updated.

Plan to submit another PR to fix 1.5 and issue an error message if users specify a non-inner join type when using Equi-Join.

Author: gatorsmile <gatorsmile@gmail.com>

Closes #10477 from gatorsmile/pyOuterJoin.
2015-12-27 23:18:48 -08:00
Holden Karau 969d5665bb [SPARK-12296][PYSPARK][MLLIB] Feature parity for pyspark mllib standard scaler model
Some methods are missing, such as ways to access the std, mean, etc. This PR is for feature parity for pyspark.mllib.feature.StandardScaler & StandardScalerModel.

Author: Holden Karau <holden@us.ibm.com>

Closes #10298 from holdenk/SPARK-12296-feature-parity-pyspark-mllib-StandardScalerModel.
2015-12-22 09:14:12 +02:00
pshearer fc6dbcc703 Doc typo: ltrim = trim from left end, not right
Author: pshearer <pshearer@massmutual.com>

Closes #10414 from pshearer/patch-1.
2015-12-21 14:04:59 -08:00
Jeff Zhang 1920d72a1f [PYSPARK] Pyspark typo & Add missing abstractmethod annotation
No jira is created since this is a trivial change.

davies  Please help review it

Author: Jeff Zhang <zjffdu@apache.org>

Closes #10143 from zjffdu/pyspark_typo.
2015-12-21 08:53:46 -08:00
Bryan Cutler ce1798b3af [SPARK-10158][PYSPARK][MLLIB] ALS better error message when using Long IDs
Added catch for casting Long to Int exception when PySpark ALS Ratings are serialized.  It is easy to accidentally use Long IDs for user/product and before, it would fail with a somewhat cryptic "ClassCastException: java.lang.Long cannot be cast to java.lang.Integer."  Now if this is done, a more descriptive error is shown, e.g. "PickleException: Ratings id 1205640308657491975 exceeds max integer value of 2147483647."

Author: Bryan Cutler <bjcutler@us.ibm.com>

Closes #9361 from BryanCutler/als-pyspark-long-id-error-SPARK-10158.
2015-12-20 09:08:23 +00:00
Yanbo Liang a073a73a56 [SQL] Fix mistake doc of join type for dataframe.join
Fix mistake doc of join type for ```dataframe.join```.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #10378 from yanboliang/leftsemi.
2015-12-19 00:34:30 -08:00
gatorsmile 499ac3e69a [SPARK-12091] [PYSPARK] Deprecate the JAVA-specific deserialized storage levels
The current default storage level of Python persist API is MEMORY_ONLY_SER. This is different from the default level MEMORY_ONLY in the official document and RDD APIs.

davies Is this inconsistency intentional? Thanks!

Updates: Since the data is always serialized on the Python side, the storage levels of JAVA-specific deserialization are not removed, such as MEMORY_ONLY.

Updates: Based on the reviewers' feedback. In Python, stored objects will always be serialized with the [Pickle](https://docs.python.org/2/library/pickle.html) library, so it does not matter whether you choose a serialized level. The available storage levels in Python include `MEMORY_ONLY`, `MEMORY_ONLY_2`, `MEMORY_AND_DISK`, `MEMORY_AND_DISK_2`, `DISK_ONLY`, `DISK_ONLY_2` and `OFF_HEAP`.

Author: gatorsmile <gatorsmile@gmail.com>

Closes #10092 from gatorsmile/persistStorageLevel.
2015-12-18 20:06:05 -08:00
Yanbo Liang 6e0771665b [SQL] Update SQLContext.read.text doc
Since we rename the column name from ```text``` to ```value``` for DataFrame load by ```SQLContext.read.text```, we need to update doc.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #10349 from yanboliang/text-value.
2015-12-17 09:19:46 -08:00
David Tolpin 437583f692 [SPARK-11904][PYSPARK] reduceByKeyAndWindow does not require checkpointing when invFunc is None
when invFunc is None, `reduceByKeyAndWindow(func, None, winsize, slidesize)` is equivalent to

     reduceByKey(func).window(winsize, slidesize).reduceByKey(winsize, slidesize)

and no checkpoint is necessary. The corresponding Scala code does exactly that, but Python code always creates a windowed stream with obligatory checkpointing. The patch fixes this.

I do not know how to unit-test this.

Author: David Tolpin <david.tolpin@gmail.com>

Closes #9888 from dtolpin/master.
2015-12-16 22:10:24 -08:00
Davies Liu 27b98e99d2 [SPARK-12380] [PYSPARK] use SQLContext.getOrCreate in mllib
MLlib should use SQLContext.getOrCreate() instead of creating new SQLContext.

Author: Davies Liu <davies@databricks.com>

Closes #10338 from davies/create_context.
2015-12-16 15:48:11 -08:00
Martin Menestret 3a44aebd0c [SPARK-9690][ML][PYTHON] pyspark CrossValidator random seed
Extend CrossValidator with HasSeed in PySpark.

This PR replaces [https://github.com/apache/spark/pull/7997]

CC: yanboliang thunterdb mmenestret  Would one of you mind taking a look?  Thanks!

Author: Joseph K. Bradley <joseph@databricks.com>
Author: Martin MENESTRET <mmenestret@ippon.fr>

Closes #10268 from jkbradley/pyspark-cv-seed.
2015-12-16 14:05:35 -08:00
Jeff Zhang 6a880afa83 [SPARK-12361][PYSPARK][TESTS] Should set PYSPARK_DRIVER_PYTHON before Python tests
Although this patch still doesn't solve the issue why the return code is 0 (see JIRA description), it resolves the issue of python version mismatch.

Author: Jeff Zhang <zjffdu@apache.org>

Closes #10322 from zjffdu/SPARK-12361.
2015-12-16 11:29:51 -08:00
Liang-Chi Hsieh b51a4cdff3 [SPARK-12016] [MLLIB] [PYSPARK] Wrap Word2VecModel when loading it in pyspark
JIRA: https://issues.apache.org/jira/browse/SPARK-12016

We should not directly use Word2VecModel in pyspark. We need to wrap it in a Word2VecModelWrapper when loading it in pyspark.

Author: Liang-Chi Hsieh <viirya@appier.com>

Closes #10100 from viirya/fix-load-py-wordvecmodel.
2015-12-14 09:59:42 -08:00
Bryan Cutler 6a6c1fc5c8 [SPARK-11713] [PYSPARK] [STREAMING] Initial RDD updateStateByKey for PySpark
Adding ability to define an initial state RDD for use with updateStateByKey PySpark.  Added unit test and changed stateful_network_wordcount example to use initial RDD.

Author: Bryan Cutler <bjcutler@us.ibm.com>

Closes #10082 from BryanCutler/initial-rdd-updateStateByKey-SPARK-11713.
2015-12-10 14:21:15 -08:00
Cheng Lian 6e1c55eac4 [SPARK-12012][SQL] Show more comprehensive PhysicalRDD metadata when visualizing SQL query plan
This PR adds a `private[sql]` method `metadata` to `SparkPlan`, which can be used to describe detail information about a physical plan during visualization. Specifically, this PR uses this method to provide details of `PhysicalRDD`s translated from a data source relation. For example, a `ParquetRelation` converted from Hive metastore table `default.psrc` is now shown as the following screenshot:

![image](https://cloud.githubusercontent.com/assets/230655/11526657/e10cb7e6-9916-11e5-9afa-f108932ec890.png)

And here is the screenshot for a regular `ParquetRelation` (not converted from Hive metastore table) loaded from a really long path:

![output](https://cloud.githubusercontent.com/assets/230655/11680582/37c66460-9e94-11e5-8f50-842db5309d5a.png)

Author: Cheng Lian <lian@databricks.com>

Closes #10004 from liancheng/spark-12012.physical-rdd-metadata.
2015-12-09 23:30:42 +08:00
Andrew Ray 36282f78b8 [SPARK-12184][PYTHON] Make python api doc for pivot consistant with scala doc
In SPARK-11946 the API for pivot was changed a bit and got updated doc, the doc changes were not made for the python api though. This PR updates the python doc to be consistent.

Author: Andrew Ray <ray.andrew@gmail.com>

Closes #10176 from aray/sql-pivot-python-doc.
2015-12-07 15:01:00 -08:00
Davies Liu ef3f047c07 [SPARK-12132] [PYSPARK] raise KeyboardInterrupt inside SIGINT handler
Currently, the current line is not cleared by Cltr-C

After this patch
```
>>> asdfasdf^C
Traceback (most recent call last):
  File "~/spark/python/pyspark/context.py", line 225, in signal_handler
    raise KeyboardInterrupt()
KeyboardInterrupt
```

It's still worse than 1.5 (and before).

Author: Davies Liu <davies@databricks.com>

Closes #10134 from davies/fix_cltrc.
2015-12-07 11:00:25 -08:00
Burak Yavuz 302d68de87 [SPARK-12058][STREAMING][KINESIS][TESTS] fix Kinesis python tests
Python tests require access to the `KinesisTestUtils` file. When this file exists under src/test, python can't access it, since it is not available in the assembly jar.

However, if we move KinesisTestUtils to src/main, we need to add the KinesisProducerLibrary as a dependency. In order to avoid this, I moved KinesisTestUtils to src/main, and extended it with ExtendedKinesisTestUtils which is under src/test that adds support for the KPL.

cc zsxwing tdas

Author: Burak Yavuz <brkyvz@gmail.com>

Closes #10050 from brkyvz/kinesis-py.
2015-12-04 12:08:42 -08:00
Yanbo Liang d576e76bba [MINOR][ML] Use coefficients replace weights
Use ```coefficients``` replace ```weights```, I wish they are the last two.
mengxr

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #10065 from yanboliang/coefficients.
2015-12-03 11:37:34 -08:00
Davies Liu 4375eb3f48 [SPARK-12090] [PYSPARK] consider shuffle in coalesce()
Author: Davies Liu <davies@databricks.com>

Closes #10090 from davies/fix_coalesce.
2015-12-01 22:41:48 -08:00
jerryshao f292018f8e [SPARK-12002][STREAMING][PYSPARK] Fix python direct stream checkpoint recovery issue
Fixed a minor race condition in #10017

Closes #10017

Author: jerryshao <sshao@hortonworks.com>
Author: Shixiong Zhu <shixiong@databricks.com>

Closes #10074 from zsxwing/review-pr10017.
2015-12-01 15:26:10 -08:00
Shixiong Zhu edb26e7f4e [SPARK-12058][HOTFIX] Disable KinesisStreamTests
KinesisStreamTests in test.py is broken because of #9403. See https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/46896/testReport/(root)/KinesisStreamTests/test_kinesis_stream/

Because Streaming Python didn’t work when merging https://github.com/apache/spark/pull/9403, the PR build didn’t report the Python test failure actually.

This PR just disabled the test to unblock #10039

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #10047 from zsxwing/disable-python-kinesis-test.
2015-11-30 16:31:59 -08:00
Jeff Zhang d8220885c4 [SPARK-11917][PYSPARK] Add SQLContext#dropTempTable to PySpark
Author: Jeff Zhang <zjffdu@apache.org>

Closes #9903 from zjffdu/SPARK-11917.
2015-11-26 19:15:22 -08:00
gatorsmile 068b6438d6 [SPARK-11980][SPARK-10621][SQL] Fix json_tuple and add test cases for
Added Python test cases for the function `isnan`, `isnull`, `nanvl` and `json_tuple`.

Fixed a bug in the function `json_tuple`

rxin , could you help me review my changes? Please let me know anything is missing.

Thank you! Have a good Thanksgiving day!

Author: gatorsmile <gatorsmile@gmail.com>

Closes #9977 from gatorsmile/json_tuple.
2015-11-25 23:24:33 -08:00
Shixiong Zhu d29e2ef4cf [SPARK-11935][PYSPARK] Send the Python exceptions in TransformFunction and TransformFunctionSerializer to Java
The Python exception track in TransformFunction and TransformFunctionSerializer is not sent back to Java. Py4j just throws a very general exception, which is hard to debug.

This PRs adds `getFailure` method to get the failure message in Java side.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #9922 from zsxwing/SPARK-11935.
2015-11-25 11:47:21 -08:00
Davies Liu dc1d324fdf [SPARK-11969] [SQL] [PYSPARK] visualization of SQL query for pyspark
Currently, we does not have visualization for SQL query from Python, this PR fix that.

cc zsxwing

Author: Davies Liu <davies@databricks.com>

Closes #9949 from davies/pyspark_sql_ui.
2015-11-25 11:11:39 -08:00
felixcheung faabdfa2bd [SPARK-11984][SQL][PYTHON] Fix typos in doc for pivot for scala and python
Author: felixcheung <felixcheung_m@hotmail.com>

Closes #9967 from felixcheung/pypivotdoc.
2015-11-25 10:36:35 -08:00
Jeff Zhang b9b6fbe89b [SPARK-11860][PYSAPRK][DOCUMENTATION] Invalid argument specification …
…for registerFunction [Python]

Straightforward change on the python doc

Author: Jeff Zhang <zjffdu@apache.org>

Closes #9901 from zjffdu/SPARK-11860.
2015-11-25 13:49:58 +00:00
Reynold Xin 151d7c2baf [SPARK-10621][SQL] Consistent naming for functions in SQL, Python, Scala
Author: Reynold Xin <rxin@databricks.com>

Closes #9948 from rxin/SPARK-10621.
2015-11-24 21:30:53 -08:00
Reynold Xin 25bbd3c16e [SPARK-11967][SQL] Consistent use of varargs for multiple paths in DataFrameReader
This patch makes it consistent to use varargs in all DataFrameReader methods, including Parquet, JSON, text, and the generic load function.

Also added a few more API tests for the Java API.

Author: Reynold Xin <rxin@databricks.com>

Closes #9945 from rxin/SPARK-11967.
2015-11-24 18:16:07 -08:00
Reynold Xin f315272279 [SPARK-11946][SQL] Audit pivot API for 1.6.
Currently pivot's signature looks like

```scala
scala.annotation.varargs
def pivot(pivotColumn: Column, values: Column*): GroupedData

scala.annotation.varargs
def pivot(pivotColumn: String, values: Any*): GroupedData
```

I think we can remove the one that takes "Column" types, since callers should always be passing in literals. It'd also be more clear if the values are not varargs, but rather Seq or java.util.List.

I also made similar changes for Python.

Author: Reynold Xin <rxin@databricks.com>

Closes #9929 from rxin/SPARK-11946.
2015-11-24 12:54:37 -08:00
Bryan Cutler 105745645b [SPARK-10560][PYSPARK][MLLIB][DOCS] Make StreamingLogisticRegressionWithSGD Python API equal to Scala one
This is to bring the API documentation of StreamingLogisticReressionWithSGD and StreamingLinearRegressionWithSGC in line with the Scala versions.

-Fixed the algorithm descriptions
-Added default values to parameter descriptions
-Changed StreamingLogisticRegressionWithSGD regParam to default to 0, as in the Scala version

Author: Bryan Cutler <bjcutler@us.ibm.com>

Closes #9141 from BryanCutler/StreamingLogisticRegressionWithSGD-python-api-sync.
2015-11-23 17:11:51 -08:00
Davies Liu 1d91202010 [SPARK-11836][SQL] udf/cast should not create new SQLContext
They should use the existing SQLContext.

Author: Davies Liu <davies@databricks.com>

Closes #9914 from davies/create_udf.
2015-11-23 13:44:30 -08:00
Shixiong Zhu be7a2cfd97 [SPARK-11870][STREAMING][PYSPARK] Rethrow the exceptions in TransformFunction and TransformFunctionSerializer
TransformFunction and TransformFunctionSerializer don't rethrow the exception, so when any exception happens, it just return None. This will cause some weird NPE and confuse people.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #9847 from zsxwing/pyspark-streaming-exception.
2015-11-20 14:23:01 -08:00
Yanbo Liang 7216f40545 [SPARK-11875][ML][PYSPARK] Update doc for PySpark HasCheckpointInterval
* Update doc for PySpark ```HasCheckpointInterval``` that users can understand how to disable checkpoint.
* Update doc for PySpark ```cacheNodeIds``` of ```DecisionTreeParams``` to notify the relationship between ```cacheNodeIds``` and ```checkpointInterval```.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #9856 from yanboliang/spark-11875.
2015-11-19 22:14:01 -08:00
David Tolpin 599a8c6e2b [SPARK-11812][PYSPARK] invFunc=None works properly with python's reduceByKeyAndWindow
invFunc is optional and can be None. Instead of invFunc (the parameter) invReduceFunc (a local function) was checked for trueness (that is, not None, in this context). A local function is never None,
thus the case of invFunc=None (a common one when inverse reduction is not defined) was treated incorrectly, resulting in loss of data.

In addition, the docstring used wrong parameter names, also fixed.

Author: David Tolpin <david.tolpin@gmail.com>

Closes #9775 from dtolpin/master.
2015-11-19 13:57:23 -08:00
Yanbo Liang 603a721c21 [SPARK-11820][ML][PYSPARK] PySpark LiR & LoR should support weightCol
[SPARK-7685](https://issues.apache.org/jira/browse/SPARK-7685) and [SPARK-9642](https://issues.apache.org/jira/browse/SPARK-9642) have already supported setting weight column for ```LogisticRegression``` and ```LinearRegression```. It's a very important feature, PySpark should also support. mengxr

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #9811 from yanboliang/spark-11820.
2015-11-18 13:32:06 -08:00
JihongMa 09ad9533d5 [SPARK-11720][SQL][ML] Handle edge cases when count = 0 or 1 for Stats function
return Double.NaN for mean/average when count == 0 for all numeric types that is converted to Double, Decimal type continue to return null.

Author: JihongMa <linlin200605@gmail.com>

Closes #9705 from JihongMA/SPARK-11720.
2015-11-18 13:03:37 -08:00
Jeff Zhang 3a6807fdf0 [SPARK-11804] [PYSPARK] Exception raise when using Jdbc predicates opt…
…ion in PySpark

Author: Jeff Zhang <zjffdu@apache.org>

Closes #9791 from zjffdu/SPARK-11804.
2015-11-18 08:18:54 -08:00
jerryshao 75a2922910 [SPARK-9065][STREAMING][PYSPARK] Add MessageHandler for Kafka Python API
Fixed the merge conflicts in #7410

Closes #7410

Author: Shixiong Zhu <shixiong@databricks.com>
Author: jerryshao <saisai.shao@intel.com>
Author: jerryshao <sshao@hortonworks.com>

Closes #9742 from zsxwing/pr7410.
2015-11-17 16:57:52 -08:00
Shixiong Zhu 928d631625 [SPARK-11740][STREAMING] Fix the race condition of two checkpoints in a batch
We will do checkpoint when generating a batch and completing a batch. When the processing time of a batch is greater than the batch interval, checkpointing for completing an old batch may run after checkpointing for generating a new batch. If this happens, checkpoint of an old batch actually has the latest information, so we want to recovery from it. This PR will use the latest checkpoint time as the file name, so that we can always recovery from the latest checkpoint file.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #9707 from zsxwing/fix-checkpoint.
2015-11-17 14:48:29 -08:00
Daniel Jalova ace0db4714 [SPARK-6328][PYTHON] Python API for StreamingListener
Author: Daniel Jalova <djalova@us.ibm.com>

Closes #9186 from djalova/SPARK-6328.
2015-11-16 11:29:27 -08:00
Reynold Xin 42de5253f3 [SPARK-11745][SQL] Enable more JSON parsing options
This patch adds the following options to the JSON data source, for dealing with non-standard JSON files:
* `allowComments` (default `false`): ignores Java/C++ style comment in JSON records
* `allowUnquotedFieldNames` (default `false`): allows unquoted JSON field names
* `allowSingleQuotes` (default `true`): allows single quotes in addition to double quotes
* `allowNumericLeadingZeros` (default `false`): allows leading zeros in numbers (e.g. 00012)

To avoid passing a lot of options throughout the json package, I introduced a new JSONOptions case class to define all JSON config options.

Also updated documentation to explain these options.

Scala

![screen shot 2015-11-15 at 6 12 12 pm](https://cloud.githubusercontent.com/assets/323388/11172965/e3ace6ec-8bc4-11e5-805e-2d78f80d0ed6.png)

Python

![screen shot 2015-11-15 at 6 11 28 pm](https://cloud.githubusercontent.com/assets/323388/11172964/e23ed6ee-8bc4-11e5-8216-312f5983acd5.png)

Author: Reynold Xin <rxin@databricks.com>

Closes #9724 from rxin/SPARK-11745.
2015-11-16 00:06:14 -08:00
Andrew Ray a24477996e [SPARK-11690][PYSPARK] Add pivot to python api
This PR adds pivot to the python api of GroupedData with the same syntax as Scala/Java.

Author: Andrew Ray <ray.andrew@gmail.com>

Closes #9653 from aray/sql-pivot-python.
2015-11-13 10:31:17 -08:00
Shixiong Zhu ec80c0c2fc [SPARK-11706][STREAMING] Fix the bug that Streaming Python tests cannot report failures
This PR just checks the test results and returns 1 if the test fails, so that `run-tests.py` can mark it fail.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #9669 from zsxwing/streaming-python-tests.
2015-11-13 00:30:27 -08:00
Chris Snow 68ef61bb65 [SPARK-11658] simplify documentation for PySpark combineByKey
Author: Chris Snow <chsnow123@gmail.com>

Closes #9640 from snowch/patch-3.
2015-11-12 15:50:47 -08:00
Chris Snow 380dfcc0dc [SPARK-11671] documentation code example typo
Example for sqlContext.createDataDrame from pandas.DataFrame has a typo

Author: Chris Snow <chsnow123@gmail.com>

Closes #9639 from snowch/patch-2.
2015-11-12 15:42:30 -08:00
JihongMa d292f74831 [SPARK-11420] Updating Stddev support via Imperative Aggregate
switched stddev support from DeclarativeAggregate to ImperativeAggregate.

Author: JihongMa <linlin200605@gmail.com>

Closes #9380 from JihongMA/SPARK-11420.
2015-11-12 13:47:34 -08:00
Davies Liu bd70244b3c [SPARK-11463] [PYSPARK] only install signal in main thread
Only install signal in main thread, or it will fail to create context in not-main thread.

Author: Davies Liu <davies@databricks.com>

Closes #9574 from davies/python_signal.
2015-11-10 22:46:17 -08:00
Yu ISHIKAWA c0e48dfa61 [SPARK-11566] [MLLIB] [PYTHON] Refactoring GaussianMixtureModel.gaussians in Python
cc jkbradley

Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #9534 from yu-iskw/SPARK-11566.
2015-11-10 16:42:28 -08:00
felixcheung 32790fe724 [SPARK-11567] [PYTHON] Add Python API for corr Aggregate function
like `df.agg(corr("col1", "col2")`

davies

Author: felixcheung <felixcheung_m@hotmail.com>

Closes #9536 from felixcheung/pyfunc.
2015-11-10 15:47:10 -08:00
Yin Huai e0701c7560 [SPARK-9830][SQL] Remove AggregateExpression1 and Aggregate Operator used to evaluate AggregateExpression1s
https://issues.apache.org/jira/browse/SPARK-9830

This PR contains the following main changes.
* Removing `AggregateExpression1`.
* Removing `Aggregate` operator, which is used to evaluate `AggregateExpression1`.
* Removing planner rule used to plan `Aggregate`.
* Linking `MultipleDistinctRewriter` to analyzer.
* Renaming `AggregateExpression2` to `AggregateExpression` and `AggregateFunction2` to `AggregateFunction`.
* Updating places where we create aggregate expression. The way to create aggregate expressions is `AggregateExpression(aggregateFunction, mode, isDistinct)`.
* Changing `val`s in `DeclarativeAggregate`s that touch children of this function to `lazy val`s (when we create aggregate expression in DataFrame API, children of an aggregate function can be unresolved).

Author: Yin Huai <yhuai@databricks.com>

Closes #9556 from yhuai/removeAgg1.
2015-11-10 11:06:29 -08:00
Yu ISHIKAWA 7dc9d8dba6 [SPARK-11610][MLLIB][PYTHON][DOCS] Make the docs of LDAModel.describeTopics in Python more specific
cc jkbradley

Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #9577 from yu-iskw/SPARK-11610.
2015-11-09 16:25:29 -08:00
Nick Buroojy f138cb8733 [SPARK-9301][SQL] Add collect_set and collect_list aggregate functions
For now they are thin wrappers around the corresponding Hive UDAFs.

One limitation with these in Hive 0.13.0 is they only support aggregating primitive types.

I chose snake_case here instead of camelCase because it seems to be used in the majority of the multi-word fns.

Do we also want to add these to `functions.py`?

This approach was recommended here: https://github.com/apache/spark/pull/8592#issuecomment-154247089

marmbrus rxin

Author: Nick Buroojy <nick.buroojy@civitaslearning.com>

Closes #9526 from nburoojy/nick/udaf-alias.

(cherry picked from commit a6ee4f989d)
Signed-off-by: Michael Armbrust <michael@databricks.com>
2015-11-09 14:30:52 -08:00
Yu ISHIKAWA 88a3fdcc78 [SPARK-10280][MLLIB][PYSPARK][DOCS] Add @since annotation to pyspark.ml.classification
Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #8690 from yu-iskw/SPARK-10280.
2015-11-09 13:16:04 -08:00
Yu ISHIKAWA 2ff0e79a86 [SPARK-8467] [MLLIB] [PYSPARK] Add LDAModel.describeTopics() in Python
Could jkbradley and davies review it?

- Create a wrapper class: `LDAModelWrapper` for `LDAModel`. Because we can't deal with the return value of`describeTopics` in Scala from pyspark directly. `Array[(Array[Int], Array[Double])]` is too complicated to convert it.
- Add `loadLDAModel` in `PythonMLlibAPI`. Since `LDAModel` in Scala is an abstract class and we need to call `load` of `DistributedLDAModel`.

[[SPARK-8467] Add LDAModel.describeTopics() in Python - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-8467)

Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #8643 from yu-iskw/SPARK-8467-2.
2015-11-06 22:56:29 -08:00
Michael Armbrust 105732dcc6 [HOTFIX] Fix python tests after #9527
#9527 missed updating the python tests.

Author: Michael Armbrust <michael@databricks.com>

Closes #9533 from marmbrus/hotfixTextValue.
2015-11-06 17:22:30 -08:00
Nong Li 1ab72b0860 [SPARK-11410] [PYSPARK] Add python bindings for repartition and sortW…
…ithinPartitions.

Author: Nong Li <nong@databricks.com>

Closes #9504 from nongli/spark-11410.
2015-11-06 15:48:20 -08:00
Imran Rashid 49f1a82037 [SPARK-10116][CORE] XORShiftRandom.hashSeed is random in high bits
https://issues.apache.org/jira/browse/SPARK-10116

This is really trivial, just happened to notice it -- if `XORShiftRandom.hashSeed` is really supposed to have random bits throughout (as the comment implies), it needs to do something for the conversion to `long`.

mengxr mkolod

Author: Imran Rashid <irashid@cloudera.com>

Closes #8314 from squito/SPARK-10116.
2015-11-06 20:06:24 +00:00
Yanbo Liang 9da7ceed81 [SPARK-11473][ML] R-like summary statistics with intercept for OLS via normal equation solver
Follow up [SPARK-9836](https://issues.apache.org/jira/browse/SPARK-9836), we should also support summary statistics for ```intercept```.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #9485 from yanboliang/spark-11473.
2015-11-05 09:56:18 -08:00
Yanbo Liang 2e86cf1b01 [SPARK-11527][ML][PYSPARK] PySpark AFTSurvivalRegressionModel should expose coefficients/intercept/scale
PySpark ```AFTSurvivalRegressionModel``` should expose coefficients/intercept/scale. mengxr vectorijk

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #9492 from yanboliang/spark-11527.
2015-11-05 09:00:03 -08:00
Nick Evans 859dff56eb [SPARK-11378][STREAMING] make StreamingContext.awaitTerminationOrTimeout return properly
This adds a failing test checking that `awaitTerminationOrTimeout` returns the expected value, and then fixes that failing test with the addition of a `return`.

tdas zsxwing

Author: Nick Evans <me@nicolasevans.org>

Closes #9336 from manygrams/fix_await_termination_or_timeout.
2015-11-05 09:18:20 +00:00
Yu ISHIKAWA 411ff6afb4 [SPARK-10028][MLLIB][PYTHON] Add Python API for PrefixSpan
Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #9469 from yu-iskw/SPARK-10028.
2015-11-04 15:28:19 -08:00
Reynold Xin 5051262d4c [SPARK-11489][SQL] Only include common first order statistics in GroupedData
We added a bunch of higher order statistics such as skewness and kurtosis to GroupedData. I don't think they are common enough to justify being listed, since users can always use the normal statistics aggregate functions.

That is to say, after this change, we won't support
```scala
df.groupBy("key").kurtosis("colA", "colB")
```

However, we will still support
```scala
df.groupBy("key").agg(kurtosis(col("colA")), kurtosis(col("colB")))
```

Author: Reynold Xin <rxin@databricks.com>

Closes #9446 from rxin/SPARK-11489.
2015-11-03 16:27:56 -08:00
Davies Liu 1d04dc95c0 [SPARK-11467][SQL] add Python API for stddev/variance
Add Python API for stddev/stddev_pop/stddev_samp/variance/var_pop/var_samp/skewness/kurtosis

Author: Davies Liu <davies@databricks.com>

Closes #9424 from davies/py_var.
2015-11-03 13:33:46 -08:00
vectorijk c020f7d9d4 [SPARK-10592] [ML] [PySpark] Deprecate weights and use coefficients instead in ML models
Deprecated in `LogisticRegression` and `LinearRegression`

Author: vectorijk <jiangkai@gmail.com>

Closes #9311 from vectorijk/spark-10592.
2015-11-02 16:12:04 -08:00
lihao ecfb3e73fd [SPARK-10286][ML][PYSPARK][DOCS] Add @since annotation to pyspark.ml.param and pyspark.ml.*
Author: lihao <lihaowhu@gmail.com>

Closes #9275 from lidinghao/SPARK-10286.
2015-11-02 16:09:22 -08:00
Xiangrui Meng 33ae7a35da [SPARK-11358][MLLIB] deprecate runs in k-means
This PR deprecates `runs` in k-means. `runs` introduces extra complexity and overhead in MLlib's k-means implementation. I haven't seen much usage with `runs` not equal to `1`. We don't have a unit test for it either. We can deprecate this method in 1.6, and void it in 1.7. It helps us simplify the implementation.

cc: srowen

Author: Xiangrui Meng <meng@databricks.com>

Closes #9322 from mengxr/SPARK-11358.
2015-11-02 13:42:16 -08:00
Jason White f92f334ca4 [SPARK-11437] [PYSPARK] Don't .take when converting RDD to DataFrame with provided schema
When creating a DataFrame from an RDD in PySpark, `createDataFrame` calls `.take(10)` to verify the first 10 rows of the RDD match the provided schema. Similar to https://issues.apache.org/jira/browse/SPARK-8070, but that issue affected cases where a schema was not provided.

Verifying the first 10 rows is of limited utility and causes the DAG to be executed non-lazily. If necessary, I believe this verification should be done lazily on all rows. However, since the caller is providing a schema to follow, I think it's acceptable to simply fail if the schema is incorrect.

marmbrus We chatted about this at SparkSummitEU. davies you made a similar change for the infer-schema path in https://github.com/apache/spark/pull/6606

Author: Jason White <jason.white@shopify.com>

Closes #9392 from JasonMWhite/createDataFrame_without_take.
2015-11-02 10:49:06 -08:00
Liang-Chi Hsieh 3dfa4ea526 [SPARK-11322] [PYSPARK] Keep full stack trace in captured exception
JIRA: https://issues.apache.org/jira/browse/SPARK-11322

As reported by JoshRosen in [databricks/spark-redshift/issues/89](https://github.com/databricks/spark-redshift/issues/89#issuecomment-149828308), the exception-masking behavior sometimes makes debugging harder. To deal with this issue, we should keep full stack trace in the captured exception.

Author: Liang-Chi Hsieh <viirya@appier.com>

Closes #9283 from viirya/py-exception-stacktrace.
2015-10-28 21:45:00 -07:00
Reynold Xin 5aa0521911 [SPARK-11292] [SQL] Python API for text data source
Adds DataFrameReader.text and DataFrameWriter.text.

Author: Reynold Xin <rxin@databricks.com>

Closes #9259 from rxin/SPARK-11292.
2015-10-28 14:28:38 -07:00
Yanbo Liang f92b7b98e9 [SPARK-11367][ML][PYSPARK] Python LinearRegression should support setting solver
[SPARK-10668](https://issues.apache.org/jira/browse/SPARK-10668) has provided ```WeightedLeastSquares``` solver("normal") in ```LinearRegression``` with L2 regularization in Scala and R, Python ML ```LinearRegression``` should also support setting solver("auto", "normal", "l-bfgs")

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #9328 from yanboliang/spark-11367.
2015-10-28 08:54:20 -07:00
Sean Owen 826e1e304b [SPARK-11302][MLLIB] 2) Multivariate Gaussian Model with Covariance matrix returns incorrect answer in some cases
Fix computation of root-sigma-inverse in multivariate Gaussian; add a test and fix related Python mixture model test.

Supersedes https://github.com/apache/spark/pull/9293

Author: Sean Owen <sowen@cloudera.com>

Closes #9309 from srowen/SPARK-11302.2.
2015-10-27 23:07:37 -07:00
vectorijk 9dba5fb2b5 [SPARK-10024][PYSPARK] Python API RF and GBT related params clear up
implement {RandomForest, GBT, TreeEnsemble, TreeClassifier, TreeRegressor}Params for Python API
in pyspark/ml/{classification, regression}.py

Author: vectorijk <jiangkai@gmail.com>

Closes #9233 from vectorijk/spark-10024.
2015-10-27 13:55:03 -07:00
Mike Dusenberry 3bdbbc6c97 [SPARK-6488][MLLIB][PYTHON] Support addition/multiplication in PySpark's BlockMatrix
This PR adds addition and multiplication to PySpark's `BlockMatrix` class via `add` and `multiply` functions.

Author: Mike Dusenberry <mwdusenb@us.ibm.com>

Closes #9139 from dusenberrymw/SPARK-6488_Add_Addition_and_Multiplication_to_PySpark_BlockMatrix.
2015-10-27 11:05:14 -07:00
Nick Evans 8f888eea1a [SPARK-11270][STREAMING] Add improved equality testing for TopicAndPartition from the Kafka Streaming API
jerryshao tdas

I know this is kind of minor, and I know you all are busy, but this brings this class in line with the `OffsetRange` class, and makes tests a little more concise.

Instead of doing something like:
```
assert topic_and_partition_instance._topic == "foo"
assert topic_and_partition_instance._partition == 0
```

You can do something like:
```
assert topic_and_partition_instance == TopicAndPartition("foo", 0)
```

Before:
```
>>> from pyspark.streaming.kafka import TopicAndPartition
>>> TopicAndPartition("foo", 0) == TopicAndPartition("foo", 0)
False
```

After:
```
>>> from pyspark.streaming.kafka import TopicAndPartition
>>> TopicAndPartition("foo", 0) == TopicAndPartition("foo", 0)
True
```

I couldn't find any tests - am I missing something?

Author: Nick Evans <me@nicolasevans.org>

Closes #9236 from manygrams/topic_and_partition_equality.
2015-10-27 01:29:06 -07:00
noelsmith 5d4f6abec4 [SPARK-10271][PYSPARK][MLLIB] Added @since tags to pyspark.mllib.clustering
Duplicated the since decorator from pyspark.sql into pyspark (also tweaked to handle functions without docstrings).

Added since to methods + "versionadded::" to classes (derived from the git file history in pyspark).

Author: noelsmith <mail@noelsmith.com>

Closes #8627 from noel-smith/SPARK-10271-since-mllib-clustering.
2015-10-26 21:28:18 -07:00
Jeff Zhang 05c4bdb579 [SPARK-11279][PYSPARK] Add DataFrame#toDF in PySpark
Author: Jeff Zhang <zjffdu@apache.org>

Closes #9248 from zjffdu/SPARK-11279.
2015-10-26 09:25:19 +01:00
Yu ISHIKAWA 282a15f78e [SPARK-10277] [MLLIB] [PYSPARK] Add @since annotation to pyspark.mllib.regression
Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #8684 from yu-iskw/SPARK-10277.
2015-10-23 08:43:49 -07:00
Gábor Lipták 163d53e829 [SPARK-7021] Add JUnit output for Python unit tests
WIP

Author: Gábor Lipták <gliptak@gmail.com>

Closes #8323 from gliptak/SPARK-7021.
2015-10-22 15:27:11 -07:00
Jeff Zhang 5cdea7d1e5 [SPARK-11205][PYSPARK] Delegate to scala DataFrame API rather than p…
…rint in python

No test needed. Verify it manually in pyspark shell

Author: Jeff Zhang <zjffdu@apache.org>

Closes #9177 from zjffdu/SPARK-11205.
2015-10-20 23:58:27 -07:00
Xiangrui Meng 135ade9050 [MINOR][ML] fix doc warnings
Without an empty line, sphinx will treat doctest as docstring. holdenk

~~~
/Users/meng/src/spark/python/pyspark/ml/feature.py:docstring of pyspark.ml.feature.CountVectorizer:3: ERROR: Undefined substitution referenced: "label|raw |vectors | +-----+---------------+-------------------------+ |0 |[a, b, c] |(3,[0,1,2],[1.0,1.0,1.0])".
/Users/meng/src/spark/python/pyspark/ml/feature.py:docstring of pyspark.ml.feature.CountVectorizer:3: ERROR: Undefined substitution referenced: "1 |[a, b, b, c, a]|(3,[0,1,2],[2.0,2.0,1.0])".
~~~

Author: Xiangrui Meng <meng@databricks.com>

Closes #9188 from mengxr/py-count-vec-doc-fix.
2015-10-20 18:38:06 -07:00
Holden Karau aea7142c98 [SPARK-10767][PYSPARK] Make pyspark shared params codegen more consistent
Namely "." shows up in some places in the template when using the param docstring and not in others

Author: Holden Karau <holden@pigscanfly.ca>

Closes #9017 from holdenk/SPARK-10767-Make-pyspark-shared-params-codegen-more-consistent.
2015-10-20 16:51:32 -07:00
noelsmith 04521ea067 [SPARK-10269][PYSPARK][MLLIB] Add @since annotation to pyspark.mllib.classification
Duplicated the since decorator from pyspark.sql into pyspark (also tweaked to handle functions without docstrings).

Added since to methods + "versionadded::" to classes derived from the file history.

Note - some methods are inherited from the regression module (i.e. LinearModel.intercept) so these won't have version numbers in the API docs until that model is updated.

Author: noelsmith <mail@noelsmith.com>

Closes #8626 from noel-smith/SPARK-10269-since-mlib-classification.
2015-10-20 16:14:20 -07:00
noelsmith 82e9d9c81b [SPARK-10272][PYSPARK][MLLIB] Added @since tags to pyspark.mllib.evaluation
Duplicated the since decorator from pyspark.sql into pyspark (also tweaked to handle functions without docstrings).

Added since to public methods + "versionadded::" to classes (derived from the git file history in pyspark).

Note - I added also the tags to MultilabelMetrics even though it isn't declared as public in the __all__ statement... if that's incorrect - I'll remove.

Author: noelsmith <mail@noelsmith.com>

Closes #8628 from noel-smith/SPARK-10272-since-mllib-evalutation.
2015-10-20 15:05:02 -07:00
Holden Karau e18b571c33 [SPARK-10447][SPARK-3842][PYSPARK] upgrade pyspark to py4j0.9
Upgrade to Py4j0.9

Author: Holden Karau <holden@pigscanfly.ca>
Author: Holden Karau <holden@us.ibm.com>

Closes #8615 from holdenk/SPARK-10447-upgrade-pyspark-to-py4j0.9.
2015-10-20 10:52:49 -07:00
Davies Liu 232d7f8d42 [SPARK-11114][PYSPARK] add getOrCreate for SparkContext/SQLContext in Python
Also added SQLContext.newSession()

Author: Davies Liu <davies@databricks.com>

Closes #9122 from davies/py_create.
2015-10-19 16:18:20 -07:00
Brennon York d3180c25d8 [SPARK-7018][BUILD] Refactor dev/run-tests-jenkins into Python
This commit refactors the `run-tests-jenkins` script into Python. This refactoring was done by brennonyork in #7401; this PR contains a few minor edits from joshrosen in order to bring it up to date with other recent changes.

From the original PR description (by brennonyork):

Currently a few things are left out that, could and I think should, be smaller JIRA's after this.

1. There are still a few areas where we use environment variables where we don't need to (like `CURRENT_BLOCK`). I might get around to fixing this one in lieu of everything else, but wanted to point that out.
2. The PR tests are still written in bash. I opted to not change those and just rewrite the runner into Python. This is a great follow-on JIRA IMO.
3. All of the linting scripts are still in bash as well and would likely do to just add those in as follow-on JIRA's as well.

Closes #7401.

Author: Brennon York <brennon.york@capitalone.com>

Closes #9161 from JoshRosen/run-tests-jenkins-refactoring.
2015-10-18 22:45:27 -07:00
Mahmoud Lababidi a337c235a1 [SPARK-11158][SQL] Modified _verify_type() to be more informative on Errors by presenting the Object
The _verify_type() function had Errors that were raised when there were Type conversion issues but left out the Object in question. The Object is now added in the Error to reduce the strain on the user to debug through to figure out the Object that failed the Type conversion.

The use case for me was a Pandas DataFrame that contained 'nan' as values for columns of Strings.

Author: Mahmoud Lababidi <mahmoud@thehumangeo.com>
Author: Mahmoud Lababidi <lababidi@gmail.com>

Closes #9149 from lababidi/master.
2015-10-18 11:39:19 -07:00
Koert Kuipers 57f83e36d6 [SPARK-10185] [SQL] Feat sql comma separated paths
Make sure comma-separated paths get processed correcly in ResolvedDataSource for a HadoopFsRelationProvider

Author: Koert Kuipers <koert@tresata.com>

Closes #8416 from koertkuipers/feat-sql-comma-separated-paths.
2015-10-17 14:56:24 -07:00
zero323 8ac71d62d9 [SPARK-11084] [ML] [PYTHON] Check if index can contain non-zero value before binary search
At this moment `SparseVector.__getitem__` executes `np.searchsorted` first and checks if result is in an expected range after that. It is possible to check if index can contain non-zero value before executing `np.searchsorted`.

Author: zero323 <matthew.szymkiewicz@gmail.com>

Closes #9098 from zero323/sparse_vector_getitem_improved.
2015-10-16 15:53:26 -07:00
Bhargav Mangipudi 1ec0a0dc28 [SPARK-11050] [MLLIB] PySpark SparseVector can return wrong index in e…
…rror message

For negative indices in the SparseVector, we update the index value. If we have an incorrect index
at this point, the error message has the incorrect *updated* index instead of the original one. This
change contains the fix for the same.

Author: Bhargav Mangipudi <bhargav.mangipudi@gmail.com>

Closes #9069 from bhargav/spark-10759.
2015-10-16 14:36:05 -07:00
Joseph K. Bradley c75f058b72 [PYTHON] [MINOR] List modules in PySpark tests when given bad name
Output list of supported modules for python tests in error message when given bad module name.

CC: davies

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #9088 from jkbradley/python-tests-modules.
2015-10-13 12:03:46 -07:00
Ashwin Shankar 2e572c4135 [SPARK-8170] [PYTHON] Add signal handler to trap Ctrl-C in pyspark and cancel all running jobs
This patch adds a signal handler to trap Ctrl-C and cancels running job.

Author: Ashwin Shankar <ashankar@netflix.com>

Closes #9033 from ashwinshankar77/master.
2015-10-12 11:06:21 -07:00
Vladimir Vladimirov c1b4ce4326 [SPARK-10535] Sync up API for matrix factorization model between Scala and PySpark
Support for recommendUsersForProducts and recommendProductsForUsers in matrix factorization model for PySpark

Author: Vladimir Vladimirov <vladimir.vladimirov@magnetic.com>

Closes #8700 from smartkiwi/SPARK-10535_.
2015-10-09 14:16:13 -07:00
Bryan Cutler 5410747a84 [SPARK-10959] [PYSPARK] StreamingLogisticRegressionWithSGD does not train with given regParam and convergenceTol parameters
These params were being passed into the StreamingLogisticRegressionWithSGD constructor, but not transferred to the call for model training.  Same with StreamingLinearRegressionWithSGD.  I added the params as named arguments to the call and also fixed the intercept parameter, which was being passed as regularization value.

Author: Bryan Cutler <bjcutler@us.ibm.com>

Closes #9002 from BryanCutler/StreamingSGD-convergenceTol-bug-10959.
2015-10-08 22:21:07 -07:00
zero323 8e67882b90 [SPARK-10973] [ML] [PYTHON] __gettitem__ method throws IndexError exception when we…
__gettitem__ method throws IndexError exception when we try to access index after the last non-zero entry

    from pyspark.mllib.linalg import Vectors
    sv = Vectors.sparse(5, {1: 3})
    sv[0]
    ## 0.0
    sv[1]
    ## 3.0
    sv[2]
    ## Traceback (most recent call last):
    ##   File "<stdin>", line 1, in <module>
    ##   File "/python/pyspark/mllib/linalg/__init__.py", line 734, in __getitem__
    ##     row_ind = inds[insert_index]
    ## IndexError: index out of bounds

Author: zero323 <matthew.szymkiewicz@gmail.com>

Closes #9009 from zero323/sparse_vector_index_error.
2015-10-08 18:34:15 -07:00
Holden Karau 3aff0866a8 [SPARK-9774] [ML] [PYSPARK] Add python api for ml regression isotonicregression
Add the Python API for isotonicregression.

Author: Holden Karau <holden@pigscanfly.ca>

Closes #8214 from holdenk/SPARK-9774-add-python-api-for-ml-regression-isotonicregression.
2015-10-07 17:50:35 -07:00
Evan Chen da936fbb74 [SPARK-10779] [PYSPARK] [MLLIB] Set initialModel for KMeans model in PySpark (spark.mllib)
Provide initialModel param for pyspark.mllib.clustering.KMeans

Author: Evan Chen <chene@us.ibm.com>

Closes #8967 from evanyc15/SPARK-10779-pyspark-mllib.
2015-10-07 15:04:53 -07:00
Xiangrui Meng 5e035403d4 [SPARK-10957] [ML] setParams changes quantileProbabilities unexpectly in PySpark's AFTSurvivalRegression
If user doesn't specify `quantileProbs` in `setParams`, it will get reset to the default value. We don't need special handling here. vectorijk yanboliang

Author: Xiangrui Meng <meng@databricks.com>

Closes #9001 from mengxr/SPARK-10957.
2015-10-06 14:58:42 -07:00
vectorijk 5952bdb7df [SPARK-10688] [ML] [PYSPARK] Python API for AFTSurvivalRegression
Implement Python API for AFTSurvivalRegression

Author: vectorijk <jiangkai@gmail.com>

Closes #8926 from vectorijk/spark-10688.
2015-10-06 12:43:28 -07:00
asokadiggs c1ad373f26 [SPARK-10782] [PYTHON] Update dropDuplicates documentation
Documentation for dropDuplicates() and drop_duplicates() is one and the same.  Resolved the error in the example for drop_duplicates using the same approach used for groupby and groupBy, by indicating that dropDuplicates and drop_duplicates are aliases.

Author: asokadiggs <asoka.diggs@intel.com>

Closes #8930 from asokadiggs/jira-10782.
2015-09-29 17:45:18 -04:00
Erik Shilts 7d399c9daa [SPARK-6919] [PYSPARK] Add asDict method to StatCounter
Add method to easily convert a StatCounter instance into a Python dict

https://issues.apache.org/jira/browse/SPARK-6919

Note: This is my original work and the existing Spark license applies.

Author: Erik Shilts <erik.shilts@opower.com>

Closes #5516 from eshilts/statcounter-asdict.
2015-09-29 13:38:15 -07:00
noelsmith ab41864f91 [SPARK-10415] [PYSPARK] [MLLIB] [DOCS] Enhance Navigation Sidebar in PySpark API
These are CSS/JavaScript changes changes to make navigation in the PySpark API a bit simpler by adding the following to the sidebar:

* Classes
* Functions
* Tags to highlight experimental features

![screen shot 2015-09-02 at 08 50 12](https://cloud.githubusercontent.com/assets/11915197/9634781/301f853a-518b-11e5-8d5c-fda202f6202f.png)

Online example here: https://dl.dropboxusercontent.com/u/20821334/pyspark-api-nav-enhance/pyspark.mllib.html

(The contribution is my original work and that I license the work to the project under the project's open source license)

Author: noelsmith <mail@noelsmith.com>

Closes #8571 from noel-smith/pyspark-api-nav-enhance.
2015-09-29 13:25:38 -07:00
Eric Liang 922338812c [SPARK-9681] [ML] Support R feature interactions in RFormula
This integrates the Interaction feature transformer with SparkR R formula support (i.e. support `:`).

To generate reasonable ML attribute names for feature interactions, it was necessary to add the ability to read attribute the original attribute names back from `StructField`, and also to specify custom group prefixes in `VectorAssembler`. This also has the side-benefit of cleaning up the double-underscores in the attributes generated for non-interaction terms.

mengxr

Author: Eric Liang <ekl@databricks.com>

Closes #8830 from ericl/interaction-2.
2015-09-25 00:43:22 -07:00
Reynold Xin 9952217749 [SPARK-10731] [SQL] Delegate to Scala's DataFrame.take implementation in Python DataFrame.
Python DataFrame.head/take now requires scanning all the partitions. This pull request changes them to delegate the actual implementation to Scala DataFrame (by calling DataFrame.take).

This is more of a hack for fixing this issue in 1.5.1. A more proper fix is to change executeCollect and executeTake to return InternalRow rather than Row, and thus eliminate the extra round-trip conversion.

Author: Reynold Xin <rxin@databricks.com>

Closes #8876 from rxin/SPARK-10731.
2015-09-23 16:43:21 -07:00
Liang-Chi Hsieh 1fcefef069 [SPARK-10446][SQL] Support to specify join type when calling join with usingColumns
JIRA: https://issues.apache.org/jira/browse/SPARK-10446

Currently the method `join(right: DataFrame, usingColumns: Seq[String])` only supports inner join. It is more convenient to have it support other join types.

Author: Liang-Chi Hsieh <viirya@appier.com>

Closes #8600 from viirya/usingcolumns_df.
2015-09-21 23:46:00 -07:00
Jian Feng 0180b849db [SPARK-10577] [PYSPARK] DataFrame hint for broadcast join
https://issues.apache.org/jira/browse/SPARK-10577

Author: Jian Feng <jzhang.chs@gmail.com>

Closes #8801 from Jianfeng-chs/master.
2015-09-21 23:36:41 -07:00
Sean Owen bf20d6c9f9 [SPARK-10716] [BUILD] spark-1.5.0-bin-hadoop2.6.tgz file doesn't uncompress on OS X due to hidden file
Remove ._SUCCESS.crc hidden file that may cause problems in distribution tar archive, and is not used

Author: Sean Owen <sowen@cloudera.com>

Closes #8846 from srowen/SPARK-10716.
2015-09-21 23:29:59 -07:00
Holden Karau 1cd6741572 [SPARK-9821] [PYSPARK] pyspark-reduceByKey-should-take-a-custom-partitioner
from the issue:

In Scala, I can supply a custom partitioner to reduceByKey (and other aggregation/repartitioning methods like aggregateByKey and combinedByKey), but as far as I can tell from the Pyspark API, there's no way to do the same in Python.
Here's an example of my code in Scala:
weblogs.map(s => (getFileType(s), 1)).reduceByKey(new FileTypePartitioner(),_+_)
But I can't figure out how to do the same in Python. The closest I can get is to call repartition before reduceByKey like so:
weblogs.map(lambda s: (getFileType(s), 1)).partitionBy(3,hash_filetype).reduceByKey(lambda v1,v2: v1+v2).collect()
But that defeats the purpose, because I'm shuffling twice instead of once, so my performance is worse instead of better.

Author: Holden Karau <holden@pigscanfly.ca>

Closes #8569 from holdenk/SPARK-9821-pyspark-reduceByKey-should-take-a-custom-partitioner.
2015-09-21 23:21:24 -07:00
noelsmith 7c4f852bfc [DOC] [PYSPARK] [MLLIB] Added newlines to docstrings to fix parameter formatting
Added newlines before `:param ...:` and `:return:` markup. Without these, parameter lists aren't formatted correctly in the API docs. I.e:

![screen shot 2015-09-21 at 21 49 26](https://cloud.githubusercontent.com/assets/11915197/10004686/de3c41d4-60aa-11e5-9c50-a46dcb51243f.png)

.. looks like this once newline is added:

![screen shot 2015-09-21 at 21 50 14](https://cloud.githubusercontent.com/assets/11915197/10004706/f86bfb08-60aa-11e5-8524-ae4436713502.png)

Author: noelsmith <mail@noelsmith.com>

Closes #8851 from noel-smith/docstring-missing-newline-fix.
2015-09-21 14:24:19 -07:00
Holden Karau ba882db6f4 [SPARK-9769] [ML] [PY] add python api for countvectorizermodel
From JIRA: Add Python API, user guide and example for ml.feature.CountVectorizerModel

Author: Holden Karau <holden@pigscanfly.ca>

Closes #8561 from holdenk/SPARK-9769-add-python-api-for-countvectorizermodel.
2015-09-21 13:06:23 -07:00
vinodkc 0144039517 [SPARK-10631] [DOCUMENTATION, MLLIB, PYSPARK] Added documentation for few APIs
There are some missing API docs in pyspark.mllib.linalg.Vector (including DenseVector and SparseVector). We should add them based on their Scala counterparts.

Author: vinodkc <vinod.kc.in@gmail.com>

Closes #8834 from vinodkc/fix_SPARK-10631.
2015-09-20 22:55:24 -07:00
Josh Rosen 2117eea71e [SPARK-10710] Remove ability to disable spilling in core and SQL
It does not make much sense to set `spark.shuffle.spill` or `spark.sql.planner.externalSort` to false: I believe that these configurations were initially added as "escape hatches" to guard against bugs in the external operators, but these operators are now mature and well-tested. In addition, these configurations are not handled in a consistent way anymore: SQL's Tungsten codepath ignores these configurations and will continue to use spilling operators. Similarly, Spark Core's `tungsten-sort` shuffle manager does not respect `spark.shuffle.spill=false`.

This pull request removes these configurations, adds warnings at the appropriate places, and deletes a large amount of code which was only used in code paths that did not support spilling.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #8831 from JoshRosen/remove-ability-to-disable-spilling.
2015-09-19 21:40:21 -07:00
Yanbo Liang 35e8ab9390 [SPARK-10615] [PYSPARK] change assertEquals to assertEqual
As ```assertEquals``` is deprecated, so we need to change ```assertEquals``` to ```assertEqual``` for existing python unit tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #8814 from yanboliang/spark-10615.
2015-09-18 09:53:52 -07:00
Liang-Chi Hsieh 136c77d8bb [SPARK-10642] [PYSPARK] Fix crash when calling rdd.lookup() on tuple keys
JIRA: https://issues.apache.org/jira/browse/SPARK-10642

When calling `rdd.lookup()` on a RDD with tuple keys, `portable_hash` will return a long. That causes `DAGScheduler.submitJob` to throw `java.lang.ClassCastException: java.lang.Long cannot be cast to java.lang.Integer`.

Author: Liang-Chi Hsieh <viirya@appier.com>

Closes #8796 from viirya/fix-pyrdd-lookup.
2015-09-17 10:02:15 -07:00
Yu ISHIKAWA 268088b899 [SPARK-10282] [ML] [PYSPARK] [DOCS] Add @since annotation to pyspark.ml.recommendation
Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #8692 from yu-iskw/SPARK-10282.
2015-09-17 08:51:19 -07:00
Yu ISHIKAWA c74d38fd8f [SPARK-10274] [MLLIB] Add @since annotation to pyspark.mllib.fpm
Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #8665 from yu-iskw/SPARK-10274.
2015-09-17 08:50:46 -07:00
Yu ISHIKAWA 4a0b56e8db [SPARK-10279] [MLLIB] [PYSPARK] [DOCS] Add @since annotation to pyspark.mllib.util
Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #8689 from yu-iskw/SPARK-10279.
2015-09-17 08:50:00 -07:00
Yu ISHIKAWA 39b44cb52e [SPARK-10278] [MLLIB] [PYSPARK] Add @since annotation to pyspark.mllib.tree
Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #8685 from yu-iskw/SPARK-10278.
2015-09-17 08:48:45 -07:00
Yu ISHIKAWA 0ded87a4d4 [SPARK-10281] [ML] [PYSPARK] [DOCS] Add @since annotation to pyspark.ml.clustering
Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #8691 from yu-iskw/SPARK-10281.
2015-09-17 08:47:21 -07:00
Yu ISHIKAWA 29bf8aa5a5 [SPARK-10283] [ML] [PYSPARK] [DOCS] Add @since annotation to pyspark.ml.regression
Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #8693 from yu-iskw/SPARK-10283.
2015-09-17 08:45:20 -07:00
Yu ISHIKAWA c633ed3260 [SPARK-10284] [ML] [PYSPARK] [DOCS] Add @since annotation to pyspark.ml.tuning
Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #8694 from yu-iskw/SPARK-10284.
2015-09-17 08:43:59 -07:00
Yu ISHIKAWA d9b7f3e4db [SPARK-10276] [MLLIB] [PYSPARK] Add @since annotation to pyspark.mllib.recommendation
Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #8677 from yu-iskw/SPARK-10276.
2015-09-16 04:34:14 -07:00
Vinod K C 95b6a8103f [SPARK-10516] [ MLLIB] Added values property in DenseVector
Author: Vinod K C <vinod.kc@huawei.com>

Closes #8682 from vinodkc/fix_SPARK-10516.
2015-09-15 23:25:51 -07:00
noelsmith 7ca30b505c [PYSPARK] [MLLIB] [DOCS] Replaced addversion with versionadded in mllib.random
Missed this when reviewing `pyspark.mllib.random` for SPARK-10275.

Author: noelsmith <mail@noelsmith.com>

Closes #8773 from noel-smith/mllib-random-versionadded-fix.
2015-09-15 12:23:20 -07:00
Yu ISHIKAWA a2249359d5 [SPARK-10275] [MLLIB] Add @since annotation to pyspark.mllib.random
Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #8666 from yu-iskw/SPARK-10275.
2015-09-14 21:59:40 -07:00
noelsmith 610971ecfe [SPARK-10273] Add @since annotation to pyspark.mllib.feature
Duplicated the since decorator from pyspark.sql into pyspark (also tweaked to handle functions without docstrings).

Added since to methods + "versionadded::" to classes (derived from the git file history in pyspark).

Author: noelsmith <mail@noelsmith.com>

Closes #8633 from noel-smith/SPARK-10273-since-mllib-feature.
2015-09-14 21:58:52 -07:00
Yanbo Liang 4ae4d54794 [SPARK-9793] [MLLIB] [PYSPARK] PySpark DenseVector, SparseVector implement __eq__ and __hash__ correctly
PySpark DenseVector, SparseVector ```__eq__``` method should use semantics equality, and DenseVector can compared with SparseVector.
Implement PySpark DenseVector, SparseVector ```__hash__``` method based on the first 16 entries. That will make PySpark Vector objects can be used in collections.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #8166 from yanboliang/spark-9793.
2015-09-14 21:37:43 -07:00
Davies Liu 5520418100 [SPARK-10542] [PYSPARK] fix serialize namedtuple
Author: Davies Liu <davies@databricks.com>

Closes #8707 from davies/fix_namedtuple.
2015-09-14 19:46:34 -07:00
Yanbo Liang ce6f3f163b [SPARK-10194] [MLLIB] [PYSPARK] SGD algorithms need convergenceTol parameter in Python
[SPARK-3382](https://issues.apache.org/jira/browse/SPARK-3382) added a ```convergenceTol``` parameter for GradientDescent-based methods in Scala. We need that parameter in Python; otherwise, Python users will not be able to adjust that behavior (or even reproduce behavior from previous releases since the default changed).

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #8457 from yanboliang/spark-10194.
2015-09-14 12:08:52 -07:00
JihongMa f4a22808e0 [SPARK-6548] Adding stddev to DataFrame functions
Adding STDDEV support for DataFrame using 1-pass online /parallel algorithm to compute variance. Please review the code change.

Author: JihongMa <linlin200605@gmail.com>
Author: Jihong MA <linlin200605@gmail.com>
Author: Jihong MA <jihongma@jihongs-mbp.usca.ibm.com>
Author: Jihong MA <jihongma@Jihongs-MacBook-Pro.local>

Closes #6297 from JihongMA/SPARK-SQL.
2015-09-12 10:17:15 -07:00
0x0FFF c34fc19765 [SPARK-9014] [SQL] Allow Python spark API to use built-in exponential operator
This PR addresses (SPARK-9014)[https://issues.apache.org/jira/browse/SPARK-9014]
Added functionality: `Column` object in Python now supports exponential operator `**`
Example:
```
from pyspark.sql import *
df = sqlContext.createDataFrame([Row(a=2)])
df.select(3**df.a,df.a**3,df.a**df.a).collect()
```
Outputs:
```
[Row(POWER(3.0, a)=9.0, POWER(a, 3.0)=8.0, POWER(a, a)=4.0)]
```

Author: 0x0FFF <programmerag@gmail.com>

Closes #8658 from 0x0FFF/SPARK-9014.
2015-09-11 15:19:04 -07:00
Icaro Medeiros c373866774 [PYTHON] Fixed typo in exception message
Just fixing a typo in exception message, raised when attempting to pickle SparkContext.

Author: Icaro Medeiros <icaro.medeiros@gmail.com>

Closes #8724 from icaromedeiros/master.
2015-09-11 21:46:52 +01:00
Yuhao Yang 5f46444765 [SPARK-8530] [ML] add python API for MinMaxScaler
jira: https://issues.apache.org/jira/browse/SPARK-8530

add python API for MinMaxScaler
jira for MinMaxScaler: https://issues.apache.org/jira/browse/SPARK-7514

Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #7150 from hhbyyh/pythonMinMax.
2015-09-11 10:32:35 -07:00
Joseph K. Bradley 2e3a280754 [MINOR] [MLLIB] [ML] [DOC] Minor doc fixes for StringIndexer and MetadataUtils
Changes:
* Make Scala doc for StringIndexerInverse clearer.  Also remove Scala doc from transformSchema, so that the doc is inherited.
* MetadataUtils.scala: “ Helper utilities for tree-based algorithms” —> not just trees anymore

CC: holdenk mengxr

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #8679 from jkbradley/doc-fixes-1.5.
2015-09-11 08:55:35 -07:00
Yanbo Liang b01b262606 [SPARK-9773] [ML] [PySpark] Add Python API for MultilayerPerceptronClassifier
Add Python API for ```MultilayerPerceptronClassifier```.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #8067 from yanboliang/SPARK-9773.
2015-09-11 08:52:28 -07:00
Yanbo Liang b656e6134f [SPARK-10026] [ML] [PySpark] Implement some common Params for regression in PySpark
LinearRegression and LogisticRegression lack of some Params for Python, and some Params are not shared classes which lead we need to write them for each class. These kinds of Params are list here:
```scala
HasElasticNetParam
HasFitIntercept
HasStandardization
HasThresholds
```
Here we implement them in shared params at Python side and make LinearRegression/LogisticRegression parameters peer with Scala one.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #8508 from yanboliang/spark-10026.
2015-09-11 08:50:35 -07:00
Yanbo Liang a140dd77c6 [SPARK-10027] [ML] [PySpark] Add Python API missing methods for ml.feature
Missing method of ml.feature are listed here:
```StringIndexer``` lacks of parameter ```handleInvalid```.
```StringIndexerModel``` lacks of method ```labels```.
```VectorIndexerModel``` lacks of methods ```numFeatures``` and ```categoryMaps```.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #8313 from yanboliang/spark-10027.
2015-09-10 20:43:38 -07:00
Yanbo Liang 89562a172f [SPARK-7544] [SQL] [PySpark] pyspark.sql.types.Row implements __getitem__
pyspark.sql.types.Row implements ```__getitem__```

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #8333 from yanboliang/spark-7544.
2015-09-10 13:54:20 -07:00
Yanbo Liang 56a0fe5c6e [SPARK-9772] [PYSPARK] [ML] Add Python API for ml.feature.VectorSlicer
Add Python API for ml.feature.VectorSlicer.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #8102 from yanboliang/SPARK-9772.
2015-09-09 18:02:33 -07:00
Holden Karau 2f6fd5256c [SPARK-9654] [ML] [PYSPARK] Add IndexToString to PySpark
Adds IndexToString to PySpark.

Author: Holden Karau <holden@pigscanfly.ca>

Closes #7976 from holdenk/SPARK-9654-add-string-indexer-inverse-in-pyspark.
2015-09-08 22:13:05 -07:00
noelsmith 0e2f216331 [SPARK-10094] Pyspark ML Feature transformers marked as experimental
Modified class-level docstrings to mark all feature transformers in pyspark.ml as experimental.

Author: noelsmith <mail@noelsmith.com>

Closes #8623 from noel-smith/SPARK-10094-mark-pyspark-ml-trans-exp.
2015-09-08 21:26:20 -07:00
Davies Liu 3a11e50e21 [SPARK-10373] [PYSPARK] move @since into pyspark from sql
cc mengxr

Author: Davies Liu <davies@databricks.com>

Closes #8657 from davies/move_since.
2015-09-08 20:56:22 -07:00
Tathagata Das 7a4f326c00 [SPARK-10440] [STREAMING] [DOCS] Update python API stuff in the programming guides and python docs
- Fixed information around Python API tags in streaming programming guides
- Added missing stuff in python docs

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #8595 from tdas/SPARK-10440.
2015-09-04 23:16:39 -10:00
0x0FFF 6cd98c1878 [SPARK-10417] [SQL] Iterating through Column results in infinite loop
`pyspark.sql.column.Column` object has `__getitem__` method, which makes it iterable for Python. In fact it has `__getitem__` to address the case when the column might be a list or dict, for you to be able to access certain element of it in DF API. The ability to iterate over it is just a side effect that might cause confusion for the people getting familiar with Spark DF (as you might iterate this way on Pandas DF for instance)

Issue reproduction:
```
df = sqlContext.jsonRDD(sc.parallelize(['{"name": "El Magnifico"}']))
for i in df["name"]: print i
```

Author: 0x0FFF <programmerag@gmail.com>

Closes #8574 from 0x0FFF/SPARK-10417.
2015-09-02 13:36:36 -07:00
0x0FFF 00d9af5e19 [SPARK-10392] [SQL] Pyspark - Wrong DateType support on JDBC connection
This PR addresses issue [SPARK-10392](https://issues.apache.org/jira/browse/SPARK-10392)
The problem is that for "start of epoch" date (01 Jan 1970) PySpark class DateType returns 0 instead of the `datetime.date` due to implementation of its return statement

Issue reproduction on master:
```
>>> from pyspark.sql.types import *
>>> a = DateType()
>>> a.fromInternal(0)
0
>>> a.fromInternal(1)
datetime.date(1970, 1, 2)
```

Author: 0x0FFF <programmerag@gmail.com>

Closes #8556 from 0x0FFF/SPARK-10392.
2015-09-01 14:58:49 -07:00
0x0FFF bf550a4b55 [SPARK-10162] [SQL] Fix the timezone omitting for PySpark Dataframe filter function
This PR addresses [SPARK-10162](https://issues.apache.org/jira/browse/SPARK-10162)
The issue is with DataFrame filter() function, if datetime.datetime is passed to it:
* Timezone information of this datetime is ignored
* This datetime is assumed to be in local timezone, which depends on the OS timezone setting

Fix includes both code change and regression test. Problem reproduction code on master:
```python
import pytz
from datetime import datetime
from pyspark.sql import *
from pyspark.sql.types import *
sqc = SQLContext(sc)
df = sqc.createDataFrame([], StructType([StructField("dt", TimestampType())]))

m1 = pytz.timezone('UTC')
m2 = pytz.timezone('Etc/GMT+3')

df.filter(df.dt > datetime(2000, 01, 01, tzinfo=m1)).explain()
df.filter(df.dt > datetime(2000, 01, 01, tzinfo=m2)).explain()
```
It gives the same timestamp ignoring time zone:
```
>>> df.filter(df.dt > datetime(2000, 01, 01, tzinfo=m1)).explain()
Filter (dt#0 > 946713600000000)
 Scan PhysicalRDD[dt#0]

>>> df.filter(df.dt > datetime(2000, 01, 01, tzinfo=m2)).explain()
Filter (dt#0 > 946713600000000)
 Scan PhysicalRDD[dt#0]
```
After the fix:
```
>>> df.filter(df.dt > datetime(2000, 01, 01, tzinfo=m1)).explain()
Filter (dt#0 > 946684800000000)
 Scan PhysicalRDD[dt#0]

>>> df.filter(df.dt > datetime(2000, 01, 01, tzinfo=m2)).explain()
Filter (dt#0 > 946695600000000)
 Scan PhysicalRDD[dt#0]
```
PR [8536](https://github.com/apache/spark/pull/8536) was occasionally closed by me dropping the repo

Author: 0x0FFF <programmerag@gmail.com>

Closes #8555 from 0x0FFF/SPARK-10162.
2015-09-01 14:34:59 -07:00
Holden Karau e6e483cc4d [SPARK-9679] [ML] [PYSPARK] Add Python API for Stop Words Remover
Add a python API for the Stop Words Remover.

Author: Holden Karau <holden@pigscanfly.ca>

Closes #8118 from holdenk/SPARK-9679-python-StopWordsRemover.
2015-09-01 10:48:57 -07:00
Yanbo Liang 52ea399e6e [SPARK-10355] [ML] [PySpark] Add Python API for SQLTransformer
Add Python API for SQLTransformer

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #8527 from yanboliang/spark-10355.
2015-08-31 16:11:27 -07:00
Yanbo Liang 5b3245d6df [SPARK-8472] [ML] [PySpark] Python API for DCT
Add Python API for ml.feature.DCT.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #8485 from yanboliang/spark-8472.
2015-08-31 15:50:41 -07:00
noelsmith 7583681e6b [SPARK-10188] [PYSPARK] Pyspark CrossValidator with RMSE selects incorrect model
* Added isLargerBetter() method to Pyspark Evaluator to match the Scala version.
* JavaEvaluator delegates isLargerBetter() to underlying Scala object.
* Added check for isLargerBetter() in CrossValidator to determine whether to use argmin or argmax.
* Added test cases for where smaller is better (RMSE) and larger is better (R-Squared).

(This contribution is my original work and that I license the work to the project under Sparks' open source license)

Author: noelsmith <mail@noelsmith.com>

Closes #8399 from noel-smith/pyspark-rmse-xval-fix.
2015-08-27 23:59:30 -07:00
Yanbo Liang ce97834dc0 [SPARK-9964] [PYSPARK] [SQL] PySpark DataFrameReader accept RDD of String for JSON
PySpark DataFrameReader should could accept an RDD of Strings (like the Scala version does) for JSON, rather than only taking a path.
If this PR is merged, it should be duplicated to cover the other input types (not just JSON).

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #8444 from yanboliang/spark-9964.
2015-08-26 22:19:11 -07:00
Davies Liu d41d6c4820 [SPARK-10305] [SQL] fix create DataFrame from Python class
cc jkbradley

Author: Davies Liu <davies@databricks.com>

Closes #8470 from davies/fix_create_df.
2015-08-26 16:04:44 -07:00
Sean Owen 69c9c17716 [SPARK-9613] [CORE] Ban use of JavaConversions and migrate all existing uses to JavaConverters
Replace `JavaConversions` implicits with `JavaConverters`

Most occurrences I've seen so far are necessary conversions; a few have been avoidable. None are in critical code as far as I see, yet.

Author: Sean Owen <sowen@cloudera.com>

Closes #8033 from srowen/SPARK-9613.
2015-08-25 12:33:13 +01:00
zsxwing 4e0395ddb7 [SPARK-10168] [STREAMING] Fix the issue that maven publishes wrong artifact jars
This PR removed the `outputFile` configuration from pom.xml and updated `tests.py` to search jars for both sbt build and maven build.

I ran ` mvn -Pkinesis-asl -DskipTests clean install` locally, and verified the jars in my local repository were correct. I also checked Python tests for maven build, and it passed all tests.

Author: zsxwing <zsxwing@gmail.com>

Closes #8373 from zsxwing/SPARK-10168 and squashes the following commits:

e0b5818 [zsxwing] Fix the sbt build
c697627 [zsxwing] Add the jar pathes to the exception message
be1d8a5 [zsxwing] Fix the issue that maven publishes wrong artifact jars
2015-08-24 12:38:01 -07:00
Tathagata Das 053d94fcf3 [SPARK-10142] [STREAMING] Made python checkpoint recovery handle non-local checkpoint paths and existing SparkContexts
The current code only checks checkpoint files in local filesystem, and always tries to create a new Python SparkContext (even if one already exists). The solution is to do the following:
1. Use the same code path as Java to check whether a valid checkpoint exists
2. Create a new Python SparkContext only if there no active one.

There is not test for the path as its hard to test with distributed filesystem paths in a local unit test. I am going to test it with a distributed file system manually to verify that this patch works.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #8366 from tdas/SPARK-10142 and squashes the following commits:

3afa666 [Tathagata Das] Added tests
2dd4ae5 [Tathagata Das] Added the check to not create a context if one already exists
9bf151b [Tathagata Das] Made python checkpoint recovery use java to find the checkpoint files
2015-08-23 19:24:32 -07:00
jerryshao d89cc38b33 [SPARK-10122] [PYSPARK] [STREAMING] Fix getOffsetRanges bug in PySpark-Streaming transform function
Details of the bug and explanations can be seen in [SPARK-10122](https://issues.apache.org/jira/browse/SPARK-10122).

tdas , please help to review.

Author: jerryshao <sshao@hortonworks.com>

Closes #8347 from jerryshao/SPARK-10122 and squashes the following commits:

4039b16 [jerryshao] Fix getOffsetRanges in transform() bug
2015-08-21 13:15:35 -07:00
MechCoder 52c60537a2 [MINOR] [SQL] Fix sphinx warnings in PySpark SQL
Author: MechCoder <manojkumarsivaraj334@gmail.com>

Closes #8171 from MechCoder/sql_sphinx.
2015-08-20 10:05:31 -07:00
zsxwing 1f29d502e7 [SPARK-9812] [STREAMING] Fix Python 3 compatibility issue in PySpark Streaming and some docs
This PR includes the following fixes:
1. Use `range` instead of `xrange` in `queue_stream.py` to support Python 3.
2. Fix the issue that `utf8_decoder` will return `bytes` rather than `str` when receiving an empty `bytes` in Python 3.
3. Fix the commands in docs so that the user can copy them directly to the command line. The previous commands was broken in the middle of a path, so when copying to the command line, the path would be split to two parts by the extra spaces, which forces the user to fix it manually.

Author: zsxwing <zsxwing@gmail.com>

Closes #8315 from zsxwing/SPARK-9812.
2015-08-19 18:36:01 -07:00
Davies Liu 08887369c8 [SPARK-10073] [SQL] Python withColumn should replace the old column
DataFrame.withColumn in Python should be consistent with the Scala one (replacing the existing column  that has the same name).

cc marmbrus

Author: Davies Liu <davies@databricks.com>

Closes #8300 from davies/with_column.
2015-08-19 13:56:40 -07:00
Feynman Liang 28a98464ea [SPARK-10097] Adds shouldMaximize flag to ml.evaluation.Evaluator
Previously, users of evaluator (`CrossValidator` and `TrainValidationSplit`) would only maximize the metric in evaluator, leading to a hacky solution which negated metrics to be minimized and caused erroneous negative values to be reported to the user.

This PR adds a `isLargerBetter` attribute to the `Evaluator` base class, instructing users of `Evaluator` on whether the chosen metric should be maximized or minimized.

CC jkbradley

Author: Feynman Liang <fliang@databricks.com>
Author: Joseph K. Bradley <joseph@databricks.com>

Closes #8290 from feynmanliang/SPARK-10097.
2015-08-19 11:35:05 -07:00
Moussa Taifi 865a3df3d5 [DOCS] [SQL] [PYSPARK] Fix typo in ntile function
Fix typo in ntile function.

Author: Moussa Taifi <moutai10@gmail.com>

Closes #8261 from moutai/patch-2.
2015-08-19 09:42:41 +01:00
Yanbo Liang 0076e82123 [SPARK-9768] [PYSPARK] [ML] Add Python API and user guide for ml.feature.ElementwiseProduct
Add Python API, user guide and example for ml.feature.ElementwiseProduct.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #8061 from yanboliang/SPARK-9768.
2015-08-17 17:25:41 -07:00
Joseph K. Bradley 1db7179fae [SPARK-9805] [MLLIB] [PYTHON] [STREAMING] Added _eventually for ml streaming pyspark tests
Recently, PySpark ML streaming tests have been flaky, most likely because of the batches not being processed in time.  Proposal: Replace the use of _ssc_wait (which waits for a fixed amount of time) with a method which waits for a fixed amount of time but can terminate early based on a termination condition method.  With this, we can extend the waiting period (to make tests less flaky) but also stop early when possible (making tests faster on average, which I verified locally).

CC: mengxr tdas freeman-lab

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #8087 from jkbradley/streaming-ml-tests.
2015-08-15 18:48:20 -07:00
Wenchen Fan 1150a19b18 [SPARK-8670] [SQL] Nested columns can't be referenced in pyspark
This bug is caused by a wrong column-exist-check in `__getitem__` of pyspark dataframe. `DataFrame.apply` accepts not only top level column names, but also nested column name like `a.b`, so we should remove that check from `__getitem__`.

Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #8202 from cloud-fan/nested.
2015-08-14 14:09:46 -07:00
Davies Liu 11ed2b180e [SPARK-9978] [PYSPARK] [SQL] fix Window.orderBy and doc of ntile()
Author: Davies Liu <davies@databricks.com>

Closes #8213 from davies/fix_window.
2015-08-14 13:55:29 -07:00
MechCoder ffa05c84fe [SPARK-9828] [PYSPARK] Mutable values should not be default arguments
Author: MechCoder <manojkumarsivaraj334@gmail.com>

Closes #8110 from MechCoder/spark-9828.
2015-08-14 12:46:05 -07:00
Davies Liu 693949ba40 [SPARK-8976] [PYSPARK] fix open mode in python3
This bug only happen on Python 3 and Windows.

I tested this manually with python 3 and disable python daemon, no unit test yet.

Author: Davies Liu <davies@databricks.com>

Closes #8181 from davies/open_mode.
2015-08-13 17:33:37 -07:00
Davies Liu a8d2f4c5f9 [SPARK-9942] [PYSPARK] [SQL] ignore exceptions while try to import pandas
If pandas is broken (can't be imported, raise other exceptions other than ImportError), pyspark can't be imported, we should ignore all the exceptions.

Author: Davies Liu <davies@databricks.com>

Closes #8173 from davies/fix_pandas.
2015-08-13 14:03:55 -07:00
Xiangrui Meng 65fec798ce [MINOR] [DOC] fix mllib pydoc warnings
Switch to correct Sphinx syntax. MechCoder

Author: Xiangrui Meng <meng@databricks.com>

Closes #8169 from mengxr/mllib-pydoc-fix.
2015-08-13 10:16:40 -07:00
Xiangrui Meng 68f9957149 [SPARK-9918] [MLLIB] remove runs from k-means and rename epsilon to tol
This requires some discussion. I'm not sure whether `runs` is a useful parameter. It certainly complicates the implementation. We might want to optimize the k-means implementation with block matrix operations. In this case, having `runs` may not be worth the trade-off. Also it increases the communication cost in a single job, which might cause other issues.

This PR also renames `epsilon` to `tol` to have consistent naming among algorithms. The Python constructor is updated to include all parameters.

jkbradley yu-iskw

Author: Xiangrui Meng <meng@databricks.com>

Closes #8148 from mengxr/SPARK-9918 and squashes the following commits:

149b9e5 [Xiangrui Meng] fix constructor in Python and rename epsilon to tol
3cc15b3 [Xiangrui Meng] fix test and change initStep to initSteps in python
a0a0274 [Xiangrui Meng] remove runs from k-means in the pipeline API
2015-08-12 23:04:59 -07:00
Reynold Xin a17384fa34 [SPARK-9907] [SQL] Python crc32 is mistakenly calling md5
Author: Reynold Xin <rxin@databricks.com>

Closes #8138 from rxin/SPARK-9907.
2015-08-12 15:27:52 -07:00
Joseph K. Bradley 551def5d69 [SPARK-9789] [ML] Added logreg threshold param back
Reinstated LogisticRegression.threshold Param for binary compatibility.  Param thresholds overrides threshold, if set.

CC: mengxr dbtsai feynmanliang

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #8079 from jkbradley/logreg-reinstate-threshold.
2015-08-12 14:27:13 -07:00
Yanbo Liang 762bacc16a [SPARK-9766] [ML] [PySpark] check and add miss docs for PySpark ML
Check and add miss docs for PySpark ML (this issue only check miss docs for o.a.s.ml not o.a.s.mllib).

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #8059 from yanboliang/SPARK-9766.
2015-08-12 13:24:18 -07:00
Brennan Ashton 60103ecd3d [SPARK-9726] [PYTHON] PySpark DF join no longer accepts on=None
rxin

First pull request for Spark so let me know if I am missing anything
The contribution is my original work and I license the work to the project under the project's open source license.

Author: Brennan Ashton <bashton@brennanashton.com>

Closes #8016 from btashton/patch-1.
2015-08-12 11:57:30 -07:00
Tathagata Das 5b8bb1b213 [SPARK-9572] [STREAMING] [PYSPARK] Added StreamingContext.getActiveOrCreate() in Python
Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #8080 from tdas/SPARK-9572 and squashes the following commits:

64a231d [Tathagata Das] Fix based on comments
741a0d0 [Tathagata Das] Fixed style
f4f094c [Tathagata Das] Tweaked test
9afcdbe [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into SPARK-9572
e21488d [Tathagata Das] Minor update
1a371d9 [Tathagata Das] Addressed comments.
60479da [Tathagata Das] Fixed indent
9c2da9c [Tathagata Das] Fixed bugs
b5bd32c [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into SPARK-9572
b55b348 [Tathagata Das] Removed prints
5781728 [Tathagata Das] Fix style issues
b711214 [Tathagata Das] Reverted run-tests.py
643b59d [Tathagata Das] Revert unnecessary change
150e58c [Tathagata Das] Added StreamingContext.getActiveOrCreate() in Python
2015-08-11 12:02:28 -07:00
Tathagata Das 0f90d6055e [SPARK-9640] [STREAMING] [TEST] Do not run Python Kinesis tests when the Kinesis assembly JAR has not been generated
Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #7961 from tdas/SPARK-9640 and squashes the following commits:

974ce19 [Tathagata Das] Undo changes related to SPARK-9727
004ae26 [Tathagata Das] style fixes
9bbb97d [Tathagata Das] Minor style fies
e6a677e [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into SPARK-9640
ca90719 [Tathagata Das] Removed extra line
ba9cfc7 [Tathagata Das] Improved kinesis test selection logic
88d59bd [Tathagata Das] updated test modules
871fcc8 [Tathagata Das] Fixed SparkBuild
94be631 [Tathagata Das] Fixed style
b858196 [Tathagata Das] Fixed conditions and few other things based on PR comments.
e292e64 [Tathagata Das] Added filters for Kinesis python tests
2015-08-10 23:41:53 -07:00
Prabeesh K 853809e948 [SPARK-5155] [PYSPARK] [STREAMING] Mqtt streaming support in Python
This PR is based on #4229, thanks prabeesh.

Closes #4229

Author: Prabeesh K <prabsmails@gmail.com>
Author: zsxwing <zsxwing@gmail.com>
Author: prabs <prabsmails@gmail.com>
Author: Prabeesh K <prabeesh.k@namshi.com>

Closes #7833 from zsxwing/pr4229 and squashes the following commits:

9570bec [zsxwing] Fix the variable name and check null in finally
4a9c79e [zsxwing] Fix pom.xml indentation
abf5f18 [zsxwing] Merge branch 'master' into pr4229
935615c [zsxwing] Fix the flaky MQTT tests
47278c5 [zsxwing] Include the project class files
478f844 [zsxwing] Add unpack
5f8a1d4 [zsxwing] Make the maven build generate the test jar for Python MQTT tests
734db99 [zsxwing] Merge branch 'master' into pr4229
126608a [Prabeesh K] address the comments
b90b709 [Prabeesh K] Merge pull request #1 from zsxwing/pr4229
d07f454 [zsxwing] Register StreamingListerner before starting StreamingContext; Revert unncessary changes; fix the python unit test
a6747cb [Prabeesh K] wait for starting the receiver before publishing data
87fc677 [Prabeesh K] address the comments:
97244ec [zsxwing] Make sbt build the assembly test jar for streaming mqtt
80474d1 [Prabeesh K] fix
1f0cfe9 [Prabeesh K] python style fix
e1ee016 [Prabeesh K] scala style fix
a5a8f9f [Prabeesh K] added Python test
9767d82 [Prabeesh K] implemented Python-friendly class
a11968b [Prabeesh K] fixed python style
795ec27 [Prabeesh K] address comments
ee387ae [Prabeesh K] Fix assembly jar location of mqtt-assembly
3f4df12 [Prabeesh K] updated version
b34c3c1 [prabs] adress comments
3aa7fff [prabs] Added Python streaming mqtt word count example
b7d42ff [prabs] Mqtt streaming support in Python
2015-08-10 16:33:23 -07:00
Davies Liu ac507a03c3 [SPARK-6902] [SQL] [PYSPARK] Row should be read-only
Raise an read-only exception when user try to mutable a Row.

Author: Davies Liu <davies@databricks.com>

Closes #8009 from davies/readonly_row and squashes the following commits:

8722f3f [Davies Liu] add tests
05a3d36 [Davies Liu] Row should be read-only
2015-08-08 08:38:18 -07:00
Davies Liu 74a6541aa8 [SPARK-4561] [PYSPARK] [SQL] turn Row into dict recursively
Add an option `recursive` to `Row.asDict()`, when True (default is False), it will convert the nested Row into dict.

Author: Davies Liu <davies@databricks.com>

Closes #8006 from davies/as_dict and squashes the following commits:

922cc5a [Davies Liu] turn Row into dict recursively
2015-08-08 08:36:14 -07:00
Reynold Xin 05d04e10a8 [SPARK-9733][SQL] Improve physical plan explain for data sources
All data sources show up as "PhysicalRDD" in physical plan explain. It'd be better if we can show the name of the data source.

Without this patch:
```
== Physical Plan ==
NewAggregate with UnsafeHybridAggregationIterator ArrayBuffer(date#0, cat#1) ArrayBuffer((sum(CAST((CAST(count#2, IntegerType) + 1), LongType))2,mode=Final,isDistinct=false))
 Exchange hashpartitioning(date#0,cat#1)
  NewAggregate with UnsafeHybridAggregationIterator ArrayBuffer(date#0, cat#1) ArrayBuffer((sum(CAST((CAST(count#2, IntegerType) + 1), LongType))2,mode=Partial,isDistinct=false))
   PhysicalRDD [date#0,cat#1,count#2], MapPartitionsRDD[3] at
```

With this patch:
```
== Physical Plan ==
TungstenAggregate(key=[date#0,cat#1], value=[(sum(CAST((CAST(count#2, IntegerType) + 1), LongType)),mode=Final,isDistinct=false)]
 Exchange hashpartitioning(date#0,cat#1)
  TungstenAggregate(key=[date#0,cat#1], value=[(sum(CAST((CAST(count#2, IntegerType) + 1), LongType)),mode=Partial,isDistinct=false)]
   ConvertToUnsafe
    Scan ParquetRelation[file:/scratch/rxin/spark/sales4][date#0,cat#1,count#2]
```

Author: Reynold Xin <rxin@databricks.com>

Closes #8024 from rxin/SPARK-9733 and squashes the following commits:

811b90e [Reynold Xin] Fixed Python test case.
52cab77 [Reynold Xin] Cast.
eea9ccc [Reynold Xin] Fix test case.
fcecb22 [Reynold Xin] [SPARK-9733][SQL] Improve explain message for data source scan node.
2015-08-07 13:41:45 -07:00
Reynold Xin 4309262ec9 [SPARK-9700] Pick default page size more intelligently.
Previously, we use 64MB as the default page size, which was way too big for a lot of Spark applications (especially for single node).

This patch changes it so that the default page size, if unset by the user, is determined by the number of cores available and the total execution memory available.

Author: Reynold Xin <rxin@databricks.com>

Closes #8012 from rxin/pagesize and squashes the following commits:

16f4756 [Reynold Xin] Fixed failing test.
5afd570 [Reynold Xin] private...
0d5fb98 [Reynold Xin] Update default value.
674a6cd [Reynold Xin] Address review feedback.
dc00e05 [Reynold Xin] Merge with master.
73ebdb6 [Reynold Xin] [SPARK-9700] Pick default page size more intelligently.
2015-08-06 23:18:29 -07:00
Yin Huai baf4587a56 [SPARK-9691] [SQL] PySpark SQL rand function treats seed 0 as no seed
https://issues.apache.org/jira/browse/SPARK-9691

jkbradley rxin

Author: Yin Huai <yhuai@databricks.com>

Closes #7999 from yhuai/pythonRand and squashes the following commits:

4187e0c [Yin Huai] Regression test.
a985ef9 [Yin Huai] Use "if seed is not None" instead "if seed" because "if seed" returns false when seed is 0.
2015-08-06 17:03:14 -07:00
Reynold Xin 5e1b0ef079 [SPARK-9659][SQL] Rename inSet to isin to match Pandas function.
Inspiration drawn from this blog post: https://lab.getbase.com/pandarize-spark-dataframes/

Author: Reynold Xin <rxin@databricks.com>

Closes #7977 from rxin/isin and squashes the following commits:

9b1d3d6 [Reynold Xin] Added return.
2197d37 [Reynold Xin] Fixed test case.
7c1b6cf [Reynold Xin] Import warnings.
4f4a35d [Reynold Xin] [SPARK-9659][SQL] Rename inSet to isin to match Pandas function.
2015-08-06 10:39:16 -07:00
MechCoder 076ec05681 [SPARK-9533] [PYSPARK] [ML] Add missing methods in Word2Vec ML
After https://github.com/apache/spark/pull/7263 it is pretty straightforward to Python wrappers.

Author: MechCoder <manojkumarsivaraj334@gmail.com>

Closes #7930 from MechCoder/spark-9533 and squashes the following commits:

1bea394 [MechCoder] make getVectors a lazy val
5522756 [MechCoder] [SPARK-9533] [PySpark] [ML] Add missing methods in Word2Vec ML
2015-08-06 10:09:58 -07:00
zhichao.li aead18ffca [SPARK-8266] [SQL] add function translate
![translate](http://www.w3resource.com/PostgreSQL/postgresql-translate-function.png)

Author: zhichao.li <zhichao.li@intel.com>

Closes #7709 from zhichao-li/translate and squashes the following commits:

9418088 [zhichao.li] refine checking condition
f2ab77a [zhichao.li] clone string
9d88f2d [zhichao.li] fix indent
6aa2962 [zhichao.li] style
e575ead [zhichao.li] add python api
9d4bab0 [zhichao.li] add special case for fodable and refactor unittest
eda7ad6 [zhichao.li] update to use TernaryExpression
cdfd4be [zhichao.li] add function translate
2015-08-06 09:02:30 -07:00
Yijie Shen 8c320e45b5 [SPARK-6591] [SQL] Python data source load options should auto convert common types into strings
JIRA: https://issues.apache.org/jira/browse/SPARK-6591

Author: Yijie Shen <henry.yijieshen@gmail.com>

Closes #7926 from yjshen/py_dsload_opt and squashes the following commits:

b207832 [Yijie Shen] fix style
efdf834 [Yijie Shen] resolve comment
7a8f6a2 [Yijie Shen] lowercase
822e769 [Yijie Shen] convert load opts to string
2015-08-05 17:28:23 -07:00
Mike Dusenberry 34dcf10104 [SPARK-6486] [MLLIB] [PYTHON] Add BlockMatrix to PySpark.
mengxr This adds the `BlockMatrix` to PySpark.  I have the conversions to `IndexedRowMatrix` and `CoordinateMatrix` ready as well, so once PR #7554 is completed (which relies on PR #7746), this PR can be finished.

Author: Mike Dusenberry <mwdusenb@us.ibm.com>

Closes #7761 from dusenberrymw/SPARK-6486_Add_BlockMatrix_to_PySpark and squashes the following commits:

27195c2 [Mike Dusenberry] Adding one more check to _convert_to_matrix_block_tuple, and a few minor documentation changes.
ae50883 [Mike Dusenberry] Minor update: BlockMatrix should inherit from DistributedMatrix.
b8acc1c [Mike Dusenberry] Moving BlockMatrix to pyspark.mllib.linalg.distributed, updating the logic to match that of the other distributed matrices, adding conversions, and adding documentation.
c014002 [Mike Dusenberry] Using properties for better documentation.
3bda6ab [Mike Dusenberry] Adding documentation.
8fb3095 [Mike Dusenberry] Small cleanup.
e17af2e [Mike Dusenberry] Adding BlockMatrix to PySpark.
2015-08-05 07:40:50 -07:00
Pedro Rodriguez d34548587a [SPARK-8231] [SQL] Add array_contains
This PR is based on #7580 , thanks to EntilZha

PR for work on https://issues.apache.org/jira/browse/SPARK-8231

Currently, I have an initial implementation for contains. Based on discussion on JIRA, it should behave same as Hive: https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/udf/generic/GenericUDFArrayContains.java#L102-L128

Main points are:
1. If the array is empty, null, or the value is null, return false
2. If there is a type mismatch, throw error
3. If comparison is not supported, throw error

Closes #7580

Author: Pedro Rodriguez <prodriguez@trulia.com>
Author: Pedro Rodriguez <ski.rodriguez@gmail.com>
Author: Davies Liu <davies@databricks.com>

Closes #7949 from davies/array_contains and squashes the following commits:

d3c08bc [Davies Liu] use foreach() to avoid copy
bc3d1fe [Davies Liu] fix array_contains
719e37d [Davies Liu] Merge branch 'master' of github.com:apache/spark into array_contains
e352cf9 [Pedro Rodriguez] fixed diff from master
4d5b0ff [Pedro Rodriguez] added docs and another type check
ffc0591 [Pedro Rodriguez] fixed unit test
7a22deb [Pedro Rodriguez] Changed test to use strings instead of long/ints which are different between python 2 an 3
b5ffae8 [Pedro Rodriguez] fixed pyspark test
4e7dce3 [Pedro Rodriguez] added more docs
3082399 [Pedro Rodriguez] fixed unit test
46f9789 [Pedro Rodriguez] reverted change
d3ca013 [Pedro Rodriguez] Fixed type checking to match hive behavior, then added tests to insure this
8528027 [Pedro Rodriguez] added more tests
686e029 [Pedro Rodriguez] fix scala style
d262e9d [Pedro Rodriguez] reworked type checking code and added more tests
2517a58 [Pedro Rodriguez] removed unused import
28b4f71 [Pedro Rodriguez] fixed bug with type conversions and re-added tests
12f8795 [Pedro Rodriguez] fix scala style checks
e8a20a9 [Pedro Rodriguez] added python df (broken atm)
65b562c [Pedro Rodriguez] made array_contains nullable false
33b45aa [Pedro Rodriguez] reordered test
9623c64 [Pedro Rodriguez] fixed test
4b4425b [Pedro Rodriguez] changed Arrays in tests to Seqs
72cb4b1 [Pedro Rodriguez] added checkInputTypes and docs
69c46fb [Pedro Rodriguez] added tests and codegen
9e0bfc4 [Pedro Rodriguez] initial attempt at implementation
2015-08-04 22:34:02 -07:00
Davies Liu 2b67fdb60b [SPARK-9513] [SQL] [PySpark] Add python API for DataFrame functions
This adds Python API for those DataFrame functions that is introduced in 1.5.

There is issue with serialize byte_array in Python 3, so some of functions (for BinaryType) does not have tests.

cc rxin

Author: Davies Liu <davies@databricks.com>

Closes #7922 from davies/python_functions and squashes the following commits:

8ad942f [Davies Liu] fix test
5fb6ec3 [Davies Liu] fix bugs
3495ed3 [Davies Liu] fix issues
ea5f7bb [Davies Liu] Add python API for DataFrame functions
2015-08-04 19:25:24 -07:00
Mike Dusenberry 571d5b5363 [SPARK-6485] [MLLIB] [PYTHON] Add CoordinateMatrix/RowMatrix/IndexedRowMatrix to PySpark.
This PR adds the RowMatrix, IndexedRowMatrix, and CoordinateMatrix distributed matrices to PySpark.  Each distributed matrix class acts as a wrapper around the Scala/Java counterpart by maintaining a reference to the Java object.  New distributed matrices can be created using factory methods added to DistributedMatrices, which creates the Java distributed matrix and then wraps it with the corresponding PySpark class.  This design allows for simple conversion between the various distributed matrices, and lets us re-use the Scala code.  Serialization between Python and Java is implemented using DataFrames as needed for IndexedRowMatrix and CoordinateMatrix for simplicity.  Associated documentation and unit-tests have also been added.  To facilitate code review, this PR implements access to the rows/entries as RDDs, the number of rows & columns, and conversions between the various distributed matrices (not including BlockMatrix), and does not implement the other linear algebra functions of the matrices, although this will be very simple to add now.

Author: Mike Dusenberry <mwdusenb@us.ibm.com>

Closes #7554 from dusenberrymw/SPARK-6485_Add_CoordinateMatrix_RowMatrix_IndexedMatrix_to_PySpark and squashes the following commits:

bb039cb [Mike Dusenberry] Minor documentation update.
b887c18 [Mike Dusenberry] Updating the matrix conversion logic again to make it even cleaner.  Now, we allow the 'rows' parameter in the constructors to be either an RDD or the Java matrix object. If 'rows' is an RDD, we create a Java matrix object, wrap it, and then store that.  If 'rows' is a Java matrix object of the correct type, we just wrap and store that directly.  This is only for internal usage, and publicly, we still require 'rows' to be an RDD.  We no longer store the 'rows' RDD, and instead just compute it from the Java object when needed.  The point of this is that when we do matrix conversions, we do the conversion on the Scala/Java side, which returns a Java object, so we should use that directly, but exposing 'java_matrix' parameter in the public API is not ideal. This non-public feature of allowing 'rows' to be a Java matrix object is documented in the '__init__' constructor docstrings, which are not part of the generated public API, and doctests are also included.
7f0dcb6 [Mike Dusenberry] Updating module docstring.
cfc1be5 [Mike Dusenberry] Use 'new SQLContext(matrix.rows.sparkContext)' rather than 'SQLContext.getOrCreate', as the later doesn't guarantee that the SparkContext will be the same as for the matrix.rows data.
687e345 [Mike Dusenberry] Improving conversion performance.  This adds an optional 'java_matrix' parameter to the constructors, and pulls the conversion logic out into a '_create_from_java' function. Now, if the constructors are given a valid Java distributed matrix object as 'java_matrix', they will store those internally, rather than create a new one on the Scala/Java side.
3e50b6e [Mike Dusenberry] Moving the distributed matrices to pyspark.mllib.linalg.distributed.
308f197 [Mike Dusenberry] Using properties for better documentation.
1633f86 [Mike Dusenberry] Minor documentation cleanup.
f0c13a7 [Mike Dusenberry] CoordinateMatrix should inherit from DistributedMatrix.
ffdd724 [Mike Dusenberry] Updating doctests to make documentation cleaner.
3fd4016 [Mike Dusenberry] Updating docstrings.
27cd5f6 [Mike Dusenberry] Simplifying input conversions in the constructors for each distributed matrix.
a409cf5 [Mike Dusenberry] Updating doctests to be less verbose by using lists instead of DenseVectors explicitly.
d19b0ba [Mike Dusenberry] Updating code and documentation to note that a vector-like object (numpy array, list, etc.) can be used in place of explicit Vector object, and adding conversions when necessary to RowMatrix construction.
4bd756d [Mike Dusenberry] Adding param documentation to IndexedRow and MatrixEntry.
c6bded5 [Mike Dusenberry] Move conversion logic from tuples to IndexedRow or MatrixEntry types from within the IndexedRowMatrix and CoordinateMatrix constructors to separate _convert_to_indexed_row and _convert_to_matrix_entry functions.
329638b [Mike Dusenberry] Moving the Experimental tag to the top of each docstring.
0be6826 [Mike Dusenberry] Simplifying doctests by removing duplicated rows/entries RDDs within the various tests.
c0900df [Mike Dusenberry] Adding the colons that were accidentally not inserted.
4ad6819 [Mike Dusenberry] Documenting the  and  parameters.
3b854b9 [Mike Dusenberry] Minor updates to documentation.
10046e8 [Mike Dusenberry] Updating documentation to use class constructors instead of the removed DistributedMatrices factory methods.
119018d [Mike Dusenberry] Adding static  methods to each of the distributed matrix classes to consolidate conversion logic.
4d7af86 [Mike Dusenberry] Adding type checks to the constructors.  Although it is slightly verbose, it is better for the user to have a good error message than a cryptic stacktrace.
93b6a3d [Mike Dusenberry] Pulling the DistributedMatrices Python class out of this pull request.
f6f3c68 [Mike Dusenberry] Pulling the DistributedMatrices Scala class out of this pull request.
6a3ecb7 [Mike Dusenberry] Updating pattern matching.
08f287b [Mike Dusenberry] Slight reformatting of the documentation.
a245dc0 [Mike Dusenberry] Updating Python doctests for compatability between Python 2 & 3. Since Python 3 removed the idea of a separate 'long' type, all values that would have been outputted as a 'long' (ex: '4L') will now be treated as an 'int' and outputed as one (ex: '4').  The doctests now explicitly convert to ints so that both Python 2 and 3 will have the same output.  This is fine since the values are all small, and thus can be easily represented as ints.
4d3a37e [Mike Dusenberry] Reformatting a few long Python doctest lines.
7e3ca16 [Mike Dusenberry] Fixing long lines.
f721ead [Mike Dusenberry] Updating documentation for each of the distributed matrices.
ab0e8b6 [Mike Dusenberry] Updating unit test to be more useful.
dda2f89 [Mike Dusenberry] Added wrappers for the conversions between the various distributed matrices.  Added logic to be able to access the rows/entries of the distributed matrices, which requires serialization through DataFrames for IndexedRowMatrix and CoordinateMatrix types. Added unit tests.
0cd7166 [Mike Dusenberry] Implemented the CoordinateMatrix API in PySpark, following the idea of the IndexedRowMatrix API, including using DataFrames for serialization.
3c369cb [Mike Dusenberry] Updating the architecture a bit to make conversions between the various distributed matrix types easier.  The different distributed matrix classes are now only wrappers around the Java objects, and take the Java object as an argument during construction.  This way, we can call  for example on an , which returns a reference to a Java RowMatrix object, and then construct a PySpark RowMatrix object wrapped around the Java object.  This is analogous to the behavior of PySpark RDDs and DataFrames.  We now delegate creation of the various distributed matrices from scratch in PySpark to the factory methods on .
4bdd09b [Mike Dusenberry] Implemented the IndexedRowMatrix API in PySpark, following the idea of the RowMatrix API.  Note that for the IndexedRowMatrix, we use DataFrames to serialize the data between Python and Scala/Java, so we accept PySpark RDDs, then convert to a DataFrame, then convert back to RDDs on the Scala/Java side before constructing the IndexedRowMatrix.
23bf1ec [Mike Dusenberry] Updating documentation to add PySpark RowMatrix. Inserting newline above doctest so that it renders properly in API docs.
b194623 [Mike Dusenberry] Updating design to have a PySpark RowMatrix simply create and keep a reference to a wrapper over a Java RowMatrix.  Updating DistributedMatrices factory methods to accept numRows and numCols with default values.  Updating PySpark DistributedMatrices factory method to simply create a PySpark RowMatrix. Adding additional doctests for numRows and numCols parameters.
bc2d220 [Mike Dusenberry] Adding unit tests for RowMatrix methods.
d7e316f [Mike Dusenberry] Implemented the RowMatrix API in PySpark by doing the following: Added a DistributedMatrices class to contain factory methods for creating the various distributed matrices.  Added a factory method for creating a RowMatrix from an RDD of Vectors.  Added a createRowMatrix function to the PythonMLlibAPI to interface with the factory method.  Added DistributedMatrix, DistributedMatrices, and RowMatrix classes to the pyspark.mllib.linalg api.
2015-08-04 16:30:03 -07:00
Joseph K. Bradley e375456063 [SPARK-9447] [ML] [PYTHON] Added HasRawPredictionCol, HasProbabilityCol to RandomForestClassifier
Added HasRawPredictionCol, HasProbabilityCol to RandomForestClassifier, plus doc tests for those columns.

CC: holdenk yanboliang

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #7903 from jkbradley/rf-prob-python and squashes the following commits:

c62a83f [Joseph K. Bradley] made unit test more robust
14eeba2 [Joseph K. Bradley] added HasRawPredictionCol, HasProbabilityCol to RandomForestClassifier in PySpark
2015-08-04 14:54:26 -07:00
Holden Karau 5a23213c14 [SPARK-8069] [ML] Add multiclass thresholds for ProbabilisticClassifier
This PR replaces the old "threshold" with a generalized "thresholds" Param.  We keep getThreshold,setThreshold for backwards compatibility for binary classification.

Note that the primary author of this PR is holdenk

Author: Holden Karau <holden@pigscanfly.ca>
Author: Joseph K. Bradley <joseph@databricks.com>

Closes #7909 from jkbradley/holdenk-SPARK-8069-add-cutoff-aka-threshold-to-random-forest and squashes the following commits:

3952977 [Joseph K. Bradley] fixed pyspark doc test
85febc8 [Joseph K. Bradley] made python unit tests a little more robust
7eb1d86 [Joseph K. Bradley] small cleanups
6cc2ed8 [Joseph K. Bradley] Fixed remaining merge issues.
0255e44 [Joseph K. Bradley] Many cleanups for thresholds, some more tests
7565a60 [Holden Karau] fix pep8 style checks, add a getThreshold method similar to our LogisticRegression.scala one for API compat
be87f26 [Holden Karau] Convert threshold to thresholds in the python code, add specialized support for Array[Double] to shared parems codegen, etc.
6747dad [Holden Karau] Override raw2prediction for ProbabilisticClassifier, fix some tests
25df168 [Holden Karau] Fix handling of thresholds in LogisticRegression
c02d6c0 [Holden Karau] No default for thresholds
5e43628 [Holden Karau] CR feedback and fixed the renamed test
f3fbbd1 [Holden Karau] revert the changes to random forest :(
51f581c [Holden Karau] Add explicit types to public methods, fix long line
f7032eb [Holden Karau] Fix a java test bug, remove some unecessary changes
adf15b4 [Holden Karau] rename the classifier suite test to ProbabilisticClassifierSuite now that we only have it in Probabilistic
398078a [Holden Karau] move the thresholding around a bunch based on the design doc
4893bdc [Holden Karau] Use numtrees of 3 since previous result was tied (one tree for each) and the switch from different max methods picked a different element (since they were equal I think this is ok)
638854c [Holden Karau] Add a scala RandomForestClassifierSuite test based on corresponding python test
e09919c [Holden Karau] Fix return type, I need more coffee....
8d92cac [Holden Karau] Use ClassifierParams as the head
3456ed3 [Holden Karau] Add explicit return types even though just test
a0f3b0c [Holden Karau] scala style fixes
6f14314 [Holden Karau] Since hasthreshold/hasthresholds is in root classifier now
ffc8dab [Holden Karau] Update the sharedParams
0420290 [Holden Karau] Allow us to override the get methods selectively
978e77a [Holden Karau] Move HasThreshold into classifier params and start defining the overloaded getThreshold/getThresholds functions
1433e52 [Holden Karau] Revert "try and hide threshold but chainges the API so no dice there"
1f09a2e [Holden Karau] try and hide threshold but chainges the API so no dice there
efb9084 [Holden Karau] move setThresholds only to where its used
6b34809 [Holden Karau] Add a test with thresholding for the RFCS
74f54c3 [Holden Karau] Fix creation of vote array
1986fa8 [Holden Karau] Setting the thresholds only makes sense if the underlying class hasn't overridden predict, so lets push it down.
2f44b18 [Holden Karau] Add a global default of null for thresholds param
f338cfc [Holden Karau] Wait that wasn't a good idea, Revert "Some progress towards unifying threshold and thresholds"
634b06f [Holden Karau] Some progress towards unifying threshold and thresholds
85c9e01 [Holden Karau] Test passes again... little fnur
099c0f3 [Holden Karau] Move thresholds around some more (set on model not trainer)
0f46836 [Holden Karau] Start adding a classifiersuite
f70eb5e [Holden Karau] Fix test compile issues
a7d59c8 [Holden Karau] Move thresholding into Classifier trait
5d999d2 [Holden Karau] Some more progress, start adding a test (maybe try and see if we can find a better thing to use for the base of the test)
1fed644 [Holden Karau] Use thresholds to scale scores in random forest classifcation
31d6bf2 [Holden Karau] Start threading the threshold info through
0ef228c [Holden Karau] Add hasthresholds
2015-08-04 10:12:22 -07:00
Xiangrui Meng e4765a4683 [SPARK-9544] [MLLIB] add Python API for RFormula
Add Python API for RFormula. Similar to other feature transformers in Python. This is just a thin wrapper over the Scala implementation. ericl MechCoder

Author: Xiangrui Meng <meng@databricks.com>

Closes #7879 from mengxr/SPARK-9544 and squashes the following commits:

3d5ff03 [Xiangrui Meng] add an doctest for . and -
5e969a5 [Xiangrui Meng] fix pydoc
1cd41f8 [Xiangrui Meng] organize imports
3c18b10 [Xiangrui Meng] add Python API for RFormula
2015-08-03 13:59:35 -07:00
Yanbo Liang 4cdd8ecd66 [SPARK-9536] [SPARK-9537] [SPARK-9538] [ML] [PYSPARK] ml.classification support raw and probability prediction for PySpark
Make the following ml.classification class support raw and probability prediction for PySpark:
```scala
NaiveBayesModel
DecisionTreeClassifierModel
LogisticRegressionModel
```

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #7866 from yanboliang/spark-9536-9537 and squashes the following commits:

2934dab [Yanbo Liang] ml.NaiveBayes, ml.DecisionTreeClassifier and ml.LogisticRegression support probability prediction
2015-08-02 22:19:27 -07:00
HuJiayin 00cd92f32f [SPARK-8269] [SQL] string function: initcap
This PR is based on #7208 , thanks to HuJiayin

Closes #7208

Author: HuJiayin <jiayin.hu@intel.com>
Author: Davies Liu <davies@databricks.com>

Closes #7850 from davies/initcap and squashes the following commits:

54472e9 [Davies Liu] fix python test
17ffe51 [Davies Liu] Merge branch 'master' of github.com:apache/spark into initcap
ca46390 [Davies Liu] Merge branch 'master' of github.com:apache/spark into initcap
3a906e4 [Davies Liu] implement title case in UTF8String
8b2506a [HuJiayin] Update functions.py
2cd43e5 [HuJiayin] fix python style check
b616c0e [HuJiayin] add python api
1f5a0ef [HuJiayin] add codegen
7e0c604 [HuJiayin] Merge branch 'master' of https://github.com/apache/spark into initcap
6a0b958 [HuJiayin] add column
c79482d [HuJiayin] support soundex
7ce416b [HuJiayin] support initcap rebase code
2015-08-01 21:44:57 -07:00
zhichao.li c5166f7a69 [SPARK-8263] [SQL] substr/substring should also support binary type
This is based on #7641, thanks to zhichao-li

Closes #7641

Author: zhichao.li <zhichao.li@intel.com>
Author: Davies Liu <davies@databricks.com>

Closes #7848 from davies/substr and squashes the following commits:

461b709 [Davies Liu] remove bytearry from tests
b45377a [Davies Liu] Merge branch 'master' of github.com:apache/spark into substr
01d795e [zhichao.li] scala style
99aa130 [zhichao.li] add substring to dataframe
4f68bfe [zhichao.li] add binary type support for substring
2015-08-01 08:48:46 -07:00
Cheng Hao cf6c9ca32a [SPARK-8232] [SQL] Add sort_array support
This PR is based on #7581 , just fix the conflict.

Author: Cheng Hao <hao.cheng@intel.com>
Author: Davies Liu <davies@databricks.com>

Closes #7851 from davies/sort_array and squashes the following commits:

a80ef66 [Davies Liu] fix conflict
7cfda65 [Davies Liu] Merge branch 'master' of github.com:apache/spark into sort_array
664c960 [Cheng Hao] update the sort_array by using the ArrayData
276d2d5 [Cheng Hao] add empty line
0edab9c [Cheng Hao] Add asending/descending support for sort_array
80fc0f8 [Cheng Hao] Add type checking
a42b678 [Cheng Hao] Add sort_array support
2015-08-01 08:32:29 -07:00
Davies Liu 60ea7ab4bb Revert "[SPARK-8232] [SQL] Add sort_array support"
This reverts commit 67ad4e21fc.
2015-08-01 00:41:15 -07:00
Cheng Hao 67ad4e21fc [SPARK-8232] [SQL] Add sort_array support
Add expression `sort_array` support.

Author: Cheng Hao <hao.cheng@intel.com>

This patch had conflicts when merged, resolved by
Committer: Davies Liu <davies.liu@gmail.com>

Closes #7581 from chenghao-intel/sort_array and squashes the following commits:

664c960 [Cheng Hao] update the sort_array by using the ArrayData
276d2d5 [Cheng Hao] add empty line
0edab9c [Cheng Hao] Add asending/descending support for sort_array
80fc0f8 [Cheng Hao] Add type checking
a42b678 [Cheng Hao] Add sort_array support
2015-07-31 23:11:22 -07:00
zhichao.li 6996bd2e81 [SPARK-8264][SQL]add substring_index function
This PR is based on #7533 , thanks to zhichao-li

Closes #7533

Author: zhichao.li <zhichao.li@intel.com>
Author: Davies Liu <davies@databricks.com>

Closes #7843 from davies/str_index and squashes the following commits:

391347b [Davies Liu] add python api
3ce7802 [Davies Liu] fix substringIndex
f2d29a1 [Davies Liu] Merge branch 'master' of github.com:apache/spark into str_index
515519b [zhichao.li] add foldable and remove null checking
9546991 [zhichao.li] scala style
67c253a [zhichao.li] hide some apis and clean code
b19b013 [zhichao.li] add codegen and clean code
ac863e9 [zhichao.li] reduce the calling of numChars
12e108f [zhichao.li] refine unittest
d92951b [zhichao.li] add lastIndexOf
52d7b03 [zhichao.li] add substring_index function
2015-07-31 21:18:01 -07:00
HuJiayin 4d5a6e7b60 [SPARK-8271][SQL]string function: soundex
This PR brings SQL function soundex(), see https://issues.apache.org/jira/browse/HIVE-9738

It's based on #7115 , thanks to HuJiayin

Author: HuJiayin <jiayin.hu@intel.com>
Author: Davies Liu <davies@databricks.com>

Closes #7812 from davies/soundex and squashes the following commits:

fa75941 [Davies Liu] Merge branch 'master' of github.com:apache/spark into soundex
a4bd6d8 [Davies Liu] fix soundex
2538908 [HuJiayin] add codegen soundex
d15d329 [HuJiayin] add back ut
ded1a14 [HuJiayin] Merge branch 'master' of https://github.com/apache/spark
e2dec2c [HuJiayin] support soundex rebase code
2015-07-31 16:05:26 -07:00
zsxwing 3afc1de89c [SPARK-8564] [STREAMING] Add the Python API for Kinesis
This PR adds the Python API for Kinesis, including a Python example and a simple unit test.

Author: zsxwing <zsxwing@gmail.com>

Closes #6955 from zsxwing/kinesis-python and squashes the following commits:

e42e471 [zsxwing] Merge branch 'master' into kinesis-python
455f7ea [zsxwing] Remove streaming_kinesis_asl_assembly module and simply add the source folder to streaming_kinesis_asl module
32e6451 [zsxwing] Merge remote-tracking branch 'origin/master' into kinesis-python
5082d28 [zsxwing] Fix the syntax error for Python 2.6
fca416b [zsxwing] Fix wrong comparison
96670ff [zsxwing] Fix the compilation error after merging master
756a128 [zsxwing] Merge branch 'master' into kinesis-python
6c37395 [zsxwing] Print stack trace for debug
7c5cfb0 [zsxwing] RUN_KINESIS_TESTS -> ENABLE_KINESIS_TESTS
cc9d071 [zsxwing] Fix the python test errors
466b425 [zsxwing] Add python tests for Kinesis
e33d505 [zsxwing] Merge remote-tracking branch 'origin/master' into kinesis-python
3da2601 [zsxwing] Fix the kinesis folder
687446b [zsxwing] Fix the error message and the maven output path
add2beb [zsxwing] Merge branch 'master' into kinesis-python
4957c0b [zsxwing] Add the Python API for Kinesis
2015-07-31 12:09:48 -07:00
Yanbo Liang 69b62f76fc [SPARK-9214] [ML] [PySpark] support ml.NaiveBayes for Python
support ml.NaiveBayes for Python

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #7568 from yanboliang/spark-9214 and squashes the following commits:

5ee3fd6 [Yanbo Liang] fix typos
3ecd046 [Yanbo Liang] fix typos
f9c94d1 [Yanbo Liang] change lambda_ to smoothing and fix other issues
180452a [Yanbo Liang] fix typos
7dda1f4 [Yanbo Liang] support ml.NaiveBayes for Python
2015-07-30 23:03:48 -07:00
Ram Sriharsha 4e5919bfb4 [SPARK-7690] [ML] Multiclass classification Evaluator
Multiclass Classification Evaluator for ML Pipelines. F1 score, precision, recall, weighted precision and weighted recall are supported as available metrics.

Author: Ram Sriharsha <rsriharsha@hw11853.local>

Closes #7475 from harsha2010/SPARK-7690 and squashes the following commits:

9bf4ec7 [Ram Sriharsha] fix indentation
3f09a85 [Ram Sriharsha] cleanup doc
16115ae [Ram Sriharsha] code review fixes
032d2a3 [Ram Sriharsha] fix test
eec9865 [Ram Sriharsha] Fix Python Indentation
1dbeffd [Ram Sriharsha] Merge branch 'master' into SPARK-7690
68cea85 [Ram Sriharsha] Merge branch 'master' into SPARK-7690
54c03de [Ram Sriharsha] [SPARK-7690][ml][WIP] Multiclass Evaluator for ML Pipeline
2015-07-30 23:02:11 -07:00
Daoyuan Wang 83670fc9e6 [SPARK-8176] [SPARK-8197] [SQL] function to_date/ trunc
This PR is based on #6988 , thanks to adrian-wang .

This brings two SQL functions: to_date() and trunc().

Closes #6988

Author: Daoyuan Wang <daoyuan.wang@intel.com>
Author: Davies Liu <davies@databricks.com>

Closes #7805 from davies/to_date and squashes the following commits:

2c7beba [Davies Liu] Merge branch 'master' of github.com:apache/spark into to_date
310dd55 [Daoyuan Wang] remove dup test in rebase
980b092 [Daoyuan Wang] resolve rebase conflict
a476c5a [Daoyuan Wang] address comments from davies
d44ea5f [Daoyuan Wang] function to_date, trunc
2015-07-30 19:22:38 -07:00
Xiangrui Meng df32669514 [SPARK-7157][SQL] add sampleBy to DataFrame
This was previously committed but then reverted due to test failures (see #6769).

Author: Xiangrui Meng <meng@databricks.com>

Closes #7755 from rxin/SPARK-7157 and squashes the following commits:

fbf9044 [Xiangrui Meng] fix python test
542bd37 [Xiangrui Meng] update test
604fe6d [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-7157
f051afd [Xiangrui Meng] use udf instead of building expression
f4e9425 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-7157
8fb990b [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-7157
103beb3 [Xiangrui Meng] add Java-friendly sampleBy
991f26f [Xiangrui Meng] fix seed
4a14834 [Xiangrui Meng] move sampleBy to stat
832f7cc [Xiangrui Meng] add sampleBy to DataFrame
2015-07-30 17:16:03 -07:00
Xiangrui Meng ca71cc8c8b [SPARK-9408] [PYSPARK] [MLLIB] Refactor linalg.py to /linalg
This is based on MechCoder 's PR https://github.com/apache/spark/pull/7731. Hopefully it could pass tests. MechCoder I tried to make minimal changes. If this passes Jenkins, we can merge this one first and then try to move `__init__.py` to `local.py` in a separate PR.

Closes #7731

Author: Xiangrui Meng <meng@databricks.com>

Closes #7746 from mengxr/SPARK-9408 and squashes the following commits:

0e05a3b [Xiangrui Meng] merge master
1135551 [Xiangrui Meng] add a comment for str(...)
c48cae0 [Xiangrui Meng] update tests
173a805 [Xiangrui Meng] move linalg.py to linalg/__init__.py
2015-07-30 16:57:38 -07:00
Daoyuan Wang 1abf7dc16c [SPARK-8186] [SPARK-8187] [SPARK-8194] [SPARK-8198] [SPARK-9133] [SPARK-9290] [SQL] functions: date_add, date_sub, add_months, months_between, time-interval calculation
This PR is based on #7589 , thanks to adrian-wang

Added SQL function date_add, date_sub, add_months, month_between, also add a rule for
add/subtract of date/timestamp and interval.

Closes #7589

cc rxin

Author: Daoyuan Wang <daoyuan.wang@intel.com>
Author: Davies Liu <davies@databricks.com>

Closes #7754 from davies/date_add and squashes the following commits:

e8c633a [Davies Liu] Merge branch 'master' of github.com:apache/spark into date_add
9e8e085 [Davies Liu] Merge branch 'master' of github.com:apache/spark into date_add
6224ce4 [Davies Liu] fix conclict
bd18cd4 [Davies Liu] Merge branch 'master' of github.com:apache/spark into date_add
e47ff2c [Davies Liu] add python api, fix date functions
01943d0 [Davies Liu] Merge branch 'master' into date_add
522e91a [Daoyuan Wang] fix
e8a639a [Daoyuan Wang] fix
42df486 [Daoyuan Wang] fix style
87c4b77 [Daoyuan Wang] function add_months, months_between and some fixes
1a68e03 [Daoyuan Wang] poc of time interval calculation
c506661 [Daoyuan Wang] function date_add , date_sub
2015-07-30 13:21:46 -07:00
Josh Rosen 520ec0ff9d [SPARK-8850] [SQL] Enable Unsafe mode by default
This pull request enables Unsafe mode by default in Spark SQL. In order to do this, we had to fix a number of small issues:

**List of fixed blockers**:

- [x] Make some default buffer sizes configurable so that HiveCompatibilitySuite can run properly (#7741).
- [x] Memory leak on grouped aggregation of empty input (fixed by #7560 to fix this)
- [x] Update planner to also check whether codegen is enabled before planning unsafe operators.
- [x] Investigate failing HiveThriftBinaryServerSuite test.  This turns out to be caused by a ClassCastException that occurs when Exchange tries to apply an interpreted RowOrdering to an UnsafeRow when range partitioning an RDD.  This could be fixed by #7408, but a shorter-term fix is to just skip the Unsafe exchange path when RangePartitioner is used.
- [x] Memory leak exceptions masking exceptions that actually caused tasks to fail (will be fixed by #7603).
- [x]  ~~https://issues.apache.org/jira/browse/SPARK-9162, to implement code generation for ScalaUDF.  This is necessary for `UDFSuite` to pass.  For now, I've just ignored this test in order to try to find other problems while we wait for a fix.~~ This is no longer necessary as of #7682.
- [x] Memory leaks from Limit after UnsafeExternalSort cause the memory leak detector to fail tests. This is a huge problem in the HiveCompatibilitySuite (fixed by f4ac642a4e5b2a7931c5e04e086bb10e263b1db6).
- [x] Tests in `AggregationQuerySuite` are failing due to NaN-handling issues in UnsafeRow, which were fixed in #7736.
- [x] `org.apache.spark.sql.ColumnExpressionSuite.rand` needs to be updated so that the planner check also matches `TungstenProject`.
- [x] After having lowered the buffer sizes to 4MB so that most of HiveCompatibilitySuite runs:
  - [x] Wrong answer in `join_1to1` (fixed by #7680)
  - [x] Wrong answer in `join_nulls` (fixed by #7680)
  - [x] Managed memory OOM / leak in `lateral_view`
  - [x] Seems to hang indefinitely in `partcols1`.  This might be a deadlock in script transformation or a bug in error-handling code? The hang was fixed by #7710.
  - [x] Error while freeing memory in `partcols1`: will be fixed by #7734.
- [x] After fixing the `partcols1` hang, it appears that a number of later tests have issues as well.
- [x] Fix thread-safety bug in codegen fallback expression evaluation (#7759).

Author: Josh Rosen <joshrosen@databricks.com>

Closes #7564 from JoshRosen/unsafe-by-default and squashes the following commits:

83c0c56 [Josh Rosen] Merge remote-tracking branch 'origin/master' into unsafe-by-default
f4cc859 [Josh Rosen] Merge remote-tracking branch 'origin/master' into unsafe-by-default
963f567 [Josh Rosen] Reduce buffer size for R tests
d6986de [Josh Rosen] Lower page size in PySpark tests
013b9da [Josh Rosen] Also match TungstenProject in checkNumProjects
5d0b2d3 [Josh Rosen] Add task completion callback to avoid leak in limit after sort
ea250da [Josh Rosen] Disable unsafe Exchange path when RangePartitioning is used
715517b [Josh Rosen] Enable Unsafe by default
2015-07-30 10:45:32 -07:00
Xiangrui Meng 81464f2a82 [MINOR] [MLLIB] fix doc for RegexTokenizer
This is #7791 for Python. hhbyyh

Author: Xiangrui Meng <meng@databricks.com>

Closes #7798 from mengxr/regex-tok-py and squashes the following commits:

baa2dcd [Xiangrui Meng] fix doc for RegexTokenizer
2015-07-30 09:45:17 -07:00
Davies Liu e044705b44 [SPARK-9116] [SQL] [PYSPARK] support Python only UDT in __main__
Also we could create a Python UDT without having a Scala one, it's important for Python users.

cc mengxr JoshRosen

Author: Davies Liu <davies@databricks.com>

Closes #7453 from davies/class_in_main and squashes the following commits:

4dfd5e1 [Davies Liu] add tests for Python and Scala UDT
793d9b2 [Davies Liu] Merge branch 'master' of github.com:apache/spark into class_in_main
dc65f19 [Davies Liu] address comment
a9a3c40 [Davies Liu] Merge branch 'master' of github.com:apache/spark into class_in_main
a86e1fc [Davies Liu] fix serialization
ad528ba [Davies Liu] Merge branch 'master' of github.com:apache/spark into class_in_main
63f52ef [Davies Liu] fix pylint check
655b8a9 [Davies Liu] Merge branch 'master' of github.com:apache/spark into class_in_main
316a394 [Davies Liu] support Python UDT with UTF
0bcb3ef [Davies Liu] fix bug in mllib
de986d6 [Davies Liu] fix test
83d65ac [Davies Liu] fix bug in StructType
55bb86e [Davies Liu] support Python UDT in __main__ (without Scala one)
2015-07-29 22:30:49 -07:00
Alex Angelini f5dd11339f Fix reference to self.names in StructType
`names` is not defined in this context, I think you meant `self.names`.

davies

Author: Alex Angelini <alex.louis.angelini@gmail.com>

Closes #7766 from angelini/fix_struct_type_names and squashes the following commits:

01543a1 [Alex Angelini] Fix reference to self.names in StructType
2015-07-29 22:25:38 -07:00
Holden Karau 37c2d1927c [SPARK-9016] [ML] make random forest classifiers implement classification trait
Implement the classification trait for RandomForestClassifiers. The plan is to use this in the future to providing thresholding for RandomForestClassifiers (as well as other classifiers that implement that trait).

Author: Holden Karau <holden@pigscanfly.ca>

Closes #7432 from holdenk/SPARK-9016-make-random-forest-classifiers-implement-classification-trait and squashes the following commits:

bf22fa6 [Holden Karau] Add missing imports for testing suite
e948f0d [Holden Karau] Check the prediction generation from rawprediciton
25320c3 [Holden Karau] Don't supply numClasses when not needed, assert model classes are as expected
1a67e04 [Holden Karau] Use old decission tree stuff instead
673e0c3 [Holden Karau] Merge branch 'master' into SPARK-9016-make-random-forest-classifiers-implement-classification-trait
0d15b96 [Holden Karau] FIx typo
5eafad4 [Holden Karau] add a constructor for rootnode + num classes
fc6156f [Holden Karau] scala style fix
2597915 [Holden Karau] take num classes in constructor
3ccfe4a [Holden Karau] Merge in master, make pass numClasses through randomforest for training
222a10b [Holden Karau] Increase numtrees to 3 in the python test since before the two were equal and the argmax was selecting the last one
16aea1c [Holden Karau] Make tests match the new models
b454a02 [Holden Karau] Make the Tree classifiers extends the Classifier base class
77b4114 [Holden Karau] Import vectors lib
2015-07-29 18:18:29 -07:00
MechCoder 198d181dfb [SPARK-7105] [PYSPARK] [MLLIB] Support model save/load in GMM
This PR introduces save / load for GMM's in python API.

Also I refactored `GaussianMixtureModel` and inherited it from `JavaModelWrapper` with model being `GaussianMixtureModelWrapper`, a wrapper which provides convenience methods to `GaussianMixtureModel` (due to serialization and deserialization issues) and I moved the creation of gaussians to the scala backend.

Author: MechCoder <manojkumarsivaraj334@gmail.com>

Closes #7617 from MechCoder/python_gmm_save_load and squashes the following commits:

9c305aa [MechCoder] [SPARK-7105] [PySpark] [MLlib] Support model save/load in GMM
2015-07-28 15:00:25 -07:00
JD 723db13e06 [Spark-8668][SQL] Adding expr to functions
Author: JD <jd@csh.rit.edu>
Author: Joseph Batchik <josephbatchik@gmail.com>

Closes #7606 from JDrit/expr and squashes the following commits:

ad7f607 [Joseph Batchik] fixing python linter error
9d6daea [Joseph Batchik] removed order by per @rxin's comment
707d5c6 [Joseph Batchik] Added expr to fuctions.py
79df83c [JD] added example to the docs
b89eec8 [JD] moved function up as per @rxin's comment
4960909 [JD] updated per @JoshRosen's comment
2cb329c [JD] updated per @rxin's comment
9a9ad0c [JD] removing unused import
6dc26d0 [JD] removed split
7f2222c [JD] Adding expr function as per SPARK-8668
2015-07-25 00:34:59 -07:00
Cheolsoo Park 9a11396113 [SPARK-9270] [PYSPARK] allow --name option in pyspark
This is continuation of #7512 which added `--name` option to spark-shell. This PR adds the same option to pyspark.

Note that `--conf spark.app.name` in command-line has no effect in spark-shell and pyspark. Instead, `--name` must be used. This is in fact inconsistency with spark-sql which doesn't accept `--name` option while it accepts `--conf spark.app.name`. I am not fixing this inconsistency in this PR. IMO, one of `--name` and `--conf spark.app.name` is needed not both. But since I cannot decide which to choose, I am not making any change here.

Author: Cheolsoo Park <cheolsoop@netflix.com>

Closes #7610 from piaozhexiu/SPARK-9270 and squashes the following commits:

763e86d [Cheolsoo Park] Update windows script
400b7f9 [Cheolsoo Park] Allow --name option to pyspark
2015-07-24 11:56:55 -07:00
Yanbo Liang 52de3acca4 [SPARK-9122] [MLLIB] [PySpark] spark.mllib regression support batch predict
spark.mllib support batch predict for LinearRegressionModel, RidgeRegressionModel and LassoModel.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #7614 from yanboliang/spark-9122 and squashes the following commits:

4e610c0 [Yanbo Liang] spark.mllib regression support batch predict
2015-07-23 18:53:07 -07:00
Davies Liu 8a94eb23d5 [SPARK-9069] [SPARK-9264] [SQL] remove unlimited precision support for DecimalType
Romove Decimal.Unlimited (change to support precision up to 38, to match with Hive and other databases).

In order to keep backward source compatibility, Decimal.Unlimited is still there, but change to Decimal(38, 18).

If no precision and scale is provide, it's Decimal(10, 0) as before.

Author: Davies Liu <davies@databricks.com>

Closes #7605 from davies/decimal_unlimited and squashes the following commits:

aa3f115 [Davies Liu] fix tests and style
fb0d20d [Davies Liu] address comments
bfaae35 [Davies Liu] fix style
df93657 [Davies Liu] address comments and clean up
06727fd [Davies Liu] Merge branch 'master' of github.com:apache/spark into decimal_unlimited
4c28969 [Davies Liu] fix tests
8d783cc [Davies Liu] fix tests
788631c [Davies Liu] fix double with decimal in Union/except
1779bde [Davies Liu] fix scala style
c9c7c78 [Davies Liu] remove Decimal.Unlimited
2015-07-23 18:31:13 -07:00
Xiangrui Meng ecfb312767 [SPARK-9243] [Documentation] null -> zero in crosstab doc
We forgot to update doc. brkyvz

Author: Xiangrui Meng <meng@databricks.com>

Closes #7608 from mengxr/SPARK-9243 and squashes the following commits:

0ea3236 [Xiangrui Meng] null -> zero in crosstab doc
2015-07-23 10:32:11 -07:00
Josh Rosen b217230f2a [SPARK-9144] Remove DAGScheduler.runLocallyWithinThread and spark.localExecution.enabled
Spark has an option called spark.localExecution.enabled; according to the docs:

> Enables Spark to run certain jobs, such as first() or take() on the driver, without sending tasks to the cluster. This can make certain jobs execute very quickly, but may require shipping a whole partition of data to the driver.

This feature ends up adding quite a bit of complexity to DAGScheduler, especially in the runLocallyWithinThread method, but as far as I know nobody uses this feature (I searched the mailing list and haven't seen any recent mentions of the configuration nor stacktraces including the runLocally method). As a step towards scheduler complexity reduction, I propose that we remove this feature and all code related to it for Spark 1.5.

This pull request simply brings #7484 up to date.

Author: Josh Rosen <joshrosen@databricks.com>
Author: Reynold Xin <rxin@databricks.com>

Closes #7585 from rxin/remove-local-exec and squashes the following commits:

84bd10e [Reynold Xin] Python fix.
1d9739a [Reynold Xin] Merge pull request #7484 from JoshRosen/remove-localexecution
eec39fa [Josh Rosen] Remove allowLocal(); deprecate user-facing uses of it.
b0835dc [Josh Rosen] Remove local execution code in DAGScheduler
8975d96 [Josh Rosen] Remove local execution tests.
ffa8c9b [Josh Rosen] Remove documentation for configuration
2015-07-22 21:04:04 -07:00
MechCoder 5307c9d3f7 [SPARK-9223] [PYSPARK] [MLLIB] Support model save/load in LDA
Since save / load has been merged in LDA, it takes no time to write the wrappers in Python as well.

Author: MechCoder <manojkumarsivaraj334@gmail.com>

Closes #7587 from MechCoder/python_lda_save_load and squashes the following commits:

c8e4ea7 [MechCoder] [SPARK-9223] [PySpark] Support model save/load in LDA
2015-07-22 17:22:12 -07:00
Matei Zaharia fe26584a1f [SPARK-9244] Increase some memory defaults
There are a few memory limits that people hit often and that we could
make higher, especially now that memory sizes have grown.

- spark.akka.frameSize: This defaults at 10 but is often hit for map
  output statuses in large shuffles. This memory is not fully allocated
  up-front, so we can just make this larger and still not affect jobs
  that never sent a status that large. We increase it to 128.

- spark.executor.memory: Defaults at 512m, which is really small. We
  increase it to 1g.

Author: Matei Zaharia <matei@databricks.com>

Closes #7586 from mateiz/configs and squashes the following commits:

ce0038a [Matei Zaharia] [SPARK-9244] Increase some memory defaults
2015-07-22 15:28:09 -07:00
Pedro Rodriguez 560c658a74 [SPARK-8230][SQL] Add array/map size method
Pull Request for: https://issues.apache.org/jira/browse/SPARK-8230

Primary issue resolved is to implement array/map size for Spark SQL. Code is ready for review by a committer. Chen Hao is on the JIRA ticket, but I don't know his username on github, rxin is also on JIRA ticket.

Things to review:
1. Where to put added functions namespace wise, they seem to be part of a few operations on collections which includes `sort_array` and `array_contains`. Hence the name given `collectionOperations.scala` and `_collection_functions` in python.
2. In Python code, should it be in a `1.5.0` function array or in a collections array?
3. Are there any missing methods on the `Size` case class? Looks like many of these functions have generated Java code, is that also needed in this case?
4. Something else?

Author: Pedro Rodriguez <ski.rodriguez@gmail.com>
Author: Pedro Rodriguez <prodriguez@trulia.com>

Closes #7462 from EntilZha/SPARK-8230 and squashes the following commits:

9a442ae [Pedro Rodriguez] fixed functions and sorted __all__
9aea3bb [Pedro Rodriguez] removed imports from python docs
15d4bf1 [Pedro Rodriguez] Added null test case and changed to nullSafeCodeGen
d88247c [Pedro Rodriguez] removed python code
bd5f0e4 [Pedro Rodriguez] removed duplicate function from rebase/merge
59931b4 [Pedro Rodriguez] fixed compile bug instroduced when merging
c187175 [Pedro Rodriguez] updated code to add size to __all__ directly and removed redundent pretty print
130839f [Pedro Rodriguez] fixed failing test
aa9bade [Pedro Rodriguez] fix style
e093473 [Pedro Rodriguez] updated python code with docs, switched classes/traits implemented, added (failing) expression tests
0449377 [Pedro Rodriguez] refactored code to use better abstract classes/traits and implementations
9a1a2ff [Pedro Rodriguez] added unit tests for map size
2bfbcb6 [Pedro Rodriguez] added unit test for size
20df2b4 [Pedro Rodriguez] Finished working version of size function and added it to python
b503e75 [Pedro Rodriguez] First attempt at implementing size for maps and arrays
99a6a5c [Pedro Rodriguez] fixed failing test
cac75ac [Pedro Rodriguez] fix style
933d843 [Pedro Rodriguez] updated python code with docs, switched classes/traits implemented, added (failing) expression tests
42bb7d4 [Pedro Rodriguez] refactored code to use better abstract classes/traits and implementations
f9c3b8a [Pedro Rodriguez] added unit tests for map size
2515d9f [Pedro Rodriguez] added documentation
0e60541 [Pedro Rodriguez] added unit test for size
acf9853 [Pedro Rodriguez] Finished working version of size function and added it to python
84a5d38 [Pedro Rodriguez] First attempt at implementing size for maps and arrays
2015-07-21 00:53:20 -07:00
Cheng Hao 8c8f0ef59e [SPARK-8255] [SPARK-8256] [SQL] Add regex_extract/regex_replace
Add expressions `regex_extract` & `regex_replace`

Author: Cheng Hao <hao.cheng@intel.com>

Closes #7468 from chenghao-intel/regexp and squashes the following commits:

e5ea476 [Cheng Hao] minor update for documentation
ef96fd6 [Cheng Hao] update the code gen
72cf28f [Cheng Hao] Add more log for compilation error
4e11381 [Cheng Hao] Add regexp_replace / regexp_extract support
2015-07-21 00:48:07 -07:00
Cheng Lian d38c5029a2 [SPARK-9100] [SQL] Adds DataFrame reader/writer shortcut methods for ORC
This PR adds DataFrame reader/writer shortcut methods for ORC in both Scala and Python.

Author: Cheng Lian <lian@databricks.com>

Closes #7444 from liancheng/spark-9100 and squashes the following commits:

284d043 [Cheng Lian] Fixes PySpark test cases and addresses PR comments
e0b09fb [Cheng Lian] Adds DataFrame reader/writer shortcut methods for ORC
2015-07-21 15:08:44 +08:00
Joseph K. Bradley a5d05819af [SPARK-9198] [MLLIB] [PYTHON] Fixed typo in pyspark sparsevector doc tests
Several places in the PySpark SparseVector docs have one defined as:
```
SparseVector(4, [2, 4], [1.0, 2.0])
```
The index 4 goes out of bounds (but this is not checked).

CC: mengxr

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #7541 from jkbradley/sparsevec-doc-typo-fix and squashes the following commits:

c806a65 [Joseph K. Bradley] fixed doc test
e2dcb23 [Joseph K. Bradley] Fixed typo in pyspark sparsevector doc tests
2015-07-20 16:49:55 -07:00
Davies Liu 9f913c4fd6 [SPARK-9114] [SQL] [PySpark] convert returned object from UDF into internal type
This PR also remove the duplicated code between registerFunction and UserDefinedFunction.

cc JoshRosen

Author: Davies Liu <davies@databricks.com>

Closes #7450 from davies/fix_return_type and squashes the following commits:

e80bf9f [Davies Liu] remove debugging code
f94b1f6 [Davies Liu] fix mima
8f9c58b [Davies Liu] convert returned object from UDF into internal type
2015-07-20 12:14:47 -07:00
Mateusz Buśkiewicz 02181fb6d1 [SPARK-9101] [PySpark] Add missing NullType
JIRA: https://issues.apache.org/jira/browse/SPARK-9101

Author: Mateusz Buśkiewicz <mateusz.buskiewicz@getbase.com>

Closes #7499 from sixers/spark-9101 and squashes the following commits:

dd75aa6 [Mateusz Buśkiewicz] [SPARK-9101] [PySpark] Test for selecting null literal
97e3f2f [Mateusz Buśkiewicz] [SPARK-9101] [PySpark] Add missing NullType to _atomic_types in pyspark.sql.types
2015-07-20 12:00:48 -07:00
MechCoder d0b4e93f7e [SPARK-8996] [MLLIB] [PYSPARK] Python API for Kolmogorov-Smirnov Test
Python API for the KS-test

Statistics.kolmogorovSmirnovTest(data, distName, *params)
I'm not quite sure how to support the callable function since it is not serializable.

Author: MechCoder <manojkumarsivaraj334@gmail.com>

Closes #7430 from MechCoder/spark-8996 and squashes the following commits:

2dd009d [MechCoder] minor
021d233 [MechCoder] Remove one wrapper and other minor stuff
49d07ab [MechCoder] [SPARK-8996] [MLlib] Python API for Kolmogorov-Smirnov Test
2015-07-20 09:00:01 -07:00
Nicholas Hwang a803ac3e06 [SPARK-9021] [PYSPARK] Change RDD.aggregate() to do reduce(mapPartitions()) instead of mapPartitions.fold()
I'm relatively new to Spark and functional programming, so forgive me if this pull request is just a result of my misunderstanding of how Spark should be used.

Currently, if one happens to use a mutable object as `zeroValue` for `RDD.aggregate()`, possibly unexpected behavior can occur.

This is because pyspark's current implementation of `RDD.aggregate()` does not serialize or make a copy of `zeroValue` before handing it off to `RDD.mapPartitions(...).fold(...)`. This results in a single reference to `zeroValue` being used for both `RDD.mapPartitions()` and `RDD.fold()` on each partition. This can result in strange accumulator values being fed into each partition's call to `RDD.fold()`, as the `zeroValue` may have been changed in-place during the `RDD.mapPartitions()` call.

As an illustrative example, submit the following to `spark-submit`:
```
from pyspark import SparkConf, SparkContext
import collections

def updateCounter(acc, val):
    print 'update acc:', acc
    print 'update val:', val
    acc[val] += 1
    return acc

def comboCounter(acc1, acc2):
    print 'combo acc1:', acc1
    print 'combo acc2:', acc2
    acc1.update(acc2)
    return acc1

def main():
    conf = SparkConf().setMaster("local").setAppName("Aggregate with Counter")
    sc = SparkContext(conf = conf)

    print '======= AGGREGATING with ONE PARTITION ======='
    print sc.parallelize(range(1,10), 1).aggregate(collections.Counter(), updateCounter, comboCounter)

    print '======= AGGREGATING with TWO PARTITIONS ======='
    print sc.parallelize(range(1,10), 2).aggregate(collections.Counter(), updateCounter, comboCounter)

if __name__ == "__main__":
    main()
```

One probably expects this to output the following:
```
Counter({1: 1, 2: 1, 3: 1, 4: 1, 5: 1, 6: 1, 7: 1, 8: 1, 9: 1})
```

But it instead outputs this (regardless of the number of partitions):
```
Counter({1: 2, 2: 2, 3: 2, 4: 2, 5: 2, 6: 2, 7: 2, 8: 2, 9: 2})
```

This is because (I believe) `zeroValue` gets passed correctly to each partition, but after `RDD.mapPartitions()` completes, the `zeroValue` object has been updated and is then passed to `RDD.fold()`, which results in all items being double-counted within each partition before being finally reduced at the calling node.

I realize that this type of calculation is typically done by `RDD.mapPartitions(...).reduceByKey(...)`, but hopefully this illustrates some potentially confusing behavior. I also noticed that other `RDD` methods use this `deepcopy` approach to creating unique copies of `zeroValue` (i.e., `RDD.aggregateByKey()` and `RDD.foldByKey()`), and that the Scala implementations do seem to serialize the `zeroValue` object appropriately to prevent this type of behavior.

Author: Nicholas Hwang <moogling@gmail.com>

Closes #7378 from njhwang/master and squashes the following commits:

659bb27 [Nicholas Hwang] Fixed RDD.aggregate() to perform a reduce operation on collected mapPartitions results, similar to how fold currently is implemented. This prevents an initial combOp being performed on each partition with zeroValue (which leads to unexpected behavior if zeroValue is a mutable object) before being combOp'ed with other partition results.
8d8d694 [Nicholas Hwang] Changed dict construction to be compatible with Python 2.6 (cannot use list comprehensions to make dicts)
56eb2ab [Nicholas Hwang] Fixed whitespace after colon to conform with PEP8
391de4a [Nicholas Hwang] Removed used of collections.Counter from RDD tests for Python 2.6 compatibility; used defaultdict(int) instead. Merged treeAggregate test with mutable zero value into aggregate test to reduce code duplication.
2fa4e4b [Nicholas Hwang] Merge branch 'master' of https://github.com/njhwang/spark
ba528bd [Nicholas Hwang] Updated comments regarding protection of zeroValue from mutation in RDD.aggregate(). Added regression tests for aggregate(), fold(), aggregateByKey(), foldByKey(), and treeAggregate(), all with both 1 and 2 partition RDDs. Confirmed that aggregate() is the only problematic implementation as of commit 257236c3e1. Also replaced some parallelizations of ranges with xranges, per the documentation's recommendations of preferring xrange over range.
7820391 [Nicholas Hwang] Updated comments regarding protection of zeroValue from mutation in RDD.aggregate(). Added regression tests for aggregate(), fold(), aggregateByKey(), foldByKey(), and treeAggregate(), all with both 1 and 2 partition RDDs. Confirmed that aggregate() is the only problematic implementation as of commit 257236c3e1.
90d1544 [Nicholas Hwang] Made sure RDD.aggregate() makes a deepcopy of zeroValue for all partitions; this ensures that the mapPartitions call works with unique copies of zeroValue in each partition, and prevents a single reference to zeroValue being used for both map and fold calls on each partition (resulting in possibly unexpected behavior).
2015-07-19 10:30:28 -07:00
Reynold Xin 3427937ea2 [SQL] Make date/time functions more consistent with other database systems.
This pull request fixes some of the problems in #6981.

- Added date functions to `__all__` so they get exposed
- Rename day_of_month -> dayofmonth
- Rename day_in_year -> dayofyear
- Rename week_of_year -> weekofyear
- Removed "day" from Scala/Python API since it is ambiguous. Only leaving the alias in SQL.

Author: Reynold Xin <rxin@databricks.com>

This patch had conflicts when merged, resolved by
Committer: Reynold Xin <rxin@databricks.com>

Closes #7506 from rxin/datetime and squashes the following commits:

0cb24d9 [Reynold Xin] Export all functions in Python.
e44a4a0 [Reynold Xin] Removed day function from Scala and Python.
9c08fdc [Reynold Xin] [SQL] Make date/time functions more consistent with other database systems.
2015-07-19 01:17:22 -07:00
Liang-Chi Hsieh 9b644c4130 [SPARK-9166][SQL][PYSPARK] Capture and hide IllegalArgumentException in Python API
JIRA: https://issues.apache.org/jira/browse/SPARK-9166

Simply capture and hide `IllegalArgumentException` in Python API.

Author: Liang-Chi Hsieh <viirya@appier.com>

Closes #7497 from viirya/hide_illegalargument and squashes the following commits:

8324dce [Liang-Chi Hsieh] Fix python style.
9ace67d [Liang-Chi Hsieh] Also check exception message.
8b2ce5c [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into hide_illegalargument
7be016a [Liang-Chi Hsieh] Capture and hide IllegalArgumentException in Python.
2015-07-19 00:32:56 -07:00
Tarek Auel 83b682beec [SPARK-8199][SPARK-8184][SPARK-8183][SPARK-8182][SPARK-8181][SPARK-8180][SPARK-8179][SPARK-8177][SPARK-8178][SPARK-9115][SQL] date functions
Jira:
https://issues.apache.org/jira/browse/SPARK-8199
https://issues.apache.org/jira/browse/SPARK-8184
https://issues.apache.org/jira/browse/SPARK-8183
https://issues.apache.org/jira/browse/SPARK-8182
https://issues.apache.org/jira/browse/SPARK-8181
https://issues.apache.org/jira/browse/SPARK-8180
https://issues.apache.org/jira/browse/SPARK-8179
https://issues.apache.org/jira/browse/SPARK-8177
https://issues.apache.org/jira/browse/SPARK-8179
https://issues.apache.org/jira/browse/SPARK-9115

Regarding `day`and `dayofmonth` are both necessary?

~~I am going to add `Quarter` to this PR as well.~~ Done.

~~As soon as the Scala coding is reviewed and discussed, I'll add the python api.~~ Done

Author: Tarek Auel <tarek.auel@googlemail.com>
Author: Tarek Auel <tarek.auel@gmail.com>

Closes #6981 from tarekauel/SPARK-8199 and squashes the following commits:

f7b4c8c [Tarek Auel] [SPARK-8199] fixed bug in tests
bb567b6 [Tarek Auel] [SPARK-8199] fixed test
3e095ba [Tarek Auel] [SPARK-8199] style and timezone fix
256c357 [Tarek Auel] [SPARK-8199] code cleanup
5983dcc [Tarek Auel] [SPARK-8199] whitespace fix
6e0c78f [Tarek Auel] [SPARK-8199] removed setTimeZone in tests, according to cloud-fans comment in #7488
4afc09c [Tarek Auel] [SPARK-8199] concise leap year handling
ea6c110 [Tarek Auel] [SPARK-8199] fix after merging master
70238e0 [Tarek Auel] Merge branch 'master' into SPARK-8199
3c6ae2e [Tarek Auel] [SPARK-8199] removed binary search
fb98ba0 [Tarek Auel] [SPARK-8199] python docstring fix
cdfae27 [Tarek Auel] [SPARK-8199] cleanup & python docstring fix
746b80a [Tarek Auel] [SPARK-8199] build fix
0ad6db8 [Tarek Auel] [SPARK-8199] minor fix
523542d [Tarek Auel] [SPARK-8199] address comments
2259299 [Tarek Auel] [SPARK-8199] day_of_month alias
d01b977 [Tarek Auel] [SPARK-8199] python underscore
56c4a92 [Tarek Auel] [SPARK-8199] update python docu
e223bc0 [Tarek Auel] [SPARK-8199] refactoring
d6aa14e [Tarek Auel] [SPARK-8199] fixed Hive compatibility
b382267 [Tarek Auel] [SPARK-8199] fixed bug in day calculation; removed set TimeZone in HiveCompatibilitySuite for test purposes; removed Hive tests for second and minute, because we can cast '2015-03-18' to a timestamp and extract a minute/second from it
1b2e540 [Tarek Auel] [SPARK-8119] style fix
0852655 [Tarek Auel] [SPARK-8119] changed from ExpectsInputTypes to implicit casts
ec87c69 [Tarek Auel] [SPARK-8119] bug fixing and refactoring
1358cdc [Tarek Auel] Merge remote-tracking branch 'origin/master' into SPARK-8199
740af0e [Tarek Auel] implement date function using a calculation based on days
4fb66da [Tarek Auel] WIP: date functions on calculation only
1a436c9 [Tarek Auel] wip
f775f39 [Tarek Auel] fixed return type
ad17e96 [Tarek Auel] improved implementation
c42b444 [Tarek Auel] Removed merge conflict file
ccb723c [Tarek Auel] [SPARK-8199] style and fixed merge issues
10e4ad1 [Tarek Auel] Merge branch 'master' into date-functions-fast
7d9f0eb [Tarek Auel] [SPARK-8199] git renaming issue
f3e7a9f [Tarek Auel] [SPARK-8199] revert change in DataFrameFunctionsSuite
6f5d95c [Tarek Auel] [SPARK-8199] fixed year interval
d9f8ac3 [Tarek Auel] [SPARK-8199] implement fast track
7bc9d93 [Tarek Auel] Merge branch 'master' into SPARK-8199
5a105d9 [Tarek Auel] [SPARK-8199] rebase after #6985 got merged
eb6760d [Tarek Auel] Merge branch 'master' into SPARK-8199
f120415 [Tarek Auel] improved runtime
a8edebd [Tarek Auel] use Calendar instead of SimpleDateFormat
5fe74e1 [Tarek Auel] fixed python style
3bfac90 [Tarek Auel] fixed style
356df78 [Tarek Auel] rely on cast mechanism of Spark. Simplified implementation
02efc5d [Tarek Auel] removed doubled code
a5ea120 [Tarek Auel] added python api; changed test to be more meaningful
b680db6 [Tarek Auel] added codegeneration to all functions
c739788 [Tarek Auel] added support for quarter SPARK-8178
849fb41 [Tarek Auel] fixed stupid test
638596f [Tarek Auel] improved codegen
4d8049b [Tarek Auel] fixed tests and added type check
5ebb235 [Tarek Auel] resolved naming conflict
d0e2f99 [Tarek Auel] date functions
2015-07-18 22:48:05 -07:00
Yu ISHIKAWA 34a889db85 [SPARK-7879] [MLLIB] KMeans API for spark.ml Pipelines
I Implemented the KMeans API for spark.ml Pipelines. But it doesn't include clustering abstractions for spark.ml (SPARK-7610). It would fit for another issues. And I'll try it later, since we are trying to add the hierarchical clustering algorithms in another issue. Thanks.

[SPARK-7879] KMeans API for spark.ml Pipelines - ASF JIRA https://issues.apache.org/jira/browse/SPARK-7879

Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #6756 from yu-iskw/SPARK-7879 and squashes the following commits:

be752de [Yu ISHIKAWA] Add assertions
a14939b [Yu ISHIKAWA] Fix the dashed line's length in pyspark.ml.rst
4c61693 [Yu ISHIKAWA] Remove the test about whether "features" and "prediction" columns exist or not in Python
fb2417c [Yu ISHIKAWA] Use getInt, instead of get
f397be4 [Yu ISHIKAWA] Switch the comparisons.
ca78b7d [Yu ISHIKAWA] Add the Scala docs about the constraints of each parameter.
effc650 [Yu ISHIKAWA] Using expertSetParam and expertGetParam
c8dc6e6 [Yu ISHIKAWA] Remove an unnecessary test
19a9d63 [Yu ISHIKAWA] Include spark.ml.clustering to python tests
1abb19c [Yu ISHIKAWA] Add the statements about spark.ml.clustering into pyspark.ml.rst
f8338bc [Yu ISHIKAWA] Add the placeholders in Python
4a03003 [Yu ISHIKAWA] Test for contains in Python
6566c8b [Yu ISHIKAWA] Use `get`, instead of `apply`
288e8d5 [Yu ISHIKAWA] Using `contains` to check the column names
5a7d574 [Yu ISHIKAWA] Renamce `validateInitializationMode` to `validateInitMode` and remove throwing exception
97cfae3 [Yu ISHIKAWA] Fix the type of return value of `KMeans.copy`
e933723 [Yu ISHIKAWA] Remove the default value of seed from the Model class
978ee2c [Yu ISHIKAWA] Modify the docs of KMeans, according to mllib's KMeans
2ec80bc [Yu ISHIKAWA] Fit on 1 line
e186be1 [Yu ISHIKAWA] Make a few variables, setters and getters be expert ones
b2c205c [Yu ISHIKAWA] Rename the method `getInitializationSteps` to `getInitSteps` and `setInitializationSteps` to `setInitSteps` in Scala and Python
f43f5b4 [Yu ISHIKAWA] Rename the method `getInitializationMode` to `getInitMode` and `setInitializationMode` to `setInitMode` in Scala and Python
3cb5ba4 [Yu ISHIKAWA] Modify the description about epsilon and the validation
4fa409b [Yu ISHIKAWA] Add a comment about the default value of epsilon
2f392e1 [Yu ISHIKAWA] Make some variables `final` and Use `IntParam` and `DoubleParam`
19326f8 [Yu ISHIKAWA] Use `udf`, instead of callUDF
4d2ad1e [Yu ISHIKAWA] Modify the indentations
0ae422f [Yu ISHIKAWA] Add a test for `setParams`
4ff7913 [Yu ISHIKAWA] Add "ml.clustering" to `javacOptions` in SparkBuild.scala
11ffdf1 [Yu ISHIKAWA] Use `===` and the variable
220a176 [Yu ISHIKAWA] Set a random seed in the unit testing
92c3efc [Yu ISHIKAWA] Make the points for a test be fewer
c758692 [Yu ISHIKAWA] Modify the parameters of KMeans in Python
6aca147 [Yu ISHIKAWA] Add some unit testings to validate the setter methods
687cacc [Yu ISHIKAWA] Alias mllib.KMeans as MLlibKMeans in KMeansSuite.scala
a4dfbef [Yu ISHIKAWA] Modify the last brace and indentations
5bedc51 [Yu ISHIKAWA] Remve an extra new line
444c289 [Yu ISHIKAWA] Add the validation for `runs`
e41989c [Yu ISHIKAWA] Modify how to validate `initStep`
7ea133a [Yu ISHIKAWA] Change how to validate `initMode`
7991e15 [Yu ISHIKAWA] Add a validation for `k`
c2df35d [Yu ISHIKAWA] Make `predict` private
93aa2ff [Yu ISHIKAWA] Use `withColumn` in `transform`
d3a79f7 [Yu ISHIKAWA] Remove the inhefited docs
e9532e1 [Yu ISHIKAWA] make `parentModel` of KMeansModel private
8559772 [Yu ISHIKAWA] Remove the `paramMap` parameter of KMeans
6684850 [Yu ISHIKAWA] Rename `initializationSteps` to `initSteps`
99b1b96 [Yu ISHIKAWA] Rename `initializationMode` to `initMode`
79ea82b [Yu ISHIKAWA] Modify the parameters of KMeans docs
6569bcd [Yu ISHIKAWA] Change how to set the default values with `setDefault`
20a795a [Yu ISHIKAWA] Change how to set the default values with `setDefault`
11c2a12 [Yu ISHIKAWA] Limit the imports
badb481 [Yu ISHIKAWA] Alias spark.mllib.{KMeans, KMeansModel}
f80319a [Yu ISHIKAWA] Rebase mater branch and add copy methods
85d92b1 [Yu ISHIKAWA] Add `KMeans.setPredictionCol`
aa9469d [Yu ISHIKAWA] Fix a python test suite error caused by python 3.x
c2d6bcb [Yu ISHIKAWA] ADD Java test suites of the KMeans API for spark.ml Pipeline
598ed2e [Yu ISHIKAWA] Implement the KMeans API for spark.ml Pipelines in Python
63ad785 [Yu ISHIKAWA] Implement the KMeans API for spark.ml Pipelines in Scala
2015-07-17 18:30:04 -07:00
Yanbo Liang 830666f6fe [SPARK-8792] [ML] Add Python API for PCA transformer
Add Python API for PCA transformer

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #7190 from yanboliang/spark-8792 and squashes the following commits:

8f4ac31 [Yanbo Liang] address comments
8a79cc0 [Yanbo Liang] Add Python API for PCA transformer
2015-07-17 14:08:06 -07:00
Davies Liu f9a82a884e [SPARK-9138] [MLLIB] fix Vectors.dense
Vectors.dense() should accept numbers directly, like the one in Scala. We already use it in doctests, it worked by luck.

cc mengxr jkbradley

Author: Davies Liu <davies@databricks.com>

Closes #7476 from davies/fix_vectors_dense and squashes the following commits:

e0fd292 [Davies Liu] fix Vectors.dense
2015-07-17 12:43:58 -07:00
Lianhui Wang 49351c7f59 [SPARK-8646] PySpark does not run on YARN if master not provided in command line
andrewor14 davies vanzin can you take a look at this? thanks

Author: Lianhui Wang <lianhuiwang09@gmail.com>

Closes #7438 from lianhuiwang/SPARK-8646 and squashes the following commits:

cb3f12d [Lianhui Wang] add whitespace
6d874a6 [Lianhui Wang] support pyspark for yarn-client
2015-07-16 19:31:45 -07:00
Cheng Hao 42dea3acf9 [SPARK-8245][SQL] FormatNumber/Length Support for Expression
- `BinaryType` for `Length`
- `FormatNumber`

Author: Cheng Hao <hao.cheng@intel.com>

Closes #7034 from chenghao-intel/expression and squashes the following commits:

e534b87 [Cheng Hao] python api style issue
601bbf5 [Cheng Hao] add python API support
3ebe288 [Cheng Hao] update as feedback
52274f7 [Cheng Hao] add support for udf_format_number and length for binary
2015-07-15 21:47:21 -07:00
Feynman Liang 536533cad8 [SPARK-9005] [MLLIB] Fix RegressionMetrics computation of explainedVariance
Fixes implementation of `explainedVariance` and `r2` to be consistent with their definitions as described in [SPARK-9005](https://issues.apache.org/jira/browse/SPARK-9005).

Author: Feynman Liang <fliang@databricks.com>

Closes #7361 from feynmanliang/SPARK-9005-RegressionMetrics-bugs and squashes the following commits:

f1112fc [Feynman Liang] Add explainedVariance formula
1a3d098 [Feynman Liang] SROwen code review comments
08a0e1b [Feynman Liang] Fix pyspark tests
db8605a [Feynman Liang] Style fix
bde9761 [Feynman Liang] Fix RegressionMetrics tests, relax assumption predictor is unbiased
c235de0 [Feynman Liang] Fix RegressionMetrics tests
4c4e56f [Feynman Liang] Fix RegressionMetrics computation of explainedVariance and r2
2015-07-15 13:32:25 -07:00
MechCoder 20bb10f864 [SPARK-8706] [PYSPARK] [PROJECT INFRA] Add pylint checks to PySpark
This adds Pylint checks to PySpark.

For now this lazy installs using easy_install to /dev/pylint (similar to the pep8 script).
We still need to figure out what rules to be allowed.

Author: MechCoder <manojkumarsivaraj334@gmail.com>

Closes #7241 from MechCoder/pylint and squashes the following commits:

2fc7291 [MechCoder] Remove pylint test fail
6d883a2 [MechCoder] Silence warnings and make pylint tests fail to check if it works in jenkins
f3a5e17 [MechCoder] undefined-variable
ca8b749 [MechCoder] Minor changes
71629f8 [MechCoder] remove trailing whitespace
8498ff9 [MechCoder] Remove blacklisted arguments and pointless statements check
1dbd094 [MechCoder] Disable all checks for now
8b8aa8a [MechCoder] Add pylint configuration file
7871bb1 [MechCoder] [SPARK-8706] [PySpark] [Project infra] Add pylint checks to PySpark
2015-07-15 08:25:53 -07:00
Yu ISHIKAWA 4692769655 [SPARK-6259] [MLLIB] Python API for LDA
I implemented the Python API for LDA. But I didn't implemented a method for `LDAModel.describeTopics()`, beause it's a little hard to implement it now. And adding document about that and an example code would fit for another issue.

TODO: LDAModel.describeTopics() in Python must be also implemented. But it would be nice to fit for another issue. Implementing it is a little hard, since the return value of `describeTopics` in Scala consists of Tuple classes.

Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #6791 from yu-iskw/SPARK-6259 and squashes the following commits:

6855f59 [Yu ISHIKAWA] LDA inherits object
28bd165 [Yu ISHIKAWA] Change the place of testing code
d7a332a [Yu ISHIKAWA] Remove the doc comment about the optimizer's default value
083e226 [Yu ISHIKAWA] Add the comment about the supported values and the default value of `optimizer`
9f8bed8 [Yu ISHIKAWA] Simplify casting
faa9764 [Yu ISHIKAWA] Add some comments for the LDA paramters
98f645a [Yu ISHIKAWA] Remove the interface for `describeTopics`. Because it is not implemented.
57ac03d [Yu ISHIKAWA] Remove the unnecessary import in Python unit testing
73412c3 [Yu ISHIKAWA] Fix the typo
2278829 [Yu ISHIKAWA] Fix the indentation
39514ec [Yu ISHIKAWA] Modify how to cast the input data
8117e18 [Yu ISHIKAWA] Fix the validation problems by `lint-scala`
77fd1b7 [Yu ISHIKAWA] Not use LabeledPoint
68f0653 [Yu ISHIKAWA] Support some parameters for `ALS.train()` in Python
25ef2ac [Yu ISHIKAWA] Resolve conflicts with rebasing
2015-07-14 23:27:42 -07:00
Davies Liu 79c35826e6 Revert "[SPARK-8706] [PYSPARK] [PROJECT INFRA] Add pylint checks to PySpark"
This reverts commit 9b62e9375f.
2015-07-13 11:30:36 -07:00
MechCoder 9b62e9375f [SPARK-8706] [PYSPARK] [PROJECT INFRA] Add pylint checks to PySpark
This adds Pylint checks to PySpark.

For now this lazy installs using easy_install to /dev/pylint (similar to the pep8 script).
We still need to figure out what rules to be allowed.

Author: MechCoder <manojkumarsivaraj334@gmail.com>

Closes #7241 from MechCoder/pylint and squashes the following commits:

8496834 [MechCoder] Silence warnings and make pylint tests fail to check if it works in jenkins
57393a3 [MechCoder] undefined-variable
a8e2547 [MechCoder] Minor changes
7753810 [MechCoder] remove trailing whitespace
75c5d2b [MechCoder] Remove blacklisted arguments and pointless statements check
6bde250 [MechCoder] Disable all checks for now
3464666 [MechCoder] Add pylint configuration file
d28109f [MechCoder] [SPARK-8706] [PySpark] [Project infra] Add pylint checks to PySpark
2015-07-13 09:47:53 -07:00
Davies Liu 20b474335c [SPARK-9006] [PYSPARK] fix microsecond loss in Python 3
It may loss a microsecond if using timestamp as float, should be `int` instead.

Author: Davies Liu <davies@databricks.com>

Closes #7363 from davies/fix_microsecond and squashes the following commits:

36f6007 [Davies Liu] fix microsecond loss in Python 3
2015-07-12 20:25:06 -07:00
Scott Taylor 6e1c7e2798 [SPARK-7735] [PYSPARK] Raise Exception on non-zero exit from pipe commands
This will allow problems with piped commands to be detected.
This will also allow tasks to be retried where errors are rare (such as network problems in piped commands).

Author: Scott Taylor <github@megatron.me.uk>

Closes #6262 from megatron-me-uk/patch-2 and squashes the following commits:

04ae1d5 [Scott Taylor] Remove spurious empty line
98fa101 [Scott Taylor] fix blank line style error
574b564 [Scott Taylor] Merge pull request #2 from megatron-me-uk/patch-4
0c1e762 [Scott Taylor] Update rdd pipe method for checkCode
ab9a2e1 [Scott Taylor] Update rdd pipe tests for checkCode
eb4801c [Scott Taylor] fix fail_condition
b0ac3a4 [Scott Taylor] Merge pull request #1 from megatron-me-uk/megatron-me-uk-patch-1
a307d13 [Scott Taylor] update rdd tests to test pipe modes
34fcdc3 [Scott Taylor] add optional argument 'mode' for rdd.pipe
a0c0161 [Scott Taylor] fix generator issue
8a9ef9c [Scott Taylor] make check_return_code an iterator
0486ae3 [Scott Taylor] style fixes
8ed89a6 [Scott Taylor] Chain generators to prevent potential deadlock
4153b02 [Scott Taylor] fix list.sort returns None
491d3fc [Scott Taylor] Pass a function handle to assertRaises
3344a21 [Scott Taylor] wrap assertRaises with QuietTest
3ab8c7a [Scott Taylor] remove whitespace for style
cc1a73d [Scott Taylor] fix style issues in pipe test
8db4073 [Scott Taylor] Add a test for rdd pipe functions
1b3dc4e [Scott Taylor] fix missing space around operator style
0974f98 [Scott Taylor] add space between words in multiline string
45f4977 [Scott Taylor] fix line too long style error
5745d85 [Scott Taylor] Remove space to fix style
f552d49 [Scott Taylor] Catch non-zero exit from pipe commands
2015-07-10 19:29:32 -07:00
Davies Liu b6fc0adf68 add inline comment for python tests 2015-07-10 17:44:21 -07:00
Davies Liu 05ac023dc8 [HOTFIX] fix flaky test in PySpark SQL
It may loss precision in microseconds when using float for it.

Author: Davies Liu <davies@databricks.com>

Closes #7344 from davies/fix_date_test and squashes the following commits:

249ec61 [Davies Liu] fix flaky test
2015-07-10 13:05:23 -07:00
Davies Liu c9e2ef52bb [SPARK-7902] [SPARK-6289] [SPARK-8685] [SQL] [PYSPARK] Refactor of serialization for Python DataFrame
This PR fix the long standing issue of serialization between Python RDD and DataFrame, it change to using a customized Pickler for InternalRow to enable customized unpickling (type conversion, especially for UDT), now we can support UDT for UDF, cc mengxr .

There is no generated `Row` anymore.

Author: Davies Liu <davies@databricks.com>

Closes #7301 from davies/sql_ser and squashes the following commits:

81bef71 [Davies Liu] address comments
e9217bd [Davies Liu] add regression tests
db34167 [Davies Liu] Refactor of serialization for Python DataFrame
2015-07-09 14:43:38 -07:00
jerryshao 3ccebf36c5 [SPARK-8389] [STREAMING] [PYSPARK] Expose KafkaRDDs offsetRange in Python
This PR propose a simple way to expose OffsetRange in Python code, also the usage of offsetRanges is similar to Scala/Java way, here in Python we could get OffsetRange like:

```
dstream.foreachRDD(lambda r: KafkaUtils.offsetRanges(r))
```

Reason I didn't follow the way what SPARK-8389 suggested is that: Python Kafka API has one more step to decode the message compared to Scala/Java, Which makes Python API return a transformed RDD/DStream, not directly wrapped so-called JavaKafkaRDD, so it is hard to backtrack to the original RDD to get the offsetRange.

Author: jerryshao <saisai.shao@intel.com>

Closes #7185 from jerryshao/SPARK-8389 and squashes the following commits:

4c6d320 [jerryshao] Another way to fix subclass deserialization issue
e6a8011 [jerryshao] Address the comments
fd13937 [jerryshao] Fix serialization bug
7debf1c [jerryshao] bug fix
cff3893 [jerryshao] refactor the code according to the comments
2aabf9e [jerryshao] Style fix
848c708 [jerryshao] Add HasOffsetRanges for Python
2015-07-09 13:54:44 -07:00
lewuathe f88b12537e [SPARK-6266] [MLLIB] PySpark SparseVector missing doc for size, indices, values
Write missing pydocs in `SparseVector` attributes.

Author: lewuathe <lewuathe@me.com>

Closes #7290 from Lewuathe/SPARK-6266 and squashes the following commits:

51d9895 [lewuathe] Update docs
0480d35 [lewuathe] Merge branch 'master' into SPARK-6266
ba42cf3 [lewuathe] [SPARK-6266] PySpark SparseVector missing doc for size, indices, values
2015-07-09 08:16:26 -07:00
Yijie Shen a290814877 [SPARK-8866][SQL] use 1us precision for timestamp type
JIRA: https://issues.apache.org/jira/browse/SPARK-8866

Author: Yijie Shen <henry.yijieshen@gmail.com>

Closes #7283 from yijieshen/micro_timestamp and squashes the following commits:

dc735df [Yijie Shen] update CastSuite to avoid round error
714eaea [Yijie Shen] add timestamp_udf into blacklist due to precision lose
c3ca2f4 [Yijie Shen] fix unhandled case in CurrentTimestamp
8d4aa6b [Yijie Shen] use 1us precision for timestamp type
2015-07-08 20:20:17 -07:00
Davies Liu 74d8d3d928 [SPARK-8450] [SQL] [PYSARK] cleanup type converter for Python DataFrame
This PR fixes the converter for Python DataFrame, especially for DecimalType

Closes #7106

Author: Davies Liu <davies@databricks.com>

Closes #7131 from davies/decimal_python and squashes the following commits:

4d3c234 [Davies Liu] Merge branch 'master' of github.com:apache/spark into decimal_python
20531d6 [Davies Liu] Merge branch 'master' of github.com:apache/spark into decimal_python
7d73168 [Davies Liu] fix conflit
6cdd86a [Davies Liu] Merge branch 'master' of github.com:apache/spark into decimal_python
7104e97 [Davies Liu] improve type infer
9cd5a21 [Davies Liu] run python tests with SPARK_PREPEND_CLASSES
829a05b [Davies Liu] fix UDT in python
c99e8c5 [Davies Liu] fix mima
c46814a [Davies Liu] convert decimal for Python DataFrames
2015-07-08 18:22:53 -07:00
Yanbo Liang 381cb161ba [SPARK-8068] [MLLIB] Add confusionMatrix method at class MulticlassMetrics in pyspark/mllib
Add confusionMatrix method at class MulticlassMetrics in pyspark/mllib

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #7286 from yanboliang/spark-8068 and squashes the following commits:

6109fe1 [Yanbo Liang] Add confusionMatrix method at class MulticlassMetrics in pyspark/mllib
2015-07-08 16:21:28 -07:00
MechCoder 2b40365d76 [SPARK-7785] [MLLIB] [PYSPARK] Add __str__ and __repr__ to Matrices
Adding __str__ and  __repr__ to DenseMatrix and SparseMatrix

Author: MechCoder <manojkumarsivaraj334@gmail.com>

Closes #6342 from MechCoder/spark-7785 and squashes the following commits:

7b9a82c [MechCoder] Add tests for greater than 16 elements
b88e9dd [MechCoder] Increment limit to 16
1425a01 [MechCoder] Change tests
36bd166 [MechCoder] Change str and repr representation
97f0da9 [MechCoder] zip is same as izip in python3
94ca4b2 [MechCoder] Added doctests and iterate over values instead of colPtrs
b26fa89 [MechCoder] minor
394dde9 [MechCoder] [SPARK-7785] Add __str__ and __repr__ to Matrices
2015-07-08 13:19:27 -07:00
MechCoder 35d781e71b [SPARK-8704] [ML] [PySpark] Add missing methods in StandardScaler
Add std, mean to StandardScalerModel
getVectors, findSynonyms to Word2Vec Model
setFeatures and getFeatures to hashingTF

Author: MechCoder <manojkumarsivaraj334@gmail.com>

Closes #7086 from MechCoder/missing_model_methods and squashes the following commits:

9fbae90 [MechCoder] Add type
6e3d6b2 [MechCoder] [SPARK-8704] Add missing methods in StandardScaler (ML and PySpark)
2015-07-07 12:35:40 -07:00
MechCoder 738c10748b [SPARK-8823] [MLLIB] [PYSPARK] Optimizations for SparseVector dot products
Follow up for https://github.com/apache/spark/pull/5946

Currently we iterate over indices and values in SparseVector and can be vectorized.

Author: MechCoder <manojkumarsivaraj334@gmail.com>

Closes #7222 from MechCoder/sparse_optim and squashes the following commits:

dcb51d3 [MechCoder] [SPARK-8823] [MLlib] [PySpark] Optimizations for SparseVector dot product
2015-07-07 08:59:52 -07:00
MechCoder 1dbc4a155f [SPARK-8711] [ML] Add additional methods to PySpark ML tree models
Add numNodes and depth to treeModels, add treeWeights to ensemble Models.
Add __repr__ to all models.

Author: MechCoder <manojkumarsivaraj334@gmail.com>

Closes #7095 from MechCoder/missing_methods_tree and squashes the following commits:

23b08be [MechCoder] private [spark]
38a0860 [MechCoder] rename pyTreeWeights to javaTreeWeights
6d16ad8 [MechCoder] Fix Python 3 Error
47d7023 [MechCoder] Use np.allclose and treeEnsembleModel -> TreeEnsembleMethods
819098c [MechCoder] [SPARK-8711] [ML] Add additional methods ot PySpark ML tree models
2015-07-07 08:58:08 -07:00
Yanbo Liang 0effe180f4 [SPARK-8765] [MLLIB] Fix PySpark PowerIterationClustering test issue
PySpark PowerIterationClustering test failure due to bad demo data.
If the data is small,  PowerIterationClustering will behavior indeterministic.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #7177 from yanboliang/spark-8765 and squashes the following commits:

392ae54 [Yanbo Liang] fix model.assignments output
5ec3f1e [Yanbo Liang] fix PySpark PowerIterationClustering test issue
2015-07-06 16:15:12 -07:00
Xiangrui Meng 96c5eeec39 Revert "[SPARK-7212] [MLLIB] Add sequence learning flag"
This reverts commit 25f574eb9a. After speaking to some users and developers, we realized that FP-growth doesn't meet the requirement for frequent sequence mining. PrefixSpan (SPARK-6487) would be the correct algorithm for it. feynmanliang

Author: Xiangrui Meng <meng@databricks.com>

Closes #7240 from mengxr/SPARK-7212.revert and squashes the following commits:

2b3d66b [Xiangrui Meng] Revert "[SPARK-7212] [MLLIB] Add sequence learning flag"
2015-07-06 16:11:22 -07:00
Davies Liu 37e4d92142 [SPARK-8784] [SQL] Add Python API for hex and unhex
Add Python API for hex/unhex,  also cleanup Hex/Unhex

Author: Davies Liu <davies@databricks.com>

Closes #7223 from davies/hex and squashes the following commits:

6f1249d [Davies Liu] no explicit rule to cast string into binary
711a6ed [Davies Liu] fix test
f9fe5a3 [Davies Liu] Merge branch 'master' of github.com:apache/spark into hex
f032fbb [Davies Liu] Merge branch 'hex' of github.com:davies/spark into hex
49e325f [Davies Liu] Merge branch 'master' of github.com:apache/spark into hex
b31fc9a [Davies Liu] Update math.scala
25156b7 [Davies Liu] address comments and fix test
c3af78c [Davies Liu] address commments
1a24082 [Davies Liu] Add Python API for hex and unhex
2015-07-06 13:31:31 -07:00
Tarek Auel 6b3574e687 [SPARK-8270][SQL] levenshtein distance
Jira: https://issues.apache.org/jira/browse/SPARK-8270

Info: I can not build the latest master, it stucks during the build process: `[INFO] Dependency-reduced POM written at: /Users/tarek/test/spark/bagel/dependency-reduced-pom.xml`

Author: Tarek Auel <tarek.auel@googlemail.com>

Closes #7214 from tarekauel/SPARK-8270 and squashes the following commits:

ab348b9 [Tarek Auel] Merge branch 'master' into SPARK-8270
a2ad318 [Tarek Auel] [SPARK-8270] changed order of fields
d91b12c [Tarek Auel] [SPARK-8270] python fix
adbd075 [Tarek Auel] [SPARK-8270] fixed typo
23185c9 [Tarek Auel] [SPARK-8270] levenshtein distance
2015-07-04 01:10:52 -07:00
MechCoder f0fac2aa80 [SPARK-7401] [MLLIB] [PYSPARK] Vectorize dot product and sq_dist between SparseVector and DenseVector
Currently we iterate over indices which can be vectorized.

Author: MechCoder <manojkumarsivaraj334@gmail.com>

Closes #5946 from MechCoder/spark-7203 and squashes the following commits:

034d086 [MechCoder] Vectorize dot calculation for numpy arrays for ndim=2
bce2b07 [MechCoder] fix doctest
fcad0a3 [MechCoder] Remove type checks for list, pyarray etc
0ee5dd4 [MechCoder] Add tests and other isinstance changes
e5f1de0 [MechCoder] [SPARK-7401] Vectorize dot product and sq_dist
2015-07-03 15:49:32 -07:00
zhichao.li ab535b9a1d [SPARK-8226] [SQL] Add function shiftrightunsigned
Author: zhichao.li <zhichao.li@intel.com>

Closes #7035 from zhichao-li/shiftRightUnsigned and squashes the following commits:

6bcca5a [zhichao.li] change coding style
3e9f5ae [zhichao.li] python style
d85ae0b [zhichao.li] add shiftrightunsigned
2015-07-03 15:39:16 -07:00
Reynold Xin e589e71a29 Revert "[SPARK-8784] [SQL] Add Python API for hex and unhex"
This reverts commit fc7aebd94a.
2015-07-02 16:25:10 -07:00
Yu ISHIKAWA 488bad319a [SPARK-7104] [MLLIB] Support model save/load in Python's Word2Vec
Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #6821 from yu-iskw/SPARK-7104 and squashes the following commits:

975136b [Yu ISHIKAWA] Organize import
0ef58b6 [Yu ISHIKAWA] Use rmtree, instead of removedirs
cb21653 [Yu ISHIKAWA] Add an explicit type for `Word2VecModelWrapper.save`
1d468ef [Yu ISHIKAWA] [SPARK-7104][MLlib] Support model save/load in Python's Word2Vec
2015-07-02 15:55:16 -07:00
Davies Liu fc7aebd94a [SPARK-8784] [SQL] Add Python API for hex and unhex
Also improve the performance of hex/unhex

Author: Davies Liu <davies@databricks.com>

Closes #7181 from davies/hex and squashes the following commits:

f032fbb [Davies Liu] Merge branch 'hex' of github.com:davies/spark into hex
49e325f [Davies Liu] Merge branch 'master' of github.com:apache/spark into hex
b31fc9a [Davies Liu] Update math.scala
25156b7 [Davies Liu] address comments and fix test
c3af78c [Davies Liu] address commments
1a24082 [Davies Liu] Add Python API for hex and unhex
2015-07-02 15:43:02 -07:00
Yijie Shen 52302a8039 [SPARK-8407] [SQL] complex type constructors: struct and named_struct
This is a follow up of [SPARK-8283](https://issues.apache.org/jira/browse/SPARK-8283) ([PR-6828](https://github.com/apache/spark/pull/6828)), to support both `struct` and `named_struct` in Spark SQL.

After [#6725](https://github.com/apache/spark/pull/6828), the semantic of [`CreateStruct`](https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypes.scala#L56) methods have changed a little and do not limited to cols of `NamedExpressions`, it will name non-NamedExpression fields following the hive convention, col1, col2 ...

This PR would both loosen [`struct`](https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L723) to take children of `Expression` type and add `named_struct` support.

Author: Yijie Shen <henry.yijieshen@gmail.com>

Closes #6874 from yijieshen/SPARK-8283 and squashes the following commits:

4cd3375ac [Yijie Shen] change struct documentation
d599d0b [Yijie Shen] rebase code
9a7039e [Yijie Shen] fix reviews and regenerate golden answers
b487354 [Yijie Shen] replace assert using checkAnswer
f07e114 [Yijie Shen] tiny fix
9613be9 [Yijie Shen] review fix
7fef712 [Yijie Shen] Fix checkInputTypes' implementation using foldable and nullable
60812a7 [Yijie Shen] Fix type check
828d694 [Yijie Shen] remove unnecessary resolved assertion inside dataType method
fd3cd8e [Yijie Shen] remove type check from eval
7a71255 [Yijie Shen] tiny fix
ccbbd86 [Yijie Shen] Fix reviews
47da332 [Yijie Shen] remove nameStruct API from DataFrame
917e680 [Yijie Shen] Fix reviews
4bd75ad [Yijie Shen] loosen struct method in functions.scala to take Expression children
0acb7be [Yijie Shen] Add CreateNamedStruct in both DataFrame function API and FunctionRegistery
2015-07-02 10:12:25 -07:00
Tarek Auel 5b3338130d [SPARK-8223] [SPARK-8224] [SQL] shift left and shift right
Jira:
https://issues.apache.org/jira/browse/SPARK-8223
https://issues.apache.org/jira/browse/SPARK-8224

~~I am aware of #7174 and will update this pr, if it's merged.~~ Done
I don't know if #7034 can simplify this, but we can have a look on it, if it gets merged

rxin In the Jira ticket the function as no second argument. I added a `numBits` argument that allows to specify the number of bits. I guess this improves the usability. I wanted to add `shiftleft(value)` as well, but the `selectExpr` dataframe tests crashes, if I have both. I order to do this, I added the following to the functions.scala `def shiftRight(e: Column): Column = ShiftRight(e.expr, lit(1).expr)`, but as I mentioned this doesn't pass tests like `df.selectExpr("shiftRight(a)", ...` (not enough arguments exception).

If we need the bitwise shift in order to be hive compatible, I suggest to add `shiftLeft` and something like `shiftLeftX`

Author: Tarek Auel <tarek.auel@googlemail.com>

Closes #7178 from tarekauel/8223 and squashes the following commits:

8023bb5 [Tarek Auel] [SPARK-8223][SPARK-8224] fixed test
f3f64e6 [Tarek Auel] [SPARK-8223][SPARK-8224] Integer -> Int
f628706 [Tarek Auel] [SPARK-8223][SPARK-8224] removed toString; updated function description
3b56f2a [Tarek Auel] Merge remote-tracking branch 'origin/master' into 8223
5189690 [Tarek Auel] [SPARK-8223][SPARK-8224] minor fix and style fix
9434a28 [Tarek Auel] Merge remote-tracking branch 'origin/master' into 8223
44ee324 [Tarek Auel] [SPARK-8223][SPARK-8224] docu fix
ac7fe9d [Tarek Auel] [SPARK-8223][SPARK-8224] right and left bit shift
2015-07-02 10:02:19 -07:00
Reynold Xin 9fd13d5613 [SPARK-8770][SQL] Create BinaryOperator abstract class.
Our current BinaryExpression abstract class is not for generic binary expressions, i.e. it requires left/right children to have the same type. However, due to its name, contributors build new binary expressions that don't have that assumption (e.g. Sha) and still extend BinaryExpression.

This patch creates a new BinaryOperator abstract class, and update the analyzer o only apply type casting rule there. This patch also adds the notion of "prettyName" to expressions, which defines the user-facing name for the expression.

Author: Reynold Xin <rxin@databricks.com>

Closes #7174 from rxin/binary-opterator and squashes the following commits:

f31900d [Reynold Xin] [SPARK-8770][SQL] Create BinaryOperator abstract class.
fceb216 [Reynold Xin] Merge branch 'master' of github.com:apache/spark into binary-opterator
d8518cf [Reynold Xin] Updated Python tests.
2015-07-01 21:14:13 -07:00
Davies Liu f958f27e20 [SPARK-8766] support non-ascii character in column names
Use UTF-8 to encode the name of column in Python 2, or it may failed to encode with default encoding ('ascii').

This PR also fix a bug when there is Java exception without error message.

Author: Davies Liu <davies@databricks.com>

Closes #7165 from davies/non_ascii and squashes the following commits:

02cb61a [Davies Liu] fix tests
3b09d31 [Davies Liu] add encoding in header
867754a [Davies Liu] support non-ascii character in column names
2015-07-01 16:43:18 -07:00
zsxwing 75b9fe4c5f [SPARK-8378] [STREAMING] Add the Python API for Flume
Author: zsxwing <zsxwing@gmail.com>

Closes #6830 from zsxwing/flume-python and squashes the following commits:

78dfdac [zsxwing] Fix the compile error in the test code
f1bf3c0 [zsxwing] Address TD's comments
0449723 [zsxwing] Add sbt goal streaming-flume-assembly/assembly
e93736b [zsxwing] Fix the test case for determine_modules_to_test
9d5821e [zsxwing] Fix pyspark_core dependencies
f9ee681 [zsxwing] Merge branch 'master' into flume-python
7a55837 [zsxwing] Add streaming_flume_assembly to run-tests.py
b96b0de [zsxwing] Merge branch 'master' into flume-python
ce85e83 [zsxwing] Fix incompatible issues for Python 3
01cbb3d [zsxwing] Add import sys
152364c [zsxwing] Fix the issue that StringIO doesn't work in Python 3
14ba0ff [zsxwing] Add flume-assembly for sbt building
b8d5551 [zsxwing] Merge branch 'master' into flume-python
4762c34 [zsxwing] Fix the doc
0336579 [zsxwing] Refactor Flume unit tests and also add tests for Python API
9f33873 [zsxwing] Add the Python API for Flume
2015-07-01 11:59:24 -07:00
Joseph K. Bradley b8faa32875 [SPARK-8765] [MLLIB] [PYTHON] removed flaky python PIC test
See failure: [https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36133/console]

CC yanboliang  mengxr

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #7164 from jkbradley/pic-python-test and squashes the following commits:

156d55b [Joseph K. Bradley] removed flaky python PIC test
2015-07-01 11:57:52 -07:00
lewuathe 184de91d15 [SPARK-6263] [MLLIB] Python MLlib API missing items: Utils
Implement missing API in pyspark.

MLUtils
* appendBias
* loadVectors

`kFold` is also missing however I am not sure `ClassTag` can be passed or restored through python.

Author: lewuathe <lewuathe@me.com>

Closes #5707 from Lewuathe/SPARK-6263 and squashes the following commits:

16863ea [lewuathe] Merge master
3fc27e7 [lewuathe] Merge branch 'master' into SPARK-6263
6084e9c [lewuathe] Resolv conflict
d2aa2a0 [lewuathe] Resolv conflict
9c329d8 [lewuathe] Fix efficiency
3a12a2d [lewuathe] Merge branch 'master' into SPARK-6263
1d4714b [lewuathe] Fix style
b29e2bc [lewuathe] Remove scipy dependencies
e32eb40 [lewuathe] Merge branch 'master' into SPARK-6263
25d3c9d [lewuathe] Remove unnecessary imports
7ec04db [lewuathe] Resolv conflict
1502d13 [lewuathe] Resolv conflict
d6bd416 [lewuathe] Check existence of scipy.sparse
5d555b1 [lewuathe] Construct scipy.sparse matrix
c345a44 [lewuathe] Merge branch 'master' into SPARK-6263
b8b5ef7 [lewuathe] Fix unnecessary sort method
d254be7 [lewuathe] Merge branch 'master' into SPARK-6263
62a9c7e [lewuathe] Fix appendBias return type
454c73d [lewuathe] Merge branch 'master' into SPARK-6263
a353354 [lewuathe] Remove unnecessary appendBias implementation
44295c2 [lewuathe] Merge branch 'master' into SPARK-6263
64f72ad [lewuathe] Merge branch 'master' into SPARK-6263
c728046 [lewuathe] Fix style
2980569 [lewuathe] [SPARK-6263] Python MLlib API missing items: Utils
2015-07-01 11:14:07 -07:00
cocoatomo fdcad6ef48 [SPARK-8763] [PYSPARK] executing run-tests.py with Python 2.6 fails with absence of subprocess.check_output function
Running run-tests.py with Python 2.6 cause following error:

```
Running PySpark tests. Output is in python//Users/tomohiko/.jenkins/jobs/pyspark_test/workspace/python/unit-tests.log
Will test against the following Python executables: ['python2.6', 'python3.4', 'pypy']
Will test the following Python modules: ['pyspark-core', 'pyspark-ml', 'pyspark-mllib', 'pyspark-sql', 'pyspark-streaming']
Traceback (most recent call last):
  File "./python/run-tests.py", line 196, in <module>
    main()
  File "./python/run-tests.py", line 159, in main
    python_implementation = subprocess.check_output(
AttributeError: 'module' object has no attribute 'check_output'
...
```

The cause of this error is using subprocess.check_output function, which exists since Python 2.7.
(ref. https://docs.python.org/2.7/library/subprocess.html#subprocess.check_output)

Author: cocoatomo <cocoatomo77@gmail.com>

Closes #7161 from cocoatomo/issues/8763-test-fails-py26 and squashes the following commits:

cf4f901 [cocoatomo] [SPARK-8763] backport process.check_output function from Python 2.7
2015-07-01 09:37:09 -07:00
x1- b6e76edf30 [SPARK-8535] [PYSPARK] PySpark : Can't create DataFrame from Pandas dataframe with no explicit column name
Because implicit name of `pandas.columns` are Int, but `StructField` json expect `String`.
So I think `pandas.columns` are should be convert to `String`.

### issue

* [SPARK-8535 PySpark : Can't create DataFrame from Pandas dataframe with no explicit column name](https://issues.apache.org/jira/browse/SPARK-8535)

Author: x1- <viva008@gmail.com>

Closes #7124 from x1-/SPARK-8535 and squashes the following commits:

d68fd38 [x1-] modify unit-test using pandas.
ea1897d [x1-] For implicit name of pandas.columns are Int, so should be convert to String.
2015-06-30 20:35:46 -07:00
Tarek Auel ccdb05222a [SPARK-8727] [SQL] Missing python api; md5, log2
Jira: https://issues.apache.org/jira/browse/SPARK-8727

Author: Tarek Auel <tarek.auel@gmail.com>
Author: Tarek Auel <tarek.auel@googlemail.com>

Closes #7114 from tarekauel/missing-python and squashes the following commits:

ef4c61b [Tarek Auel] [SPARK-8727] revert dataframe change
4029d4d [Tarek Auel] removed dataframe pi and e unit test
66f0d2b [Tarek Auel] removed pi and e from python api and dataframe api; added _to_java_column(col) for strlen
4d07318 [Tarek Auel] fixed python unit test
45f2bee [Tarek Auel] fixed result of pi and e
c39f47b [Tarek Auel] add python api
bd50a3a [Tarek Auel] add missing python functions
2015-06-30 17:00:51 -07:00
Davies Liu 58ee2a2e47 [SPARK-8738] [SQL] [PYSPARK] capture SQL AnalysisException in Python API
Capture the AnalysisException in SQL, hide the long java stack trace, only show the error message.

cc rxin

Author: Davies Liu <davies@databricks.com>

Closes #7135 from davies/ananylis and squashes the following commits:

dad7ae7 [Davies Liu] add comment
ec0c0e8 [Davies Liu] Update utils.py
cdd7edd [Davies Liu] add doc
7b044c2 [Davies Liu] fix python 3
f84d3bd [Davies Liu] capture SQL AnalysisException in Python API
2015-06-30 16:17:46 -07:00
MechCoder 5fa0863626 [SPARK-8679] [PYSPARK] [MLLIB] Default values in Pipeline API should be immutable
It might be dangerous to have a mutable as value for default param. (http://stackoverflow.com/a/11416002/1170730)

e.g

    def func(example, f={}):
        f[example] = 1
        return f

    func(2)

    {2: 1}
    func(3)
    {2:1, 3:1}

mengxr

Author: MechCoder <manojkumarsivaraj334@gmail.com>

Closes #7058 from MechCoder/pipeline_api_playground and squashes the following commits:

40a5eb2 [MechCoder] copy
95f7ff2 [MechCoder] [SPARK-8679] [PySpark] [MLlib] Default values in Pipeline API should be immutable
2015-06-30 10:27:29 -07:00
MechCoder 45281664e0 [SPARK-4127] [MLLIB] [PYSPARK] Python bindings for StreamingLinearRegressionWithSGD
Python bindings for StreamingLinearRegressionWithSGD

Author: MechCoder <manojkumarsivaraj334@gmail.com>

Closes #6744 from MechCoder/spark-4127 and squashes the following commits:

d8f6457 [MechCoder] Moved StreamingLinearAlgorithm to pyspark.mllib.regression
d47cc24 [MechCoder] Inherit from StreamingLinearAlgorithm
1b4ddd6 [MechCoder] minor
4de6c68 [MechCoder] Minor refactor
5e85a3b [MechCoder] Add tests for simultaneous training and prediction
fb27889 [MechCoder] Add example and docs
505380b [MechCoder] Add tests
d42bdae [MechCoder] [SPARK-4127] Python bindings for StreamingLinearRegressionWithSGD
2015-06-30 10:25:59 -07:00
zsxwing 12671dd5e4 [SPARK-8434][SQL]Add a "pretty" parameter to the "show" method to display long strings
Sometimes the user may want to show the complete content of cells. Now `sql("set -v").show()` displays:

![screen shot 2015-06-18 at 4 34 51 pm](https://cloud.githubusercontent.com/assets/1000778/8227339/14d3c5ea-15d9-11e5-99b9-f00b7e93beef.png)

The user needs to use something like `sql("set -v").collect().foreach(r => r.toSeq.mkString("\t"))` to show the complete content.

This PR adds a `pretty` parameter to show. If `pretty` is false, `show` won't truncate strings or align cells right.

![screen shot 2015-06-18 at 4 21 44 pm](https://cloud.githubusercontent.com/assets/1000778/8227407/b6f8dcac-15d9-11e5-8219-8079280d76fc.png)

Author: zsxwing <zsxwing@gmail.com>

Closes #6877 from zsxwing/show and squashes the following commits:

22e28e9 [zsxwing] pretty -> truncate
e582628 [zsxwing] Add pretty parameter to the show method in R
a3cd55b [zsxwing] Fix calling showString in R
923cee4 [zsxwing] Add a "pretty" parameter to show to display long strings
2015-06-29 23:44:11 -07:00
Josh Rosen 6c5a6db4d5 [SPARK-5161] [HOTFIX] Fix bug in Python test failure reporting
This patch fixes a bug introduced in #7031 which can cause Jenkins to incorrectly report a build with failed Python tests as passing if an error occurred while printing the test failure message.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #7112 from JoshRosen/python-tests-hotfix and squashes the following commits:

c3f2961 [Josh Rosen] Hotfix for bug in Python test failure reporting
2015-06-29 23:08:51 -07:00
Josh Rosen 7bbbe380c5 [SPARK-5161] Parallelize Python test execution
This commit parallelizes the Python unit test execution, significantly reducing Jenkins build times.  Parallelism is now configurable by passing the `-p` or `--parallelism` flags to either `dev/run-tests` or `python/run-tests` (the default parallelism is 4, but I've successfully tested with higher parallelism).

To avoid flakiness, I've disabled the Spark Web UI for the Python tests, similar to what we've done for the JVM tests.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #7031 from JoshRosen/parallelize-python-tests and squashes the following commits:

feb3763 [Josh Rosen] Re-enable other tests
f87ea81 [Josh Rosen] Only log output from failed tests
d4ded73 [Josh Rosen] Logging improvements
a2717e1 [Josh Rosen] Make parallelism configurable via dev/run-tests
1bacf1b [Josh Rosen] Merge remote-tracking branch 'origin/master' into parallelize-python-tests
110cd9d [Josh Rosen] Fix universal_newlines for Python 3
cd13db8 [Josh Rosen] Also log python_implementation
9e31127 [Josh Rosen] Log Python --version output for each executable.
a2b9094 [Josh Rosen] Bump up parallelism.
5552380 [Josh Rosen] Python 3 fix
866b5b9 [Josh Rosen] Fix lazy logging warnings in Prospector checks
87cb988 [Josh Rosen] Skip MLLib tests for PyPy
8309bfe [Josh Rosen] Temporarily disable parallelism to debug a failure
9129027 [Josh Rosen] Disable Spark UI in Python tests
037b686 [Josh Rosen] Temporarily disable JVM tests so we can test Python speedup in Jenkins.
af4cef4 [Josh Rosen] Initial attempt at parallelizing Python test execution
2015-06-29 21:32:40 -07:00
Yanbo Liang f9b6bf2f83 [SPARK-7667] [MLLIB] MLlib Python API consistency check
MLlib Python API consistency check

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #6856 from yanboliang/spark-7667 and squashes the following commits:

21bae35 [Yanbo Liang] remove duplicate code
eb12f95 [Yanbo Liang] fix doc inherit problem
9e7ec3c [Yanbo Liang] address comments
e763d32 [Yanbo Liang] MLlib Python API consistency check
2015-06-29 18:50:23 -07:00
Feynman Liang 620605a4a1 [SPARK-8456] [ML] Ngram featurizer python
Python API for N-gram feature transformer

Author: Feynman Liang <fliang@databricks.com>

Closes #6960 from feynmanliang/ngram-featurizer-python and squashes the following commits:

f9e37c9 [Feynman Liang] Remove debugging code
4dd81f4 [Feynman Liang] Fix typo and doctest
06c79ac [Feynman Liang] Style guide
26c1175 [Feynman Liang] Add python NGram API
2015-06-29 18:40:30 -07:00
Ai He ecd3aacf28 [SPARK-7810] [PYSPARK] solve python rdd socket connection problem
Method "_load_from_socket" in rdd.py cannot load data from jvm socket when ipv6 is used. The current method only works well with ipv4. New modification should work around both two protocols.

Author: Ai He <ai.he@ussuning.com>
Author: AiHe <ai.he@ussuning.com>

Closes #6338 from AiHe/pyspark-networking-issue and squashes the following commits:

d4fc9c4 [Ai He] handle code review 2
e75c5c8 [Ai He] handle code review
5644953 [AiHe] solve python rdd socket connection problem to jvm
2015-06-29 14:36:26 -07:00
Ilya Ganelin f6fc254ec4 [SPARK-8056][SQL] Design an easier way to construct schema for both Scala and Python
I've added functionality to create new StructType similar to how we add parameters to a new SparkContext.

I've also added tests for this type of creation.

Author: Ilya Ganelin <ilya.ganelin@capitalone.com>

Closes #6686 from ilganeli/SPARK-8056B and squashes the following commits:

27c1de1 [Ilya Ganelin] Rename
467d836 [Ilya Ganelin] Removed from_string in favor of _parse_Datatype_json_value
5fef5a4 [Ilya Ganelin] Updates for type parsing
4085489 [Ilya Ganelin] Style errors
3670cf5 [Ilya Ganelin] added string to DataType conversion
8109e00 [Ilya Ganelin] Fixed error in tests
41ab686 [Ilya Ganelin] Fixed style errors
e7ba7e0 [Ilya Ganelin] Moved some python tests to tests.py. Added cleaner handling of null data type and added test for correctness of input format
15868fa [Ilya Ganelin] Fixed python errors
b79b992 [Ilya Ganelin] Merge remote-tracking branch 'upstream/master' into SPARK-8056B
a3369fc [Ilya Ganelin] Fixing space errors
e240040 [Ilya Ganelin] Style
bab7823 [Ilya Ganelin] Constructor error
73d4677 [Ilya Ganelin] Style
4ed00d9 [Ilya Ganelin] Fixed default arg
67df57a [Ilya Ganelin] Removed Foo
04cbf0c [Ilya Ganelin] Added comments for single object
0484d7a [Ilya Ganelin] Restored second method
6aeb740 [Ilya Ganelin] Style
689e54d [Ilya Ganelin] Style
f497e9e [Ilya Ganelin] Got rid of old code
e3c7a88 [Ilya Ganelin] Fixed doctest failure
a62ccde [Ilya Ganelin] Style
966ac06 [Ilya Ganelin] style checks
dabb7e6 [Ilya Ganelin] Added Python tests
a3f4152 [Ilya Ganelin] added python bindings and better comments
e6e536c [Ilya Ganelin] Added extra space
7529a2e [Ilya Ganelin] Fixed formatting
d388f86 [Ilya Ganelin] Fixed small bug
c4e3bf5 [Ilya Ganelin] Reverted to using parse. Updated parse to support long
d7634b6 [Ilya Ganelin] Reverted to fromString to properly support types
22c39d5 [Ilya Ganelin] replaced FromString with DataTypeParser.parse. Replaced empty constructor initializing a null to have it instead create a new array to allow appends to it.
faca398 [Ilya Ganelin] [SPARK-8056] Replaced default argument usage. Updated usage and code for DataType.fromString
1acf76e [Ilya Ganelin] Scala style
e31c674 [Ilya Ganelin] Fixed bug in test
8dc0795 [Ilya Ganelin] Added tests for creation of StructType object with new methods
fdf7e9f [Ilya Ganelin] [SPARK-8056] Created add methods to facilitate building new StructType objects.
2015-06-29 14:15:15 -07:00
Davies Liu afae9766f2 [SPARK-8070] [SQL] [PYSPARK] avoid spark jobs in createDataFrame
Avoid the unnecessary jobs when infer schema from list.

cc yhuai mengxr

Author: Davies Liu <davies@databricks.com>

Closes #6606 from davies/improve_create and squashes the following commits:

a5928bf [Davies Liu] Update MimaExcludes.scala
62da911 [Davies Liu] fix mima
bab4d7d [Davies Liu] Merge branch 'improve_create' of github.com:davies/spark into improve_create
eee44a8 [Davies Liu] Merge branch 'master' of github.com:apache/spark into improve_create
8d9292d [Davies Liu] Update context.py
eb24531 [Davies Liu] Update context.py
c969997 [Davies Liu] bug fix
d5a8ab0 [Davies Liu] fix tests
8c3f10d [Davies Liu] Merge branch 'master' of github.com:apache/spark into improve_create
6ea5925 [Davies Liu] address comments
6ceaeff [Davies Liu] avoid spark jobs in createDataFrame
2015-06-29 13:20:55 -07:00
Vladimir Vladimirov 492dca3a73 [SPARK-8528] Expose SparkContext.applicationId in PySpark
Use case - we want to log applicationId (YARN in hour case) to request help with troubleshooting from the DevOps

Author: Vladimir Vladimirov <vladimir.vladimirov@magnetic.com>

Closes #6936 from smartkiwi/master and squashes the following commits:

870338b [Vladimir Vladimirov] this would make doctest to run in python3
0eae619 [Vladimir Vladimirov] Scala doesn't use u'...' for unicode literals
14d77a8 [Vladimir Vladimirov] stop using ELLIPSIS
b4ebfc5 [Vladimir Vladimirov] addressed PR feedback - updated docstring
223a32f [Vladimir Vladimirov] fixed test - applicationId is property that returns the string
3221f5a [Vladimir Vladimirov] [SPARK-8528] added documentation for Scala
2cff090 [Vladimir Vladimirov] [SPARK-8528] add applicationId property for SparkContext object in pyspark
2015-06-29 12:03:41 -07:00
Tarek Auel a5c2961caa [SPARK-8235] [SQL] misc function sha / sha1
Jira: https://issues.apache.org/jira/browse/SPARK-8235

I added the support for sha1. If I understood rxin correctly, sha and sha1 should execute the same algorithm, shouldn't they?

Please take a close look on the Python part. This is adopted from #6934

Author: Tarek Auel <tarek.auel@gmail.com>
Author: Tarek Auel <tarek.auel@googlemail.com>

Closes #6963 from tarekauel/SPARK-8235 and squashes the following commits:

f064563 [Tarek Auel] change to shaHex
7ce3cdc [Tarek Auel] rely on automatic cast
a1251d6 [Tarek Auel] Merge remote-tracking branch 'upstream/master' into SPARK-8235
68eb043 [Tarek Auel] added docstring
be5aff1 [Tarek Auel] improved error message
7336c96 [Tarek Auel] added type check
cf23a80 [Tarek Auel] simplified example
ebf75ef [Tarek Auel] [SPARK-8301] updated the python documentation. Removed sha in python and scala
6d6ff0d [Tarek Auel] [SPARK-8233] added docstring
ea191a9 [Tarek Auel] [SPARK-8233] fixed signatureof python function. Added expected type to misc
e3fd7c3 [Tarek Auel] SPARK[8235] added sha to the list of __all__
e5dad4e [Tarek Auel] SPARK[8235] sha / sha1
2015-06-29 11:57:19 -07:00
Reynold Xin 660c6cec75 [SPARK-8698] partitionBy in Python DataFrame reader/writer interface should not default to empty tuple.
Author: Reynold Xin <rxin@databricks.com>

Closes #7079 from rxin/SPARK-8698 and squashes the following commits:

8513e1c [Reynold Xin] [SPARK-8698] partitionBy in Python DataFrame reader/writer interface should not default to empty tuple.
2015-06-29 00:22:44 -07:00
Cheolsoo Park ac2e17b01c [SPARK-8355] [SQL] Python DataFrameReader/Writer should mirror Scala
I compared PySpark DataFrameReader/Writer against Scala ones. `Option` function is missing in both reader and writer, but the rest seems to all match.

I added `Option` to reader and writer and updated the `pyspark-sql` test.

Author: Cheolsoo Park <cheolsoop@netflix.com>

Closes #7078 from piaozhexiu/SPARK-8355 and squashes the following commits:

c63d419 [Cheolsoo Park] Fix version
524e0aa [Cheolsoo Park] Add option function to df reader and writer
2015-06-29 00:13:39 -07:00
Yanbo Liang dfde31da5c [SPARK-5962] [MLLIB] Python support for Power Iteration Clustering
Python support for Power Iteration Clustering
https://issues.apache.org/jira/browse/SPARK-5962

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #6992 from yanboliang/pyspark-pic and squashes the following commits:

6b03d82 [Yanbo Liang] address comments
4be4423 [Yanbo Liang] Python support for Power Iteration Clustering
2015-06-28 22:38:04 -07:00
Feynman Liang 25f574eb9a [SPARK-7212] [MLLIB] Add sequence learning flag
Support mining of ordered frequent item sequences.

Author: Feynman Liang <fliang@databricks.com>

Closes #6997 from feynmanliang/fp-sequence and squashes the following commits:

7c14e15 [Feynman Liang] Improve scalatests with R code and Seq
0d3e4b6 [Feynman Liang] Fix python test
ce987cb [Feynman Liang] Backwards compatibility aux constructor
34ef8f2 [Feynman Liang] Fix failing test due to reverse orderering
f04bd50 [Feynman Liang] Naming, add ordered to FreqItemsets, test ordering using Seq
648d4d4 [Feynman Liang] Test case for frequent item sequences
252a36a [Feynman Liang] Add sequence learning flag
2015-06-28 22:26:07 -07:00
Josh Rosen 40648c56cd [SPARK-8583] [SPARK-5482] [BUILD] Refactor python/run-tests to integrate with dev/run-tests module system
This patch refactors the `python/run-tests` script:

- It's now written in Python instead of Bash.
- The descriptions of the tests to run are now stored in `dev/run-tests`'s modules.  This allows the pull request builder to skip Python tests suites that were not affected by the pull request's changes.  For example, we can now skip the PySpark Streaming test cases when only SQL files are changed.
- `python/run-tests` now supports command-line flags to make it easier to run individual test suites (this addresses SPARK-5482):

  ```
Usage: run-tests [options]

Options:
  -h, --help            show this help message and exit
  --python-executables=PYTHON_EXECUTABLES
                        A comma-separated list of Python executables to test
                        against (default: python2.6,python3.4,pypy)
  --modules=MODULES     A comma-separated list of Python modules to test
                        (default: pyspark-core,pyspark-ml,pyspark-mllib
                        ,pyspark-sql,pyspark-streaming)
   ```
- `dev/run-tests` has been split into multiple files: the module definitions and test utility functions are now stored inside of a `dev/sparktestsupport` Python module, allowing them to be re-used from the Python test runner script.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #6967 from JoshRosen/run-tests-python-modules and squashes the following commits:

f578d6d [Josh Rosen] Fix print for Python 2.x
8233d61 [Josh Rosen] Add python/run-tests.py to Python lint checks
34c98d2 [Josh Rosen] Fix universal_newlines for Python 3
8f65ed0 [Josh Rosen] Fix handling of  module in python/run-tests
37aff00 [Josh Rosen] Python 3 fix
27a389f [Josh Rosen] Skip MLLib tests for PyPy
c364ccf [Josh Rosen] Use which() to convert PYSPARK_PYTHON to an absolute path before shelling out to run tests
568a3fd [Josh Rosen] Fix hashbang
3b852ae [Josh Rosen] Fall back to PYSPARK_PYTHON when sys.executable is None (fixes a test)
f53db55 [Josh Rosen] Remove python2 flag, since the test runner script also works fine under Python 3
9c80469 [Josh Rosen] Fix passing of PYSPARK_PYTHON
d33e525 [Josh Rosen] Merge remote-tracking branch 'origin/master' into run-tests-python-modules
4f8902c [Josh Rosen] Python lint fixes.
8f3244c [Josh Rosen] Use universal_newlines to fix dev/run-tests doctest failures on Python 3.
f542ac5 [Josh Rosen] Fix lint check for Python 3
fff4d09 [Josh Rosen] Add dev/sparktestsupport to pep8 checks
2efd594 [Josh Rosen] Update dev/run-tests to use new Python test runner flags
b2ab027 [Josh Rosen] Add command-line options for running individual suites in python/run-tests
caeb040 [Josh Rosen] Fixes to PySpark test module definitions
d6a77d3 [Josh Rosen] Fix the tests of dev/run-tests
def2d8a [Josh Rosen] Two minor fixes
aec0b8f [Josh Rosen] Actually get the Kafka stuff to run properly
04015b9 [Josh Rosen] First attempt at getting PySpark Kafka test to work in new runner script
4c97136 [Josh Rosen] PYTHONPATH fixes
dcc9c09 [Josh Rosen] Fix time division
32660fc [Josh Rosen] Initial cut at Python test runner refactoring
311c6a9 [Josh Rosen] Move shell utility functions to own module.
1bdeb87 [Josh Rosen] Move module definitions to separate file.
2015-06-27 20:24:34 -07:00
Josh Rosen 41afa16500 [SPARK-8652] [PYSPARK] Check return value for all uses of doctest.testmod()
This patch addresses a critical issue in the PySpark tests:

Several of our Python modules' `__main__` methods call `doctest.testmod()` in order to run doctests but forget to check and handle its return value. As a result, some PySpark test failures can go unnoticed because they will not fail the build.

Fortunately, there was only one test failure which was masked by this bug: a `pyspark.profiler` doctest was failing due to changes in RDD pipelining.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #7032 from JoshRosen/testmod-fix and squashes the following commits:

60dbdc0 [Josh Rosen] Account for int vs. long formatting change in Python 3
8b8d80a [Josh Rosen] Fix failing test.
e6423f9 [Josh Rosen] Check return code for all uses of doctest.testmod().
2015-06-26 08:12:22 -07:00
Liang-Chi Hsieh 47c874babe [SPARK-8237] [SQL] Add misc function sha2
JIRA: https://issues.apache.org/jira/browse/SPARK-8237

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #6934 from viirya/expr_sha2 and squashes the following commits:

35e0bb3 [Liang-Chi Hsieh] For comments.
68b5284 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into expr_sha2
8573aff [Liang-Chi Hsieh] Remove unnecessary Product.
ee61e06 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into expr_sha2
59e41aa [Liang-Chi Hsieh] Add misc function: sha2.
2015-06-25 22:07:37 -07:00
Yanbo Liang 2519dcc33b [MINOR] [MLLIB] rename some functions of PythonMLLibAPI
Keep the same naming conventions for PythonMLLibAPI.
Only the following three functions is different from others
```scala
trainNaiveBayes
trainGaussianMixture
trainWord2Vec
```
So change them to
```scala
trainNaiveBayesModel
trainGaussianMixtureModel
trainWord2VecModel
```
It does not affect any users and public APIs, only to make better understand for developer and code hacker.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #7011 from yanboliang/py-mllib-api-rename and squashes the following commits:

771ffec [Yanbo Liang] rename some functions of PythonMLLibAPI
2015-06-25 08:13:17 -07:00
MechCoder fb32c38898 [SPARK-7633] [MLLIB] [PYSPARK] Python bindings for StreamingLogisticRegressionwithSGD
Add Python bindings to StreamingLogisticRegressionwithSGD.

No Java wrappers are needed as models are updated directly using train.

Author: MechCoder <manojkumarsivaraj334@gmail.com>

Closes #6849 from MechCoder/spark-3258 and squashes the following commits:

b4376a5 [MechCoder] minor
d7e5fc1 [MechCoder] Refactor into StreamingLinearAlgorithm Better docs
9c09d4e [MechCoder] [SPARK-7633] Python bindings for StreamingLogisticRegressionwithSGD
2015-06-24 14:58:43 -07:00
Reynold Xin a458efc66c Revert "[SPARK-7157][SQL] add sampleBy to DataFrame"
This reverts commit 0401cbaa8e.

The new test case on Jenkins is failing.
2015-06-23 19:30:25 -07:00
Xiangrui Meng 0401cbaa8e [SPARK-7157][SQL] add sampleBy to DataFrame
Add `sampleBy` to DataFrame. rxin

Author: Xiangrui Meng <meng@databricks.com>

Closes #6769 from mengxr/SPARK-7157 and squashes the following commits:

991f26f [Xiangrui Meng] fix seed
4a14834 [Xiangrui Meng] move sampleBy to stat
832f7cc [Xiangrui Meng] add sampleBy to DataFrame
2015-06-23 17:46:29 -07:00
Davies Liu 7fb5ae5024 [SPARK-8573] [SPARK-8568] [SQL] [PYSPARK] raise Exception if column is used in booelan expression
It's a common mistake that user will put Column in a boolean expression (together with `and` , `or`), which does not work as expected, we should raise a exception in that case, and suggest user to use `&`, `|` instead.

Author: Davies Liu <davies@databricks.com>

Closes #6961 from davies/column_bool and squashes the following commits:

9f19beb [Davies Liu] update message
af74bd6 [Davies Liu] fix tests
07dff84 [Davies Liu] address comments, fix tests
f70c08e [Davies Liu] raise Exception if column is used in booelan expression
2015-06-23 15:51:16 -07:00
MechCoder f2022fa0d3 [SPARK-8265] [MLLIB] [PYSPARK] Add LinearDataGenerator to pyspark.mllib.utils
It is useful to generate linear data for easy testing of linear models and in general. Scala already has it. This is just a wrapper around the Scala code.

Author: MechCoder <manojkumarsivaraj334@gmail.com>

Closes #6715 from MechCoder/generate_linear_input and squashes the following commits:

6182884 [MechCoder] Minor changes
8bda047 [MechCoder] Minor style fixes
0f1053c [MechCoder] [SPARK-8265] Add LinearDataGenerator to pyspark.mllib.utils
2015-06-23 12:43:32 -07:00
Scott Taylor f0dcbe8a7c [SPARK-8541] [PYSPARK] test the absolute error in approx doctests
A minor change but one which is (presumably) visible on the public api docs webpage.

Author: Scott Taylor <github@megatron.me.uk>

Closes #6942 from megatron-me-uk/patch-3 and squashes the following commits:

fbed000 [Scott Taylor] test the absolute error in approx doctests
2015-06-22 23:37:56 -07:00
Holden Karau 164fe2aa44 [SPARK-7781] [MLLIB] gradient boosted trees.train regressor missing max bins
Author: Holden Karau <holden@pigscanfly.ca>

Closes #6331 from holdenk/SPARK-7781-GradientBoostedTrees.trainRegressor-missing-max-bins and squashes the following commits:

2894695 [Holden Karau] remove extra blank line
2573e8d [Holden Karau] Update the scala side of the pythonmllibapi and make the test a bit nicer too
3a09170 [Holden Karau] add maxBins to to the train method as well
af7f274 [Holden Karau] Add maxBins to GradientBoostedTrees.trainRegressor and correctly mention the default of 32 in other places where it mentioned 100
2015-06-22 22:40:19 -07:00
Yin Huai 5ab9fcfb01 [SPARK-8532] [SQL] In Python's DataFrameWriter, save/saveAsTable/json/parquet/jdbc always override mode
https://issues.apache.org/jira/browse/SPARK-8532

This PR has two changes. First, it fixes the bug that save actions (i.e. `save/saveAsTable/json/parquet/jdbc`) always override mode. Second, it adds input argument `partitionBy` to `save/saveAsTable/parquet`.

Author: Yin Huai <yhuai@databricks.com>

Closes #6937 from yhuai/SPARK-8532 and squashes the following commits:

f972d5d [Yin Huai] davies's comment.
d37abd2 [Yin Huai] style.
d21290a [Yin Huai] Python doc.
889eb25 [Yin Huai] Minor refactoring and add partitionBy to save, saveAsTable, and parquet.
7fbc24b [Yin Huai] Use None instead of "error" as the default value of mode since JVM-side already uses "error" as the default value.
d696dff [Yin Huai] Python style.
88eb6c4 [Yin Huai] If mode is "error", do not call mode method.
c40c461 [Yin Huai] Regression test.
2015-06-22 13:51:23 -07:00
Wenchen Fan da7bbb9435 [SPARK-8104] [SQL] auto alias expressions in analyzer
Currently we auto alias expression in parser. However, during parser phase we don't have enough information to do the right alias. For example, Generator that has more than 1 kind of element need MultiAlias, ExtractValue don't need Alias if it's in middle of a ExtractValue chain.

Author: Wenchen Fan <cloud0fan@outlook.com>

Closes #6647 from cloud-fan/alias and squashes the following commits:

552eba4 [Wenchen Fan] fix python
5b5786d [Wenchen Fan] fix agg
73a90cb [Wenchen Fan] fix case-preserve of ExtractValue
4cfd23c [Wenchen Fan] fix order by
d18f401 [Wenchen Fan] refine
9f07359 [Wenchen Fan] address comments
39c1aef [Wenchen Fan] small fix
33640ec [Wenchen Fan] auto alias expressions in analyzer
2015-06-22 12:13:00 -07:00
Yu ISHIKAWA 5d89d9f00b [SPARK-8511] [PYSPARK] Modify a test to remove a saved model in regression.py
[[SPARK-8511] Modify a test to remove a saved model in `regression.py` - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-8511)

Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>

Closes #6926 from yu-iskw/SPARK-8511 and squashes the following commits:

7cd0948 [Yu ISHIKAWA] Use `shutil.rmtree()` to temporary directories for saving model testings, instead of `os.removedirs()`
4a01c9e [Yu ISHIKAWA] [SPARK-8511][pyspark] Modify a test to remove a saved model in `regression.py`
2015-06-22 11:53:11 -07:00
Yanbo Liang 32e3cdaa64 [SPARK-7604] [MLLIB] Python API for PCA and PCAModel
Python API for PCA and PCAModel

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #6315 from yanboliang/spark-7604 and squashes the following commits:

1d58734 [Yanbo Liang] remove transform() in PCAModel, use default behavior
4d9d121 [Yanbo Liang] Python API for PCA and PCAModel
2015-06-21 12:04:20 -07:00
Liang-Chi Hsieh 0b8995168f [SPARK-8468] [ML] Take the negative of some metrics in RegressionEvaluator to get correct cross validation
JIRA: https://issues.apache.org/jira/browse/SPARK-8468

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #6905 from viirya/cv_min and squashes the following commits:

930d3db [Liang-Chi Hsieh] Fix python unit test and add document.
d632135 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into cv_min
16e3b2c [Liang-Chi Hsieh] Take the negative instead of reciprocal.
c3dd8d9 [Liang-Chi Hsieh] For comments.
b5f52c1 [Liang-Chi Hsieh] Add param to CrossValidator for choosing whether to maximize evaulation value.
2015-06-20 13:01:59 -07:00
MechCoder 54976e55e3 [SPARK-4118] [MLLIB] [PYSPARK] Python bindings for StreamingKMeans
Python bindings for StreamingKMeans

Will change status to MRG once docs, tests and examples are updated.

Author: MechCoder <manojkumarsivaraj334@gmail.com>

Closes #6499 from MechCoder/spark-4118 and squashes the following commits:

7722d16 [MechCoder] minor style fixes
51052d3 [MechCoder] Doc fixes
2061a76 [MechCoder] Add tests for simultaneous training and prediction Minor style fixes
81482fd [MechCoder] minor
5d9fe61 [MechCoder] predictOn should take into account the latest model
8ab9e89 [MechCoder] Fix Python3 error
a9817df [MechCoder] Better tests and minor fixes
c80e451 [MechCoder] Add ignore_unicode_prefix
ee8ce16 [MechCoder] Update tests, doc and examples
4b1481f [MechCoder] Some changes and tests
d8b066a [MechCoder] [SPARK-4118] [MLlib] [PySpark] Python bindings for StreamingKMeans
2015-06-19 12:23:15 -07:00
Liang-Chi Hsieh 2c59d5c12a [SPARK-8207] [SQL] Add math function bin
JIRA: https://issues.apache.org/jira/browse/SPARK-8207

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #6721 from viirya/expr_bin and squashes the following commits:

07e1c8f [Liang-Chi Hsieh] Remove AbstractUnaryMathExpression and let BIN inherit UnaryExpression.
0677f1a [Liang-Chi Hsieh] For comments.
cf62b95 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into expr_bin
0cf20f2 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into expr_bin
dea9c12 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into expr_bin
d4f4774 [Liang-Chi Hsieh] Add @ignore_unicode_prefix.
7a0196f [Liang-Chi Hsieh] Fix python style.
ac2bacd [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into expr_bin
a0a2d0f [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into expr_bin
4cb764d [Liang-Chi Hsieh] For comments.
0f78682 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into expr_bin
c0c3197 [Liang-Chi Hsieh] Add bin to FunctionRegistry.
824f761 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into expr_bin
50e0c3b [Liang-Chi Hsieh] Add math function bin(a: long): string.
2015-06-19 10:09:31 -07:00
Kevin Conor fdf63f1249 [SPARK-8339] [PYSPARK] integer division for python 3
Itertools islice requires an integer for the stop argument.  Switching to integer division here prevents a ValueError when vs is evaluated above.

davies

This is my original work, and I license it to the project.

Author: Kevin Conor <kevin@discoverybayconsulting.com>

Closes #6794 from kconor/kconor-patch-1 and squashes the following commits:

da5e700 [Kevin Conor] Integer division for batch size
2015-06-19 00:12:20 -07:00
Bryan Cutler a2016b4bc4 [SPARK-8444] [STREAMING] Adding Python streaming example for queueStream
A Python example similar to the existing one for Scala.

Author: Bryan Cutler <bjcutler@us.ibm.com>

Closes #6884 from BryanCutler/streaming-queueStream-example-8444 and squashes the following commits:

435ba7e [Bryan Cutler] [SPARK-8444] Fixed style checks, increased sleep time to show empty queue
257abb0 [Bryan Cutler] [SPARK-8444] Stop context gracefully, Removed unused import, Added description comment
376ef6e [Bryan Cutler] [SPARK-8444] Fixed bug causing DStream.pprint to append empty parenthesis to output instead of blank line
1ff5f8b [Bryan Cutler] [SPARK-8444] Adding Python streaming example for queue_stream
2015-06-19 00:07:53 -07:00
Reynold Xin dc41313899 [SPARK-8218][SQL] Binary log math function update.
Some minor updates based on after merging #6725.

Author: Reynold Xin <rxin@databricks.com>

Closes #6871 from rxin/log and squashes the following commits:

ab51542 [Reynold Xin] Use JVM log
76fc8de [Reynold Xin] Fixed arg.
a7c1522 [Reynold Xin] [SPARK-8218][SQL] Binary log math function update.
2015-06-18 18:41:15 -07:00
Davies Liu 9b20027222 [SPARK-8202] [PYSPARK] fix infinite loop during external sort in PySpark
The batch size during external sort will grow up to max 10000, then shrink down to zero, causing infinite loop.
Given the assumption that the items usually have similar size, so we don't need to adjust the batch size after first spill.

cc JoshRosen rxin angelini

Author: Davies Liu <davies@databricks.com>

Closes #6714 from davies/batch_size and squashes the following commits:

b170dfb [Davies Liu] update test
b9be832 [Davies Liu] Merge branch 'batch_size' of github.com:davies/spark into batch_size
6ade745 [Davies Liu] update test
5c21777 [Davies Liu] Update shuffle.py
e746aec [Davies Liu] fix batch size during sort
2015-06-18 13:45:58 -07:00
Liang-Chi Hsieh fee3438a32 [SPARK-8218][SQL] Add binary log math function
JIRA: https://issues.apache.org/jira/browse/SPARK-8218

Because there is already `log` unary function defined, the binary log function is called `logarithm` for now.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #6725 from viirya/expr_binary_log and squashes the following commits:

bf96bd9 [Liang-Chi Hsieh] Compare log result in string.
102070d [Liang-Chi Hsieh] Round log result to better comparing in python test.
fd01863 [Liang-Chi Hsieh] For comments.
beed631 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into expr_binary_log
6089d11 [Liang-Chi Hsieh] Remove unnecessary override.
8cf37b7 [Liang-Chi Hsieh] For comments.
bc89597 [Liang-Chi Hsieh] For comments.
db7dc38 [Liang-Chi Hsieh] Use ctor instead of companion object.
0634ef7 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into expr_binary_log
1750034 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into expr_binary_log
3d75bfc [Liang-Chi Hsieh] Fix scala style.
5b39c02 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into expr_binary_log
23c54a3 [Liang-Chi Hsieh] Fix scala style.
ebc9929 [Liang-Chi Hsieh] Let Logarithm accept one parameter too.
605574d [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into expr_binary_log
21c3bfd [Liang-Chi Hsieh] Fix scala style.
c6c187f [Liang-Chi Hsieh] For comments.
c795342 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into expr_binary_log
f373bac [Liang-Chi Hsieh] Add binary log expression.
2015-06-17 23:31:30 -07:00
MechCoder 22732e1eca [SPARK-7605] [MLLIB] [PYSPARK] Python API for ElementwiseProduct
Python API for org.apache.spark.mllib.feature.ElementwiseProduct

Author: MechCoder <manojkumarsivaraj334@gmail.com>

Closes #6346 from MechCoder/spark-7605 and squashes the following commits:

79d1ef5 [MechCoder] Consistent and support list / array types
5f81d81 [MechCoder] [SPARK-7605] [MLlib] Python API for ElementwiseProduct
2015-06-17 22:08:38 -07:00
zsxwing 0fc4b96f3e [SPARK-8373] [PYSPARK] Add emptyRDD to pyspark and fix the issue when calling sum on an empty RDD
This PR fixes the sum issue and also adds `emptyRDD` so that it's easy to create a test case.

Author: zsxwing <zsxwing@gmail.com>

Closes #6826 from zsxwing/python-emptyRDD and squashes the following commits:

b36993f [zsxwing] Update the return type to JavaRDD[T]
71df047 [zsxwing] Add emptyRDD to pyspark and fix the issue when calling sum on an empty RDD
2015-06-17 13:59:39 -07:00
MechCoder 6765ef98df [SPARK-6390] [SQL] [MLlib] Port MatrixUDT to PySpark
MatrixUDT was recently coded in scala. This has been ported to PySpark

Author: MechCoder <manojkumarsivaraj334@gmail.com>

Closes #6354 from MechCoder/spark-6390 and squashes the following commits:

fc4dc1e [MechCoder] Better error message
c940a44 [MechCoder] Added test
aa9c391 [MechCoder] Add pyUDT to MatrixUDT
62a2a7d [MechCoder] [SPARK-6390] Port MatrixUDT to PySpark
2015-06-17 11:10:16 -07:00
Yanbo Liang ca998757e8 [SPARK-7916] [MLLIB] MLlib Python doc parity check for classification and regression
Check then make the MLlib Python classification and regression doc to be as complete as the Scala doc.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #6460 from yanboliang/spark-7916 and squashes the following commits:

f8deda4 [Yanbo Liang] trigger jenkins
6dc4d99 [Yanbo Liang] address comments
ce2a43e [Yanbo Liang] truncate too long line and remove extra sparse
3eaf6ad [Yanbo Liang] MLlib Python doc parity check for classification and regression
2015-06-16 14:30:30 -07:00
Davies Liu 424b0075a1 [SPARK-6411] [SQL] [PySpark] support date/datetime with timezone in Python
Spark SQL does not support timezone, and Pyrolite does not support timezone well. This patch will convert datetime into POSIX timestamp (without confusing of timezone), which is used by SQL. If the datetime object does not have timezone, it's treated as local time.

The timezone in RDD will be lost after one round trip, all the datetime from SQL will be local time.

Because of Pyrolite, datetime from SQL only has precision as 1 millisecond.

This PR also drop the timezone in date, convert it to number of days since epoch (used in SQL).

Author: Davies Liu <davies@databricks.com>

Closes #6250 from davies/tzone and squashes the following commits:

44d8497 [Davies Liu] add timezone support for DateType
99d9d9c [Davies Liu] use int for timestamp
10aa7ca [Davies Liu] Merge branch 'master' of github.com:apache/spark into tzone
6a29aa4 [Davies Liu] support datetime with timezone
2015-06-11 01:00:41 -07:00
Davies Liu 37719e0cd0 [SPARK-8189] [SQL] use Long for TimestampType in SQL
This PR change to use Long as internal type for TimestampType for efficiency, which means it will the precision below 100ns.

Author: Davies Liu <davies@databricks.com>

Closes #6733 from davies/timestamp and squashes the following commits:

d9565fa [Davies Liu] remove print
65cf2f1 [Davies Liu] fix Timestamp in SparkR
86fecfb [Davies Liu] disable two timestamp tests
8f77ee0 [Davies Liu] fix scala style
246ee74 [Davies Liu] address comments
309d2e1 [Davies Liu] use Long for TimestampType in SQL
2015-06-10 16:55:39 -07:00
Reynold Xin 1b499993ad [SPARK-7886] Add built-in expressions to FunctionRegistry.
This patch switches to using FunctionRegistry for built-in expressions. It is based on #6463, but with some work to simplify it along with unit tests.

TODOs for future pull requests:
- Use static registration so we don't need to register all functions every time we start a new SQLContext
- Switch to using this in HiveContext

Author: Reynold Xin <rxin@databricks.com>
Author: Santiago M. Mola <santi@mola.io>

Closes #6710 from rxin/udf-registry and squashes the following commits:

6930822 [Reynold Xin] Fixed Python test.
b802c9a [Reynold Xin] Made UDF case insensitive.
e60d815 [Reynold Xin] Made UDF case insensitive.
852f9c0 [Reynold Xin] Fixed style violation.
e76a3c1 [Reynold Xin] Fixed parser.
52ddaba [Reynold Xin] Fixed compilation.
ee7854f [Reynold Xin] Improved error reporting.
ff906f2 [Reynold Xin] More robust constructor calling.
77b46f1 [Reynold Xin] Simplified the code.
2a2a149 [Reynold Xin] Merge pull request #6463 from smola/SPARK-7886
8616924 [Santiago M. Mola] [SPARK-7886] Add built-in expressions to FunctionRegistry.
2015-06-09 16:24:38 +08:00
Liang-Chi Hsieh 7658eb28a2 [SPARK-7990][SQL] Add methods to facilitate equi-join on multiple joining keys
JIRA: https://issues.apache.org/jira/browse/SPARK-7990

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #6616 from viirya/multi_keys_equi_join and squashes the following commits:

cd5c888 [Liang-Chi Hsieh] Import reduce in python3.
c43722c [Liang-Chi Hsieh] For comments.
0400e89 [Liang-Chi Hsieh] Fix scala style.
cc90015 [Liang-Chi Hsieh] Add methods to facilitate equi-join on multiple joining keys.
2015-06-08 23:27:05 -07:00
cody koeninger b127ff8a0c [SPARK-2808] [STREAMING] [KAFKA] cleanup tests from
see if requiring producer acks eliminates the need for waitUntilLeaderOffset calls in tests

Author: cody koeninger <cody@koeninger.org>

Closes #5921 from koeninger/kafka-0.8.2-test-cleanup and squashes the following commits:

1e89dc8 [cody koeninger] Merge branch 'master' into kafka-0.8.2-test-cleanup
4662828 [cody koeninger] [Streaming][Kafka] filter mima issue for removal of method from private test class
af1e083 [cody koeninger] Merge branch 'master' into kafka-0.8.2-test-cleanup
4298ac2 [cody koeninger] [Streaming][Kafka] update comment to trigger jenkins attempt
1274afb [cody koeninger] [Streaming][Kafka] see if requiring producer acks eliminates the need for waitUntilLeaderOffset calls in tests
2015-06-07 21:42:45 +01:00
Reynold Xin 0ac47083f7 [SPARK-8146] DataFrame Python API: Alias replace in df.na
Author: Reynold Xin <rxin@databricks.com>

Closes #6688 from rxin/df-alias-replace and squashes the following commits:

774c19c [Reynold Xin] [SPARK-8146] DataFrame Python API: Alias replace in DataFrameNaFunctions.
2015-06-07 01:21:02 -07:00
MechCoder 5aa804f3c6 [SPARK-7639] [PYSPARK] [MLLIB] Python API for KernelDensity
Python API for KernelDensity

Author: MechCoder <manojkumarsivaraj334@gmail.com>

Closes #6387 from MechCoder/spark-7639 and squashes the following commits:

17abc62 [MechCoder] add tests
2de6540 [MechCoder] style tests
bf4acc0 [MechCoder] Added doctests
84359d5 [MechCoder] [SPARK-7639] Python API for KernelDensity
2015-06-06 14:52:14 -07:00
amey 356a4a9b93 [SPARK-7991] [PySpark] Adding support for passing lists to describe.
This is a minor change.

Author: amey <amey@skytree.net>

Closes #6655 from ameyc/JIRA-7991/support-passing-list-to-describe and squashes the following commits:

e8a1dff [amey] Adding support for passing lists to describe.
2015-06-05 13:49:33 -07:00
Ted Blackman e505460599 [SPARK-8116][PYSPARK] Allow sc.range() to take a single argument.
Author: Ted Blackman <ted.blackman@gmail.com>

Closes #6656 from belisarius222/branch-1.4 and squashes the following commits:

747cbc2 [Ted Blackman] [SPARK-8116][PYSPARK] Allow sc.range() to take a single argument.

(cherry picked from commit f02af7c8f7)
Signed-off-by: Reynold Xin <rxin@databricks.com>
2015-06-04 22:22:01 -07:00
Mike Dusenberry df7da07a86 [SPARK-7969] [SQL] Added a DataFrame.drop function that accepts a Column reference.
Added a `DataFrame.drop` function that accepts a `Column` reference rather than a `String`, and added associated unit tests.  Basically iterates through the `DataFrame` to find a column with an expression that is equivalent to that of the `Column` argument supplied to the function.

Author: Mike Dusenberry <dusenberrymw@gmail.com>

Closes #6585 from dusenberrymw/SPARK-7969_Drop_method_on_Dataframes_should_handle_Column and squashes the following commits:

514727a [Mike Dusenberry] Updating the @since tag of the drop(Column) function doc to reflect version 1.4.1 instead of 1.4.0.
2f1bb4e [Mike Dusenberry] Adding an additional assert statement to the 'drop column after join' unit test in order to make sure the correct column was indeed left over.
6bf7c0e [Mike Dusenberry] Minor code formatting change.
e583888 [Mike Dusenberry] Adding more Python doctests for the df.drop with column reference function to test joined datasets that have columns with the same name.
5f74401 [Mike Dusenberry] Updating DataFrame.drop with column reference function to use logicalPlan.output to prevent ambiguities resulting from columns with the same name. Also added associated unit tests for joined datasets with duplicate column names.
4b8bbe8 [Mike Dusenberry] Adding Python support for Dataframe.drop with a Column reference.
986129c [Mike Dusenberry] Added a DataFrame.drop function that accepts a Column reference rather than a String, and added associated unit tests.  Basically iterates through the DataFrame to find a column with an expression that is equivalent to one supplied to the function.
2015-06-04 11:30:07 -07:00
Reynold Xin 2c5a06cafd Update documentation for [SPARK-7980] [SQL] Support SQLContext.range(end) 2015-06-03 14:20:27 -07:00
animesh d053a31be9 [SPARK-7980] [SQL] Support SQLContext.range(end)
1. range() overloaded in SQLContext.scala
2. range() modified in python sql context.py
3. Tests added accordingly in DataFrameSuite.scala and python sql tests.py

Author: animesh <animesh@apache.spark>

Closes #6609 from animeshbaranawal/SPARK-7980 and squashes the following commits:

935899c [animesh] SPARK-7980:python+scala changes
2015-06-03 11:28:18 -07:00
Reynold Xin ce320cb2db [SPARK-8060] Improve DataFrame Python test coverage and documentation.
Author: Reynold Xin <rxin@databricks.com>

Closes #6601 from rxin/python-read-write-test-and-doc and squashes the following commits:

baa8ad5 [Reynold Xin] Code review feedback.
f081d47 [Reynold Xin] More documentation updates.
c9902fa [Reynold Xin] [SPARK-8060] Improve DataFrame Python reader/writer interface doc and testing.
2015-06-03 00:23:34 -07:00
MechCoder 452eb82dd7 [SPARK-8032] [PYSPARK] Make version checking for NumPy in MLlib more robust
The current checking does version `1.x' is less than `1.4' this will fail if x has greater than 1 digit, since x > 4, however `1.x` < `1.4`

It fails in my system since I have version `1.10` :P

Author: MechCoder <manojkumarsivaraj334@gmail.com>

Closes #6579 from MechCoder/np_ver and squashes the following commits:

15430f8 [MechCoder] fix syntax error
893fb7e [MechCoder] remove equal to
e35f0d4 [MechCoder] minor
e89376c [MechCoder] Better checking
22703dd [MechCoder] [SPARK-8032] Make version checking for NumPy in MLlib more robust
2015-06-02 23:24:47 -07:00
Davies Liu 605ddbb27c [SPARK-8038] [SQL] [PYSPARK] fix Column.when() and otherwise()
Thanks ogirardot, closes #6580

cc rxin JoshRosen

Author: Davies Liu <davies@databricks.com>

Closes #6590 from davies/when and squashes the following commits:

c0f2069 [Davies Liu] fix Column.when() and otherwise()
2015-06-02 13:38:06 -07:00