Commit graph

1358 commits

Author SHA1 Message Date
Nick Pentreath 18faa588ca [SPARK-16127][ML][PYPSARK] Audit @Since annotations related to ml.linalg
[SPARK-14615](https://issues.apache.org/jira/browse/SPARK-14615) and #12627 changed `spark.ml` pipelines to use the new `ml.linalg` classes for `Vector`/`Matrix`. Some `Since` annotations for public methods/vals have not been updated accordingly to be `2.0.0`. This PR updates them.

## How was this patch tested?

Existing unit tests.

Author: Nick Pentreath <nickp@za.ibm.com>

Closes #13840 from MLnick/SPARK-16127-ml-linalg-since.
2016-06-22 10:05:25 -07:00
Holden Karau d281b0bafe [SPARK-15162][SPARK-15164][PYSPARK][DOCS][ML] update some pydocs
## What changes were proposed in this pull request?

Mark ml.classification algorithms as experimental to match Scala algorithms, update PyDoc for for thresholds on `LogisticRegression` to have same level of info as Scala, and enable mathjax for PyDoc.

## How was this patch tested?

Built docs locally & PySpark SQL tests

Author: Holden Karau <holden@us.ibm.com>

Closes #12938 from holdenk/SPARK-15162-SPARK-15164-update-some-pydocs.
2016-06-22 11:54:49 +02:00
Bryan Cutler b76e355376 [SPARK-15741][PYSPARK][ML] Pyspark cleanup of set default seed to None
## What changes were proposed in this pull request?

Several places set the seed Param default value to None which will translate to a zero value on the Scala side.  This is unnecessary because a default fixed value already exists and if a test depends on a zero valued seed, then it should explicitly set it to zero instead of relying on this translation.  These cases can be safely removed except for the ALS doc test, which has been changed to set the seed value to zero.

## How was this patch tested?

Ran PySpark tests locally

Author: Bryan Cutler <cutlerb@gmail.com>

Closes #13672 from BryanCutler/pyspark-cleanup-setDefault-seed-SPARK-15741.
2016-06-21 11:43:25 -07:00
Davies Liu 2d6919bea9 [SPARK-16086] [SQL] [PYSPARK] create Row without any fields
## What changes were proposed in this pull request?

This PR allows us to create a Row without any fields.

## How was this patch tested?

Added a test for empty row and udf without arguments.

Author: Davies Liu <davies@databricks.com>

Closes #13812 from davies/no_argus.
2016-06-21 10:53:33 -07:00
Reynold Xin 93338807aa [SPARK-13792][SQL] Addendum: Fix Python API
## What changes were proposed in this pull request?
This is a follow-up to https://github.com/apache/spark/pull/13795 to properly set CSV options in Python API. As part of this, I also make the Python option setting for both CSV and JSON more robust against positional errors.

## How was this patch tested?
N/A

Author: Reynold Xin <rxin@databricks.com>

Closes #13800 from rxin/SPARK-13792-2.
2016-06-21 10:47:51 -07:00
Nick Pentreath 37494a18e8 [SPARK-10258][DOC][ML] Add @Since annotations to ml.feature
This PR adds missing `Since` annotations to `ml.feature` package.

Closes #8505.

## How was this patch tested?

Existing tests.

Author: Nick Pentreath <nickp@za.ibm.com>

Closes #13641 from MLnick/add-since-annotations.
2016-06-21 00:39:47 -07:00
Xiangrui Meng ce49bfc255 Revert "[SPARK-16086] [SQL] fix Python UDF without arguments (for 1.6)"
This reverts commit a46553cbac.
2016-06-21 00:32:51 -07:00
Reynold Xin c775bf09e0 [SPARK-13792][SQL] Limit logging of bad records in CSV data source
## What changes were proposed in this pull request?
This pull request adds a new option (maxMalformedLogPerPartition) in CSV reader to limit the maximum of logging message Spark generates per partition for malformed records.

The error log looks something like
```
16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
16/06/20 18:50:14 WARN CSVRelation: Dropping malformed line: adsf,1,4
16/06/20 18:50:14 WARN CSVRelation: More than 10 malformed records have been found on this partition. Malformed records from now on will not be logged.
```

Closes #12173

## How was this patch tested?
Manually tested.

Author: Reynold Xin <rxin@databricks.com>

Closes #13795 from rxin/SPARK-13792.
2016-06-20 21:46:12 -07:00
Davies Liu a46553cbac [SPARK-16086] [SQL] fix Python UDF without arguments (for 1.6)
Fix the bug for Python UDF that does not have any arguments.

Added regression tests.

Author: Davies Liu <davies.liu@gmail.com>

Closes #13793 from davies/fix_no_arguments.

(cherry picked from commit abe36c53d1)
Signed-off-by: Davies Liu <davies.liu@gmail.com>
2016-06-20 20:53:45 -07:00
Bryan Cutler a42bf55532 [SPARK-16079][PYSPARK][ML] Added missing import for DecisionTreeRegressionModel used in GBTClassificationModel
## What changes were proposed in this pull request?

Fixed missing import for DecisionTreeRegressionModel used in GBTClassificationModel trees method.

## How was this patch tested?

Local tests

Author: Bryan Cutler <cutlerb@gmail.com>

Closes #13787 from BryanCutler/pyspark-GBTClassificationModel-import-SPARK-16079.
2016-06-20 16:28:11 -07:00
Josh Howes e574c9973d [SPARK-15973][PYSPARK] Fix GroupedData Documentation
*This contribution is my original work and that I license the work to the project under the project's open source license.*

## What changes were proposed in this pull request?

Documentation updates to PySpark's GroupedData

## How was this patch tested?

Manual Tests

Author: Josh Howes <josh.howes@gmail.com>
Author: Josh Howes <josh.howes@maxpoint.com>

Closes #13724 from josh-howes/bugfix/SPARK-15973.
2016-06-17 23:43:31 -07:00
Jeff Zhang 898cb65255 [SPARK-15803] [PYSPARK] Support with statement syntax for SparkSession
## What changes were proposed in this pull request?

Support with statement syntax for SparkSession in pyspark

## How was this patch tested?

Manually verify it. Although I can add unit test for it, it would affect other unit test because the SparkContext is stopped after the with statement.

Author: Jeff Zhang <zjffdu@apache.org>

Closes #13541 from zjffdu/SPARK-15803.
2016-06-17 22:57:38 -07:00
andreapasqua 4c64e88d5b [SPARK-16035][PYSPARK] Fix SparseVector parser assertion for end parenthesis
## What changes were proposed in this pull request?
The check on the end parenthesis of the expression to parse was using the wrong variable. I corrected that.
## How was this patch tested?
Manual test

Author: andreapasqua <andrea@radius.com>

Closes #13750 from andreapasqua/sparse-vector-parser-assertion-fix.
2016-06-17 22:41:05 -07:00
Xiangrui Meng edb23f9e47 [SPARK-15946][MLLIB] Conversion between old/new vector columns in a DataFrame (Python)
## What changes were proposed in this pull request?

This PR implements python wrappers for #13662 to convert old/new vector columns in a DataFrame.

## How was this patch tested?

doctest in Python

cc: yanboliang

Author: Xiangrui Meng <meng@databricks.com>

Closes #13731 from mengxr/SPARK-15946.
2016-06-17 21:22:29 -07:00
Tathagata Das 084dca770f [SPARK-15981][SQL][STREAMING] Fixed bug and added tests in DataStreamReader Python API
## What changes were proposed in this pull request?

- Fixed bug in Python API of DataStreamReader.  Because a single path was being converted to a array before calling Java DataStreamReader method (which takes a string only), it gave the following error.
```
File "/Users/tdas/Projects/Spark/spark/python/pyspark/sql/readwriter.py", line 947, in pyspark.sql.readwriter.DataStreamReader.json
Failed example:
    json_sdf = spark.readStream.json(os.path.join(tempfile.mkdtemp(), 'data'),                 schema = sdf_schema)
Exception raised:
    Traceback (most recent call last):
      File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/doctest.py", line 1253, in __run
        compileflags, 1) in test.globs
      File "<doctest pyspark.sql.readwriter.DataStreamReader.json[0]>", line 1, in <module>
        json_sdf = spark.readStream.json(os.path.join(tempfile.mkdtemp(), 'data'),                 schema = sdf_schema)
      File "/Users/tdas/Projects/Spark/spark/python/pyspark/sql/readwriter.py", line 963, in json
        return self._df(self._jreader.json(path))
      File "/Users/tdas/Projects/Spark/spark/python/lib/py4j-0.10.1-src.zip/py4j/java_gateway.py", line 933, in __call__
        answer, self.gateway_client, self.target_id, self.name)
      File "/Users/tdas/Projects/Spark/spark/python/pyspark/sql/utils.py", line 63, in deco
        return f(*a, **kw)
      File "/Users/tdas/Projects/Spark/spark/python/lib/py4j-0.10.1-src.zip/py4j/protocol.py", line 316, in get_return_value
        format(target_id, ".", name, value))
    Py4JError: An error occurred while calling o121.json. Trace:
    py4j.Py4JException: Method json([class java.util.ArrayList]) does not exist
    	at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:318)
    	at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:326)
    	at py4j.Gateway.invoke(Gateway.java:272)
    	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:128)
    	at py4j.commands.CallCommand.execute(CallCommand.java:79)
    	at py4j.GatewayConnection.run(GatewayConnection.java:211)
    	at java.lang.Thread.run(Thread.java:744)
```

- Reduced code duplication between DataStreamReader and DataFrameWriter
- Added missing Python doctests

## How was this patch tested?
New tests

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13703 from tdas/SPARK-15981.
2016-06-16 13:17:41 -07:00
Davies Liu 5389013acc [SPARK-15888] [SQL] fix Python UDF with aggregate
## What changes were proposed in this pull request?

After we move the ExtractPythonUDF rule into physical plan, Python UDF can't work on top of aggregate anymore, because they can't be evaluated before aggregate, should be evaluated after aggregate. This PR add another rule to extract these kind of Python UDF from logical aggregate, create a Project on top of Aggregate.

## How was this patch tested?

Added regression tests. The plan of added test query looks like this:
```
== Parsed Logical Plan ==
'Project [<lambda>('k, 's) AS t#26]
+- Aggregate [<lambda>(key#5L)], [<lambda>(key#5L) AS k#17, sum(cast(<lambda>(value#6) as bigint)) AS s#22L]
   +- LogicalRDD [key#5L, value#6]

== Analyzed Logical Plan ==
t: int
Project [<lambda>(k#17, s#22L) AS t#26]
+- Aggregate [<lambda>(key#5L)], [<lambda>(key#5L) AS k#17, sum(cast(<lambda>(value#6) as bigint)) AS s#22L]
   +- LogicalRDD [key#5L, value#6]

== Optimized Logical Plan ==
Project [<lambda>(agg#29, agg#30L) AS t#26]
+- Aggregate [<lambda>(key#5L)], [<lambda>(key#5L) AS agg#29, sum(cast(<lambda>(value#6) as bigint)) AS agg#30L]
   +- LogicalRDD [key#5L, value#6]

== Physical Plan ==
*Project [pythonUDF0#37 AS t#26]
+- BatchEvalPython [<lambda>(agg#29, agg#30L)], [agg#29, agg#30L, pythonUDF0#37]
   +- *HashAggregate(key=[<lambda>(key#5L)#31], functions=[sum(cast(<lambda>(value#6) as bigint))], output=[agg#29,agg#30L])
      +- Exchange hashpartitioning(<lambda>(key#5L)#31, 200)
         +- *HashAggregate(key=[pythonUDF0#34 AS <lambda>(key#5L)#31], functions=[partial_sum(cast(pythonUDF1#35 as bigint))], output=[<lambda>(key#5L)#31,sum#33L])
            +- BatchEvalPython [<lambda>(key#5L), <lambda>(value#6)], [key#5L, value#6, pythonUDF0#34, pythonUDF1#35]
               +- Scan ExistingRDD[key#5L,value#6]
```

Author: Davies Liu <davies@databricks.com>

Closes #13682 from davies/fix_py_udf.
2016-06-15 13:38:04 -07:00
Tathagata Das 9a5071996b [SPARK-15953][WIP][STREAMING] Renamed ContinuousQuery to StreamingQuery
Renamed for simplicity, so that its obvious that its related to streaming.

Existing unit tests.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13673 from tdas/SPARK-15953.
2016-06-15 10:46:07 -07:00
Shixiong Zhu 0ee9fd9e52 [SPARK-15935][PYSPARK] Fix a wrong format tag in the error message
## What changes were proposed in this pull request?

A follow up PR for #13655 to fix a wrong format tag.

## How was this patch tested?

Jenkins unit tests.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #13665 from zsxwing/fix.
2016-06-14 19:45:11 -07:00
Tathagata Das 214adb14b8 [SPARK-15933][SQL][STREAMING] Refactored DF reader-writer to use readStream and writeStream for streaming DFs
## What changes were proposed in this pull request?
Currently, the DataFrameReader/Writer has method that are needed for streaming and non-streaming DFs. This is quite awkward because each method in them through runtime exception for one case or the other. So rather having half the methods throw runtime exceptions, its just better to have a different reader/writer API for streams.

- [x] Python API!!

## How was this patch tested?
Existing unit tests + two sets of unit tests for DataFrameReader/Writer and DataStreamReader/Writer.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13653 from tdas/SPARK-15933.
2016-06-14 17:58:45 -07:00
Shixiong Zhu 96c3500c66 [SPARK-15935][PYSPARK] Enable test for sql/streaming.py and fix these tests
## What changes were proposed in this pull request?

This PR just enables tests for sql/streaming.py and also fixes the failures.

## How was this patch tested?

Existing unit tests.

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #13655 from zsxwing/python-streaming-test.
2016-06-14 02:12:29 -07:00
Sandeep Singh 1842cdd4ee [SPARK-15663][SQL] SparkSession.catalog.listFunctions shouldn't include the list of built-in functions
## What changes were proposed in this pull request?
SparkSession.catalog.listFunctions currently returns all functions, including the list of built-in functions. This makes the method not as useful because anytime it is run the result set contains over 100 built-in functions.

## How was this patch tested?
CatalogSuite

Author: Sandeep Singh <sandeep@techaddict.me>

Closes #13413 from techaddict/SPARK-15663.
2016-06-13 21:58:52 -07:00
Liang-Chi Hsieh baa3e633e1 [SPARK-15364][ML][PYSPARK] Implement PySpark picklers for ml.Vector and ml.Matrix under spark.ml.python
## What changes were proposed in this pull request?

Now we have PySpark picklers for new and old vector/matrix, individually. However, they are all implemented under `PythonMLlibAPI`. To separate spark.mllib from spark.ml, we should implement the picklers of new vector/matrix under `spark.ml.python` instead.

## How was this patch tested?
Existing tests.

Author: Liang-Chi Hsieh <simonh@tw.ibm.com>

Closes #13219 from viirya/pyspark-pickler-ml.
2016-06-13 19:59:53 -07:00
Wenchen Fan e2ab79d5ea [SPARK-15898][SQL] DataFrameReader.text should return DataFrame
## What changes were proposed in this pull request?

We want to maintain API compatibility for DataFrameReader.text, and will introduce a new API called DataFrameReader.textFile which returns Dataset[String].

affected PRs:
https://github.com/apache/spark/pull/11731
https://github.com/apache/spark/pull/13104
https://github.com/apache/spark/pull/13184

## How was this patch tested?

N/A

Author: Wenchen Fan <wenchen@databricks.com>

Closes #13604 from cloud-fan/revert.
2016-06-12 21:36:41 -07:00
hyukjinkwon 9e204c62c6 [SPARK-15840][SQL] Add two missing options in documentation and some option related changes
## What changes were proposed in this pull request?

This PR

1. Adds the documentations for some missing options, `inferSchema` and `mergeSchema` for Python and Scala.

2. Fiixes `[[DataFrame]]` to ```:class:`DataFrame` ``` so that this can be shown

  - from
    ![2016-06-09 9 31 16](https://cloud.githubusercontent.com/assets/6477701/15929721/8b864734-2e89-11e6-83f6-207527de4ac9.png)

  - to (with class link)
    ![2016-06-09 9 31 00](https://cloud.githubusercontent.com/assets/6477701/15929717/8a03d728-2e89-11e6-8a3f-08294964db22.png)

  (Please refer [the latest documentation](https://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/api/python/pyspark.sql.html))

3. Moves `mergeSchema` option to `ParquetOptions` with removing unused options, `metastoreSchema` and `metastoreTableName`.

  They are not used anymore. They were removed in e720dda42e and there are no use cases as below:

  ```bash
  grep -r -e METASTORE_SCHEMA -e \"metastoreSchema\" -e \"metastoreTableName\" -e METASTORE_TABLE_NAME .
  ```

  ```
  ./sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala:  private[sql] val METASTORE_SCHEMA = "metastoreSchema"
  ./sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFileFormat.scala:  private[sql] val METASTORE_TABLE_NAME = "metastoreTableName"
  ./sql/hive/src/main/scala/org/apache/spark/sql/hive/HiveMetastoreCatalog.scala:        ParquetFileFormat.METASTORE_TABLE_NAME -> TableIdentifier(
```

  It only sets `metastoreTableName` in the last case but does not use the table name.

4. Sets the correct default values (in the documentation) for `compression` option for ORC(`snappy`, see [OrcOptions.scala#L33-L42](3ded5bc4db/sql/hive/src/main/scala/org/apache/spark/sql/hive/orc/OrcOptions.scala (L33-L42))) and Parquet(`the value specified in SQLConf`, see [ParquetOptions.scala#L38-L47](3ded5bc4db/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetOptions.scala (L38-L47))) and `columnNameOfCorruptRecord` for JSON(`the value specified in SQLConf`, see [JsonFileFormat.scala#L53-L55](4538443e27/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonFileFormat.scala (L53-L55)) and [JsonFileFormat.scala#L105-L106](4538443e27/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/json/JsonFileFormat.scala (L105-L106))).

## How was this patch tested?

Existing tests should cover this.

Author: hyukjinkwon <gurwls223@gmail.com>
Author: Hyukjin Kwon <gurwls223@gmail.com>

Closes #13576 from HyukjinKwon/SPARK-15840.
2016-06-11 23:20:40 -07:00
Takeshi YAMAMURO cb5d933d86 [SPARK-15585][SQL] Add doc for turning off quotations
## What changes were proposed in this pull request?
This pr is to add doc for turning off quotations because this behavior is different from `com.databricks.spark.csv`.

## How was this patch tested?
Check behavior  to put an empty string in csv options.

Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>

Closes #13616 from maropu/SPARK-15585-2.
2016-06-11 15:12:21 -07:00
Bryan Cutler 7d7a0a5e07 [SPARK-15738][PYSPARK][ML] Adding Pyspark ml RFormula __str__ method similar to Scala API
## What changes were proposed in this pull request?
Adding __str__ to RFormula and model that will show the set formula param and resolved formula.  This is currently present in the Scala API, found missing in PySpark during Spark 2.0 coverage review.

## How was this patch tested?
run pyspark-ml tests locally

Author: Bryan Cutler <cutlerb@gmail.com>

Closes #13481 from BryanCutler/pyspark-ml-rformula_str-SPARK-15738.
2016-06-10 11:27:30 -07:00
WeichenXu cdd7f5a57a [SPARK-15837][ML][PYSPARK] Word2vec python add maxsentence parameter
## What changes were proposed in this pull request?

Word2vec python add maxsentence parameter.

## How was this patch tested?

Existing test.

Author: WeichenXu <WeichenXu123@outlook.com>

Closes #13578 from WeichenXu123/word2vec_python_add_maxsentence.
2016-06-10 12:26:53 +01:00
Zheng RuiFeng 16ca32eace [SPARK-15823][PYSPARK][ML] Add @property for 'accuracy' in MulticlassMetrics
## What changes were proposed in this pull request?
`accuracy` should be decorated with `property` to keep step with other methods in `pyspark.MulticlassMetrics`, like `weightedPrecision`, `weightedRecall`, etc

## How was this patch tested?
manual tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #13560 from zhengruifeng/add_accuracy_property.
2016-06-10 10:09:19 +01:00
Jeff Zhang e594b49283 [SPARK-15788][PYSPARK][ML] PySpark IDFModel missing "idf" property
## What changes were proposed in this pull request?

add method idf to IDF in pyspark

## How was this patch tested?

add unit test

Author: Jeff Zhang <zjffdu@apache.org>

Closes #13540 from zjffdu/SPARK-15788.
2016-06-09 09:54:38 -07:00
Zheng RuiFeng 00ad4f054c [SPARK-14900][ML][PYSPARK] Add accuracy and deprecate precison,recall,f1
## What changes were proposed in this pull request?
1, add accuracy for MulticlassMetrics
2, deprecate overall precision,recall,f1 and recommend accuracy usage

## How was this patch tested?
manual tests in pyspark shell

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #13511 from zhengruifeng/deprecate_py_precisonrecall.
2016-06-06 15:19:22 +01:00
Yanbo Liang a95252823e [SPARK-15771][ML][EXAMPLES] Use 'accuracy' rather than 'precision' in many ML examples
## What changes were proposed in this pull request?
Since [SPARK-15617](https://issues.apache.org/jira/browse/SPARK-15617) deprecated ```precision``` in ```MulticlassClassificationEvaluator```, many ML examples broken.
```python
pyspark.sql.utils.IllegalArgumentException: u'MulticlassClassificationEvaluator_4c3bb1d73d8cc0cedae6 parameter metricName given invalid value precision.'
```
We should use ```accuracy``` to replace ```precision``` in these examples.

## How was this patch tested?
Offline tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #13519 from yanboliang/spark-15771.
2016-06-06 09:36:34 +01:00
Zheng RuiFeng fd8af39713 [MINOR] Fix Typos 'an -> a'
## What changes were proposed in this pull request?

`an -> a`

Use cmds like `find . -name '*.R' | xargs -i sh -c "grep -in ' an [^aeiou]' {} && echo {}"` to generate candidates, and review them one by one.

## How was this patch tested?
manual tests

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #13515 from zhengruifeng/an_a.
2016-06-06 09:35:47 +01:00
Reynold Xin 32f2f95dbd Revert "[SPARK-15585][SQL] Fix NULL handling along with a spark-csv behaivour"
This reverts commit b7e8d1cb3c.
2016-06-05 23:40:13 -07:00
Takeshi YAMAMURO b7e8d1cb3c [SPARK-15585][SQL] Fix NULL handling along with a spark-csv behaivour
## What changes were proposed in this pull request?
This pr fixes the behaviour of `format("csv").option("quote", null)` along with one of spark-csv.
Also, it explicitly sets default values for CSV options in python.

## How was this patch tested?
Added tests in CSVSuite.

Author: Takeshi YAMAMURO <linguin.m.s@gmail.com>

Closes #13372 from maropu/SPARK-15585.
2016-06-05 23:35:04 -07:00
Ruifeng Zheng 2099e05f93 [SPARK-15617][ML][DOC] Clarify that fMeasure in MulticlassMetrics is "micro" f1_score
## What changes were proposed in this pull request?
1, del precision,recall in  `ml.MulticlassClassificationEvaluator`
2, update user guide for `mlllib.weightedFMeasure`

## How was this patch tested?
local build

Author: Ruifeng Zheng <ruifengz@foxmail.com>

Closes #13390 from zhengruifeng/clarify_f1.
2016-06-04 13:56:04 +01:00
Holden Karau 67cc89ff02 [SPARK-15168][PYSPARK][ML] Add missing params to MultilayerPerceptronClassifier
## What changes were proposed in this pull request?

MultilayerPerceptronClassifier is missing step size, solver, and weights. Add these params. Also clarify the scaladoc a bit while we are updating these params.

Eventually we should follow up and unify the HasSolver params (filed https://issues.apache.org/jira/browse/SPARK-15169 )

## How was this patch tested?

Doc tests

Author: Holden Karau <holden@us.ibm.com>

Closes #12943 from holdenk/SPARK-15168-add-missing-params-to-MultilayerPerceptronClassifier.
2016-06-03 15:56:17 -07:00
Holden Karau 72353311d3 [SPARK-15092][SPARK-15139][PYSPARK][ML] Pyspark TreeEnsemble missing methods
## What changes were proposed in this pull request?

Add `toDebugString` and `totalNumNodes` to `TreeEnsembleModels` and add `toDebugString` to `DecisionTreeModel`

## How was this patch tested?

Extended doc tests.

Author: Holden Karau <holden@us.ibm.com>

Closes #12919 from holdenk/SPARK-15139-pyspark-treeEnsemble-missing-methods.
2016-06-02 15:55:14 -07:00
Yanbo Liang 07a98ca4ce [SPARK-15587][ML] ML 2.0 QA: Scala APIs audit for ml.feature
## What changes were proposed in this pull request?
ML 2.0 QA: Scala APIs audit for ml.feature. Mainly include:
* Remove seed for ```QuantileDiscretizer```, since we use ```approxQuantile``` to produce bins and ```seed``` is useless.
* Scala API docs update.
* Sync Scala and Python API docs for these changes.

## How was this patch tested?
Exist tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #13410 from yanboliang/spark-15587.
2016-06-01 10:49:51 -07:00
Reynold Xin a71d1364ae [SPARK-15686][SQL] Move user-facing streaming classes into sql.streaming
## What changes were proposed in this pull request?
This patch moves all user-facing structured streaming classes into sql.streaming. As part of this, I also added some since version annotation to methods and classes that don't have them.

## How was this patch tested?
Updated tests to reflect the moves.

Author: Reynold Xin <rxin@databricks.com>

Closes #13429 from rxin/SPARK-15686.
2016-06-01 10:14:40 -07:00
Tathagata Das 90b11439b3 [SPARK-15517][SQL][STREAMING] Add support for complete output mode in Structure Streaming
## What changes were proposed in this pull request?
Currently structured streaming only supports append output mode.  This PR adds the following.

- Added support for Complete output mode in the internal state store, analyzer and planner.
- Added public API in Scala and Python for users to specify output mode
- Added checks for unsupported combinations of output mode and DF operations
  - Plans with no aggregation should support only Append mode
  - Plans with aggregation should support only Update and Complete modes
  - Default output mode is Append mode (**Question: should we change this to automatically set to Complete mode when there is aggregation?**)
- Added support for Complete output mode in Memory Sink. So Memory Sink internally supports append and complete, update. But from public API only Complete and Append output modes are supported.

## How was this patch tested?
Unit tests in various test suites
- StreamingAggregationSuite: tests for complete mode
- MemorySinkSuite: tests for checking behavior in Append and Complete modes.
- UnsupportedOperationSuite: tests for checking unsupported combinations of DF ops and output modes
- DataFrameReaderWriterSuite: tests for checking that output mode cannot be called on static DFs
- Python doc test and existing unit tests modified to call write.outputMode.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #13286 from tdas/complete-mode.
2016-05-31 15:57:01 -07:00
Yanbo Liang 594484cd83 [MINOR][DOC][ML] ml.clustering scala & python api doc sync
## What changes were proposed in this pull request?
Since we done Scala API audit for ml.clustering at #13148, we should also fix and update the corresponding Python API docs to keep them in sync.

## How was this patch tested?
Docs change, no tests.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #13291 from yanboliang/spark-15361-followup.
2016-05-31 14:56:43 -07:00
Shixiong Zhu 9a74de18a1 Revert "[SPARK-11753][SQL][TEST-HADOOP2.2] Make allowNonNumericNumbers option work
## What changes were proposed in this pull request?

This reverts commit c24b6b679c. Sent a PR to run Jenkins tests due to the revert conflicts of `dev/deps/spark-deps-hadoop*`.

## How was this patch tested?

Jenkins unit tests, integration tests, manual tests)

Author: Shixiong Zhu <shixiong@databricks.com>

Closes #13417 from zsxwing/revert-SPARK-11753.
2016-05-31 14:50:07 -07:00
yinxusen 130b8d07b8 [SPARK-15008][ML][PYSPARK] Add integration test for OneVsRest
## What changes were proposed in this pull request?

1. Add `_transfer_param_map_to/from_java` for OneVsRest;

2. Add `_compare_params` in ml/tests.py to help compare params.

3. Add `test_onevsrest` as the integration test for OneVsRest.

## How was this patch tested?

Python unit test.

Author: yinxusen <yinxusen@gmail.com>

Closes #12875 from yinxusen/SPARK-15008.
2016-05-27 13:18:29 -07:00
Zheng RuiFeng 6b1a6180e7 [MINOR] Fix Typos 'a -> an'
## What changes were proposed in this pull request?

`a` -> `an`

I use regex to generate potential error lines:
`grep -in ' a [aeiou]' mllib/src/main/scala/org/apache/spark/ml/*/*scala`
and review them line by line.

## How was this patch tested?

local build
`lint-java` checking

Author: Zheng RuiFeng <ruifengz@foxmail.com>

Closes #13317 from zhengruifeng/a_an.
2016-05-26 22:39:14 -07:00
Eric Liang 594a1bf200 [SPARK-15520][SQL] Also set sparkContext confs when using SparkSession builder in pyspark
## What changes were proposed in this pull request?

Also sets confs in the underlying sc when using SparkSession.builder.getOrCreate(). This is a bug-fix from a post-merge comment in https://github.com/apache/spark/pull/13289

## How was this patch tested?

Python doc-tests.

Author: Eric Liang <ekl@databricks.com>

Closes #13309 from ericl/spark-15520-1.
2016-05-26 12:05:47 -07:00
Jurriaan Pruis c875d81a3d [SPARK-15493][SQL] default QuoteEscapingEnabled flag to true when writing CSV
## What changes were proposed in this pull request?

Default QuoteEscapingEnabled flag to true when writing CSV and add an escapeQuotes option to be able to change this.

See f3eb2af263/src/main/java/com/univocity/parsers/csv/CsvWriterSettings.java (L231-L247)

This change is needed to be able to write RFC 4180 compatible CSV files (https://tools.ietf.org/html/rfc4180#section-2)

https://issues.apache.org/jira/browse/SPARK-15493

## How was this patch tested?

Added a test that verifies the output is quoted correctly.

Author: Jurriaan Pruis <email@jurriaanpruis.nl>

Closes #13267 from jurriaan/quote-escaping.
2016-05-25 12:40:16 -07:00
Nick Pentreath 1cb347fbc4 [SPARK-15500][DOC][ML][PYSPARK] Remove default value in Param doc field in ALS
Remove "Default: MEMORY_AND_DISK" from `Param` doc field in ALS storage level params. This fixes up the output of `explainParam(s)` so that default values are not displayed twice.

We can revisit in the case that [SPARK-15130](https://issues.apache.org/jira/browse/SPARK-15130) moves ahead with adding defaults in some way to PySpark param doc fields.

Tests N/A.

Author: Nick Pentreath <nickp@za.ibm.com>

Closes #13277 from MLnick/SPARK-15500-als-remove-default-storage-param.
2016-05-25 20:41:53 +02:00
Eric Liang 8239fdcb9b [SPARK-15520][SQL] SparkSession builder in python should also allow overriding confs of existing sessions
## What changes were proposed in this pull request?

This fixes the python SparkSession builder to allow setting confs correctly. This was a leftover TODO from https://github.com/apache/spark/pull/13200.

## How was this patch tested?

Python doc tests.

cc andrewor14

Author: Eric Liang <ekl@databricks.com>

Closes #13289 from ericl/spark-15520.
2016-05-25 10:49:11 -07:00
Holden Karau cd9f16906c [SPARK-15412][PYSPARK][SPARKR][DOCS] Improve linear isotonic regression pydoc & doc build insturctions
## What changes were proposed in this pull request?

PySpark: Add links to the predictors from the models in regression.py, improve linear and isotonic pydoc in minor ways.
User guide / R: Switch the installed package list to be enough to build the R docs on a "fresh" install on ubuntu and add sudo to match the rest of the commands.
User Guide: Add a note about using gem2.0 for systems with both 1.9 and 2.0 (e.g. some ubuntu but maybe more).

## How was this patch tested?

built pydocs locally, tested new user build instructions

Author: Holden Karau <holden@us.ibm.com>

Closes #13199 from holdenk/SPARK-15412-improve-linear-isotonic-regression-pydoc.
2016-05-24 22:20:00 -07:00
Liang-Chi Hsieh 695d9a0fd4 [SPARK-15433] [PYSPARK] PySpark core test should not use SerDe from PythonMLLibAPI
## What changes were proposed in this pull request?

Currently PySpark core test uses the `SerDe` from `PythonMLLibAPI` which includes many MLlib things. It should use `SerDeUtil` instead.

## How was this patch tested?
Existing tests.

Author: Liang-Chi Hsieh <simonh@tw.ibm.com>

Closes #13214 from viirya/pycore-use-serdeutil.
2016-05-24 10:10:41 -07:00