For now they are thin wrappers around the corresponding Hive UDAFs.
One limitation with these in Hive 0.13.0 is they only support aggregating primitive types.
I chose snake_case here instead of camelCase because it seems to be used in the majority of the multi-word fns.
Do we also want to add these to `functions.py`?
This approach was recommended here: https://github.com/apache/spark/pull/8592#issuecomment-154247089
marmbrus rxin
Author: Nick Buroojy <nick.buroojy@civitaslearning.com>
Closes #9526 from nburoojy/nick/udaf-alias.
(cherry picked from commit a6ee4f989d)
Signed-off-by: Michael Armbrust <michael@databricks.com>
Could jkbradley and davies review it?
- Create a wrapper class, `LDAModelWrapper`, for `LDAModel`, because we can't handle the return value of `describeTopics` in Scala from pyspark directly: `Array[(Array[Int], Array[Double])]` is too complicated to convert.
- Add `loadLDAModel` in `PythonMLlibAPI`, since `LDAModel` in Scala is an abstract class and we need to call `load` of `DistributedLDAModel`.
[[SPARK-8467] Add LDAModel.describeTopics() in Python - ASF JIRA](https://issues.apache.org/jira/browse/SPARK-8467)
Author: Yu ISHIKAWA <yuu.ishikawa@gmail.com>
Closes #8643 from yu-iskw/SPARK-8467-2.
https://issues.apache.org/jira/browse/SPARK-10116
This is really trivial, just happened to notice it -- if `XORShiftRandom.hashSeed` is really supposed to have random bits throughout (as the comment implies), it needs to do something for the conversion to `long`.
mengxr mkolod
Author: Imran Rashid <irashid@cloudera.com>
Closes #8314 from squito/SPARK-10116.
Following up on [SPARK-9836](https://issues.apache.org/jira/browse/SPARK-9836), we should also support summary statistics for ```intercept```.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #9485 from yanboliang/spark-11473.
This adds a failing test checking that `awaitTerminationOrTimeout` returns the expected value, and then fixes that failing test with the addition of a `return`.
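A minimal sketch of this kind of bug, using hypothetical stand-in classes rather than the real `StreamingContext`: a Python wrapper that calls the underlying method but drops the result implicitly returns `None`.

```python
class FakeJavaContext(object):
    """Hypothetical stand-in for the JVM-side streaming context."""

    def awaitTerminationOrTimeout(self, timeout_ms):
        # Pretend the context stopped before the timeout elapsed.
        return True


class BrokenWrapper(object):
    def __init__(self, jssc):
        self._jssc = jssc

    def awaitTerminationOrTimeout(self, timeout):
        # Bug: the result is computed but never returned, so callers get None.
        self._jssc.awaitTerminationOrTimeout(int(timeout * 1000))


class FixedWrapper(BrokenWrapper):
    def awaitTerminationOrTimeout(self, timeout):
        # Fix: propagate the value from the underlying call.
        return self._jssc.awaitTerminationOrTimeout(int(timeout * 1000))


broken_result = BrokenWrapper(FakeJavaContext()).awaitTerminationOrTimeout(0.5)
fixed_result = FixedWrapper(FakeJavaContext()).awaitTerminationOrTimeout(0.5)
```

Here `broken_result` is `None`, which is falsy, so a caller checking the return value would always conclude that the timeout was hit.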
tdas zsxwing
Author: Nick Evans <me@nicolasevans.org>
Closes #9336 from manygrams/fix_await_termination_or_timeout.
We added a bunch of higher order statistics such as skewness and kurtosis to GroupedData. I don't think they are common enough to justify being listed, since users can always use the normal statistics aggregate functions.
That is to say, after this change, we won't support
```scala
df.groupBy("key").kurtosis("colA", "colB")
```
However, we will still support
```scala
df.groupBy("key").agg(kurtosis(col("colA")), kurtosis(col("colB")))
```
Author: Reynold Xin <rxin@databricks.com>
Closes #9446 from rxin/SPARK-11489.
Add Python API for stddev/stddev_pop/stddev_samp/variance/var_pop/var_samp/skewness/kurtosis
Author: Davies Liu <davies@databricks.com>
Closes #9424 from davies/py_var.
This PR deprecates `runs` in k-means. `runs` introduces extra complexity and overhead in MLlib's k-means implementation. I haven't seen much usage with `runs` not equal to `1`. We don't have a unit test for it either. We can deprecate this method in 1.6, and void it in 1.7. It helps us simplify the implementation.
cc: srowen
Author: Xiangrui Meng <meng@databricks.com>
Closes #9322 from mengxr/SPARK-11358.
When creating a DataFrame from an RDD in PySpark, `createDataFrame` calls `.take(10)` to verify the first 10 rows of the RDD match the provided schema. Similar to https://issues.apache.org/jira/browse/SPARK-8070, but that issue affected cases where a schema was not provided.
Verifying the first 10 rows is of limited utility and causes the DAG to be executed non-lazily. If necessary, I believe this verification should be done lazily on all rows. However, since the caller is providing a schema to follow, I think it's acceptable to simply fail if the schema is incorrect.
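To illustrate the trade-off in plain Python (no Spark involved; `verify_row`, `eager_check` and `lazy_check` are hypothetical helpers for this sketch only): checking the first 10 rows eagerly forces work up front and still misses later bad rows, while a lazy check costs nothing until the data is consumed and then covers every row.

```python
def verify_row(row, schema):
    """Hypothetical check: each field must be an instance of the declared type."""
    if len(row) != len(schema):
        raise TypeError("expected %d fields, got %r" % (len(schema), row))
    for value, (name, expected) in zip(row, schema):
        if value is not None and not isinstance(value, expected):
            raise TypeError("field %s: %r is not %s" % (name, value, expected.__name__))
    return row


def eager_check(rows, schema, n=10):
    # Eager style: materializes and checks the first n rows immediately.
    for row in rows[:n]:
        verify_row(row, schema)
    return rows


def lazy_check(rows, schema):
    # Lazy style: nothing is verified until the result is consumed,
    # but then every row is verified, not just the first n.
    return (verify_row(row, schema) for row in rows)


schema = [("name", str), ("age", int)]
rows = [("a", 1)] * 10 + [("b", "oops")]   # the 11th row is bad

eager_check(rows, schema)            # passes: the bad row is never looked at
checked = lazy_check(rows, schema)   # no work done yet

try:
    list(checked)
    lazy_error = None
except TypeError as exc:
    lazy_error = str(exc)
```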
marmbrus We chatted about this at SparkSummitEU. davies you made a similar change for the infer-schema path in https://github.com/apache/spark/pull/6606
Author: Jason White <jason.white@shopify.com>
Closes #9392 from JasonMWhite/createDataFrame_without_take.
[SPARK-10668](https://issues.apache.org/jira/browse/SPARK-10668) provided the ```WeightedLeastSquares``` solver ("normal") in ```LinearRegression``` with L2 regularization in Scala and R. Python ML ```LinearRegression``` should also support setting the solver ("auto", "normal", "l-bfgs").
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #9328 from yanboliang/spark-11367.
Fix computation of root-sigma-inverse in multivariate Gaussian; add a test and fix related Python mixture model test.
Supersedes https://github.com/apache/spark/pull/9293
Author: Sean Owen <sowen@cloudera.com>
Closes #9309 from srowen/SPARK-11302.2.
implement {RandomForest, GBT, TreeEnsemble, TreeClassifier, TreeRegressor}Params for Python API
in pyspark/ml/{classification, regression}.py
Author: vectorijk <jiangkai@gmail.com>
Closes #9233 from vectorijk/spark-10024.
This PR adds addition and multiplication to PySpark's `BlockMatrix` class via `add` and `multiply` functions.
Author: Mike Dusenberry <mwdusenb@us.ibm.com>
Closes #9139 from dusenberrymw/SPARK-6488_Add_Addition_and_Multiplication_to_PySpark_BlockMatrix.
jerryshao tdas
I know this is kind of minor, and I know you all are busy, but this brings this class in line with the `OffsetRange` class, and makes tests a little more concise.
Instead of doing something like:
```
assert topic_and_partition_instance._topic == "foo"
assert topic_and_partition_instance._partition == 0
```
You can do something like:
```
assert topic_and_partition_instance == TopicAndPartition("foo", 0)
```
Before:
```
>>> from pyspark.streaming.kafka import TopicAndPartition
>>> TopicAndPartition("foo", 0) == TopicAndPartition("foo", 0)
False
```
After:
```
>>> from pyspark.streaming.kafka import TopicAndPartition
>>> TopicAndPartition("foo", 0) == TopicAndPartition("foo", 0)
True
```
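For reference, a sketch of how value-based equality of this kind is typically written. This is an illustrative stand-in class, not the code in `pyspark.streaming.kafka`; only the `_topic` and `_partition` attribute names come from the examples above.

```python
class TopicAndPartition(object):
    """Illustrative stand-in, not the pyspark.streaming.kafka class itself."""

    def __init__(self, topic, partition):
        self._topic = topic
        self._partition = partition

    def __eq__(self, other):
        if isinstance(other, self.__class__):
            return (self._topic == other._topic
                    and self._partition == other._partition)
        return False

    def __ne__(self, other):
        # Python 2 does not derive != from ==, so define it explicitly.
        return not self.__eq__(other)

    def __hash__(self):
        # Keep hashing consistent with equality.
        return hash((self._topic, self._partition))
```

Defining `__hash__` alongside `__eq__` is a choice made for this sketch so that equal instances also collapse in sets and dict keys.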
I couldn't find any tests - am I missing something?
Author: Nick Evans <me@nicolasevans.org>
Closes #9236 from manygrams/topic_and_partition_equality.
Duplicated the since decorator from pyspark.sql into pyspark (also tweaked to handle functions without docstrings).
Added since to methods + "versionadded::" to classes (derived from the git file history in pyspark).
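A rough sketch of a `since`-style decorator that tolerates missing docstrings. This is written from scratch for illustration and may differ from the decorator actually added to pyspark; the indentation regex in particular is an assumption.

```python
import re


def since(version):
    """Append a ``.. versionadded::`` directive to the decorated function's docstring."""
    indent_p = re.compile(r'\n( *)[^\n ]')

    def deco(f):
        doc = f.__doc__ or ""          # tolerate functions without a docstring
        indents = indent_p.findall(doc)
        indent = ' ' * (min(len(m) for m in indents) if indents else 0)
        f.__doc__ = doc.rstrip() + "\n\n%s.. versionadded:: %s" % (indent, version)
        return f
    return deco


@since("1.5.0")
def undocumented(x):
    return x


@since("1.4.0")
def documented(x):
    """Return x unchanged.

    More detail here.
    """
    return x
```

Without the `or ""` fallback, decorating `undocumented` would fail because `f.__doc__` is `None`.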
Author: noelsmith <mail@noelsmith.com>
Closes #8627 from noel-smith/SPARK-10271-since-mllib-clustering.
Namely "." shows up in some places in the template when using the param docstring and not in others
Author: Holden Karau <holden@pigscanfly.ca>
Closes #9017 from holdenk/SPARK-10767-Make-pyspark-shared-params-codegen-more-consistent.
Duplicated the since decorator from pyspark.sql into pyspark (also tweaked to handle functions without docstrings).
Added since to methods + "versionadded::" to classes derived from the file history.
Note - some methods are inherited from the regression module (i.e. LinearModel.intercept) so these won't have version numbers in the API docs until that model is updated.
Author: noelsmith <mail@noelsmith.com>
Closes #8626 from noel-smith/SPARK-10269-since-mlib-classification.
Duplicated the since decorator from pyspark.sql into pyspark (also tweaked to handle functions without docstrings).
Added since to public methods + "versionadded::" to classes (derived from the git file history in pyspark).
Note - I added also the tags to MultilabelMetrics even though it isn't declared as public in the __all__ statement... if that's incorrect - I'll remove.
Author: noelsmith <mail@noelsmith.com>
Closes #8628 from noel-smith/SPARK-10272-since-mllib-evalutation.
This commit refactors the `run-tests-jenkins` script into Python. This refactoring was done by brennonyork in #7401; this PR contains a few minor edits from joshrosen in order to bring it up to date with other recent changes.
From the original PR description (by brennonyork):
Currently a few things are left out that could, and I think should, be smaller JIRAs after this.
1. There are still a few areas where we use environment variables where we don't need to (like `CURRENT_BLOCK`). I might get around to fixing this one in lieu of everything else, but wanted to point that out.
2. The PR tests are still written in bash. I opted to not change those and just rewrite the runner into Python. This is a great follow-on JIRA IMO.
3. All of the linting scripts are still in bash as well; it would likely make sense to convert those in follow-on JIRAs too.
Closes #7401.
Author: Brennon York <brennon.york@capitalone.com>
Closes #9161 from JoshRosen/run-tests-jenkins-refactoring.
The `_verify_type()` function raised errors on type conversion problems but left out the object in question. The object is now included in the error message, so the user no longer has to debug to find out which object failed the type conversion.
The use case for me was a Pandas DataFrame that contained 'nan' as values for columns of Strings.
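A simplified sketch of the idea, not the real `pyspark.sql.types._verify_type`: the error message carries the offending value itself. The function name and message wording here are assumptions for illustration.

```python
def verify_type(obj, expected_type):
    """Hypothetical, simplified check; not the real pyspark.sql.types._verify_type."""
    if obj is None:
        return
    if not isinstance(obj, expected_type):
        # Include the offending object so the caller can find it in the data.
        raise TypeError("%s can not accept object %r in type %s"
                        % (expected_type.__name__, obj, type(obj).__name__))


try:
    verify_type(float("nan"), str)   # e.g. a NaN sitting in a column of strings
    message = None
except TypeError as exc:
    message = str(exc)
```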
Author: Mahmoud Lababidi <mahmoud@thehumangeo.com>
Author: Mahmoud Lababidi <lababidi@gmail.com>
Closes #9149 from lababidi/master.
Make sure comma-separated paths get processed correctly in ResolvedDataSource for a HadoopFsRelationProvider
Author: Koert Kuipers <koert@tresata.com>
Closes #8416 from koertkuipers/feat-sql-comma-separated-paths.
At this moment `SparseVector.__getitem__` executes `np.searchsorted` first and checks if result is in an expected range after that. It is possible to check if index can contain non-zero value before executing `np.searchsorted`.
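A sketch of the ordering described above, using the standard library's `bisect_left` as a stand-in for `np.searchsorted` and plain lists instead of NumPy arrays. `sparse_get` is a hypothetical helper, not the actual `SparseVector.__getitem__`.

```python
from bisect import bisect_left


def sparse_get(size, indices, values, index):
    """Look up one entry of a sparse vector stored as sorted indices plus values."""
    if not 0 <= index < size:
        raise IndexError("index %d out of range for size %d" % (index, size))

    # Fast path: outside [first, last] stored index the entry must be zero,
    # so the binary search can be skipped entirely.
    if not indices or index < indices[0] or index > indices[-1]:
        return 0.0

    pos = bisect_left(indices, index)   # stdlib analogue of np.searchsorted
    return values[pos] if indices[pos] == index else 0.0
```

Because the fast path rules out anything beyond the last stored index, `pos` is always a valid position by the time `indices[pos]` is read.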
Author: zero323 <matthew.szymkiewicz@gmail.com>
Closes #9098 from zero323/sparse_vector_getitem_improved.
…rror message
For negative indices in the SparseVector, we update the index value. If we have an incorrect index
at this point, the error message has the incorrect *updated* index instead of the original one. This
change contains the fix for the same.
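In sketch form (a hypothetical helper, with message wording that is an assumption): keep the caller's value around so that the error reports it rather than the shifted one.

```python
def normalize_index(index, size):
    """Map a possibly negative index into [0, size), reporting the caller's value on error."""
    original = index
    if index < 0:
        index += size
    if not 0 <= index < size:
        # Report the index the caller passed, not the shifted one.
        raise IndexError("index %d out of bounds for vector of size %d" % (original, size))
    return index


try:
    normalize_index(-10, 4)
    reported = None
except IndexError as exc:
    reported = str(exc)
```

With the shifted value in the message, the same call would have reported `-6`, which the caller never passed.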
Author: Bhargav Mangipudi <bhargav.mangipudi@gmail.com>
Closes #9069 from bhargav/spark-10759.
Output list of supported modules for python tests in error message when given bad module name.
CC: davies
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #9088 from jkbradley/python-tests-modules.
This patch adds a signal handler to trap Ctrl-C and cancels running job.
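A generic sketch of the pattern with Python's `signal` module. The `cancel_jobs` callback is a placeholder for whatever cancels the running job, and `make_sigint_handler` and `install` are hypothetical names; the actual patch may be structured differently.

```python
import signal


def make_sigint_handler(cancel_jobs):
    """Build a SIGINT handler that cancels running work before interrupting."""
    def handler(signum, frame):
        cancel_jobs()               # e.g. a bound method that cancels all jobs
        raise KeyboardInterrupt()   # preserve the usual Ctrl-C behaviour
    return handler


def install(cancel_jobs):
    # signal.signal must be called from the main thread.
    return signal.signal(signal.SIGINT, make_sigint_handler(cancel_jobs))
```

Re-raising `KeyboardInterrupt` is a choice for this sketch so that an interactive shell still returns to the prompt after the cancellation has been requested.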
Author: Ashwin Shankar <ashankar@netflix.com>
Closes #9033 from ashwinshankar77/master.
Support for recommendUsersForProducts and recommendProductsForUsers in matrix factorization model for PySpark
Author: Vladimir Vladimirov <vladimir.vladimirov@magnetic.com>
Closes #8700 from smartkiwi/SPARK-10535_.
These params were being passed into the StreamingLogisticRegressionWithSGD constructor, but not transferred to the call for model training. Same with StreamingLinearRegressionWithSGD. I added the params as named arguments to the call and also fixed the intercept parameter, which was being passed as regularization value.
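The shape of the bug, reduced to a toy example. `train` and `StreamingTrainer` are hypothetical stand-ins and their parameter order is an assumption; they only show how a positional call can drop settings and put a value in the wrong slot.

```python
def train(data, iterations=100, step=1.0, regParam=0.0, intercept=False,
          convergenceTol=0.001):
    """Hypothetical trainer that just echoes the settings it received."""
    return {"iterations": iterations, "step": step, "regParam": regParam,
            "intercept": intercept, "convergenceTol": convergenceTol}


class StreamingTrainer(object):
    """Illustrative only; not the real StreamingLogisticRegressionWithSGD."""

    def __init__(self, stepSize=0.1, numIterations=50, regParam=0.01,
                 convergenceTol=0.001):
        self.stepSize = stepSize
        self.numIterations = numIterations
        self.regParam = regParam
        self.convergenceTol = convergenceTol

    def update_buggy(self, batch, fit_intercept=True):
        # Positional call: fit_intercept lands in the regParam slot, and
        # regParam and convergenceTol are never forwarded.
        return train(batch, self.numIterations, self.stepSize, fit_intercept)

    def update_fixed(self, batch, fit_intercept=True):
        # Named arguments make each value land where it is intended.
        return train(batch, iterations=self.numIterations, step=self.stepSize,
                     regParam=self.regParam, intercept=fit_intercept,
                     convergenceTol=self.convergenceTol)


trainer = StreamingTrainer(regParam=0.5, convergenceTol=1e-6)
buggy = trainer.update_buggy([])
fixed = trainer.update_fixed([])
```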
Author: Bryan Cutler <bjcutler@us.ibm.com>
Closes #9002 from BryanCutler/StreamingSGD-convergenceTol-bug-10959.
The `__getitem__` method throws an IndexError exception when we try to access an index after the last non-zero entry:
from pyspark.mllib.linalg import Vectors
sv = Vectors.sparse(5, {1: 3})
sv[0]
## 0.0
sv[1]
## 3.0
sv[2]
## Traceback (most recent call last):
## File "<stdin>", line 1, in <module>
## File "/python/pyspark/mllib/linalg/__init__.py", line 734, in __getitem__
## row_ind = inds[insert_index]
## IndexError: index out of bounds
Author: zero323 <matthew.szymkiewicz@gmail.com>
Closes #9009 from zero323/sparse_vector_index_error.
Add the Python API for IsotonicRegression.
Author: Holden Karau <holden@pigscanfly.ca>
Closes #8214 from holdenk/SPARK-9774-add-python-api-for-ml-regression-isotonicregression.
Provide initialModel param for pyspark.mllib.clustering.KMeans
Author: Evan Chen <chene@us.ibm.com>
Closes #8967 from evanyc15/SPARK-10779-pyspark-mllib.
If user doesn't specify `quantileProbs` in `setParams`, it will get reset to the default value. We don't need special handling here. vectorijk yanboliang
Author: Xiangrui Meng <meng@databricks.com>
Closes #9001 from mengxr/SPARK-10957.
Documentation for dropDuplicates() and drop_duplicates() is one and the same. Resolved the error in the example for drop_duplicates using the same approach used for groupby and groupBy, by indicating that dropDuplicates and drop_duplicates are aliases.
Author: asokadiggs <asoka.diggs@intel.com>
Closes #8930 from asokadiggs/jira-10782.
Add method to easily convert a StatCounter instance into a Python dict
https://issues.apache.org/jira/browse/SPARK-6919
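A toy illustration of the idea. `MiniStatCounter`, its `as_dict` method and the dictionary keys are all hypothetical names chosen for this sketch; they are not the real `StatCounter` API.

```python
class MiniStatCounter(object):
    """Small illustrative counter; a stand-in for pyspark's StatCounter."""

    def __init__(self, values=()):
        self.n = 0
        self.total = 0.0
        self.min_value = float("inf")
        self.max_value = float("-inf")
        for v in values:
            self.merge(v)

    def merge(self, value):
        # Fold one value into the running summary.
        self.n += 1
        self.total += value
        self.min_value = min(self.min_value, value)
        self.max_value = max(self.max_value, value)
        return self

    def mean(self):
        return self.total / self.n if self.n else float("nan")

    def as_dict(self):
        # Plain dict: easy to serialize, log, or feed into other tools.
        return {"count": self.n, "sum": self.total, "mean": self.mean(),
                "min": self.min_value, "max": self.max_value}


stats = MiniStatCounter([1.0, 2.0, 3.0, 4.0]).as_dict()
```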
Note: This is my original work and the existing Spark license applies.
Author: Erik Shilts <erik.shilts@opower.com>
Closes #5516 from eshilts/statcounter-asdict.
This integrates the Interaction feature transformer with SparkR R formula support (i.e. support `:`).
To generate reasonable ML attribute names for feature interactions, it was necessary to add the ability to read the original attribute names back from `StructField`, and also to specify custom group prefixes in `VectorAssembler`. This also has the side-benefit of cleaning up the double-underscores in the attributes generated for non-interaction terms.
mengxr
Author: Eric Liang <ekl@databricks.com>
Closes #8830 from ericl/interaction-2.
Python DataFrame.head/take now requires scanning all the partitions. This pull request changes them to delegate the actual implementation to Scala DataFrame (by calling DataFrame.take).
This is more of a hack for fixing this issue in 1.5.1. A more proper fix is to change executeCollect and executeTake to return InternalRow rather than Row, and thus eliminate the extra round-trip conversion.
Author: Reynold Xin <rxin@databricks.com>
Closes #8876 from rxin/SPARK-10731.
JIRA: https://issues.apache.org/jira/browse/SPARK-10446
Currently the method `join(right: DataFrame, usingColumns: Seq[String])` only supports inner join. It is more convenient to have it support other join types.
Author: Liang-Chi Hsieh <viirya@appier.com>
Closes #8600 from viirya/usingcolumns_df.
Remove ._SUCCESS.crc hidden file that may cause problems in distribution tar archive, and is not used
Author: Sean Owen <sowen@cloudera.com>
Closes #8846 from srowen/SPARK-10716.
from the issue:
In Scala, I can supply a custom partitioner to reduceByKey (and other aggregation/repartitioning methods like aggregateByKey and combineByKey), but as far as I can tell from the PySpark API, there's no way to do the same in Python.
Here's an example of my code in Scala:
weblogs.map(s => (getFileType(s), 1)).reduceByKey(new FileTypePartitioner(),_+_)
But I can't figure out how to do the same in Python. The closest I can get is to call repartition before reduceByKey like so:
weblogs.map(lambda s: (getFileType(s), 1)).partitionBy(3,hash_filetype).reduceByKey(lambda v1,v2: v1+v2).collect()
But that defeats the purpose, because I'm shuffling twice instead of once, so my performance is worse instead of better.
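A local, Spark-free model of what is being asked for: route each key with a caller-supplied partition function and reduce in the same pass, so the data is only moved once. `reduce_by_key`, `get_file_type` and `hash_filetype` are hypothetical helpers for this sketch, not PySpark API.

```python
def reduce_by_key(pairs, func, num_partitions, partition_func=hash):
    """Place each key with partition_func and reduce per key in a single pass."""
    partitions = [dict() for _ in range(num_partitions)]
    for key, value in pairs:
        bucket = partitions[partition_func(key) % num_partitions]
        bucket[key] = func(bucket[key], value) if key in bucket else value
    return partitions


def get_file_type(line):
    return line.rsplit(".", 1)[-1]


def hash_filetype(file_type):
    # Custom placement: images in partition 0, pages in 1, everything else in 2.
    return {"jpg": 0, "html": 1}.get(file_type, 2)


weblogs = ["a.jpg", "b.html", "c.jpg", "d.css"]
result = reduce_by_key(((get_file_type(s), 1) for s in weblogs),
                       lambda v1, v2: v1 + v2, 3, hash_filetype)
```

In this model `result` holds one dict per partition, with each file type counted in the partition chosen by `hash_filetype`.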
Author: Holden Karau <holden@pigscanfly.ca>
Closes #8569 from holdenk/SPARK-9821-pyspark-reduceByKey-should-take-a-custom-partitioner.
From JIRA: Add Python API, user guide and example for ml.feature.CountVectorizerModel
Author: Holden Karau <holden@pigscanfly.ca>
Closes #8561 from holdenk/SPARK-9769-add-python-api-for-countvectorizermodel.
There are some missing API docs in pyspark.mllib.linalg.Vector (including DenseVector and SparseVector). We should add them based on their Scala counterparts.
Author: vinodkc <vinod.kc.in@gmail.com>
Closes #8834 from vinodkc/fix_SPARK-10631.
It does not make much sense to set `spark.shuffle.spill` or `spark.sql.planner.externalSort` to false: I believe that these configurations were initially added as "escape hatches" to guard against bugs in the external operators, but these operators are now mature and well-tested. In addition, these configurations are not handled in a consistent way anymore: SQL's Tungsten codepath ignores these configurations and will continue to use spilling operators. Similarly, Spark Core's `tungsten-sort` shuffle manager does not respect `spark.shuffle.spill=false`.
This pull request removes these configurations, adds warnings at the appropriate places, and deletes a large amount of code which was only used in code paths that did not support spilling.
Author: Josh Rosen <joshrosen@databricks.com>
Closes #8831 from JoshRosen/remove-ability-to-disable-spilling.
Since ```assertEquals``` is deprecated, we need to change ```assertEquals``` to ```assertEqual``` in the existing Python unit tests.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #8814 from yanboliang/spark-10615.
JIRA: https://issues.apache.org/jira/browse/SPARK-10642
When calling `rdd.lookup()` on an RDD with tuple keys, `portable_hash` will return a long. That causes `DAGScheduler.submitJob` to throw `java.lang.ClassCastException: java.lang.Long cannot be cast to java.lang.Integer`.
Author: Liang-Chi Hsieh <viirya@appier.com>
Closes #8796 from viirya/fix-pyrdd-lookup.
Missed this when reviewing `pyspark.mllib.random` for SPARK-10275.
Author: noelsmith <mail@noelsmith.com>
Closes #8773 from noel-smith/mllib-random-versionadded-fix.
Duplicated the since decorator from pyspark.sql into pyspark (also tweaked to handle functions without docstrings).
Added since to methods + "versionadded::" to classes (derived from the git file history in pyspark).
Author: noelsmith <mail@noelsmith.com>
Closes #8633 from noel-smith/SPARK-10273-since-mllib-feature.
The PySpark DenseVector and SparseVector ```__eq__``` methods should use semantic equality, and a DenseVector should be comparable with a SparseVector.
Implement the PySpark DenseVector and SparseVector ```__hash__``` method based on the first 16 entries, so that PySpark Vector objects can be used in collections.
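A sketch of what semantic equality and a bounded hash can look like, using plain lists for dense vectors and `(size, {index: value})` tuples for sparse ones. The helper names are hypothetical, and reading "the first 16 entries" as "positions 0 to 15" is an assumption of this sketch, not necessarily what the PR does.

```python
def to_sparse_items(vec):
    """Return (size, {index: value}) for a dense list or a (size, dict) sparse pair."""
    if isinstance(vec, tuple):
        size, entries = vec
        return size, {i: v for i, v in entries.items() if v != 0}
    return len(vec), {i: v for i, v in enumerate(vec) if v != 0}


def vectors_equal(a, b):
    # Semantic equality: same size and same non-zero entries, whatever the storage.
    return to_sparse_items(a) == to_sparse_items(b)


def vector_hash(vec, limit=16):
    # Hash the size plus the non-zero entries among the first `limit` positions,
    # so equal vectors hash alike regardless of representation.
    size, items = to_sparse_items(vec)
    head = tuple(sorted((i, v) for i, v in items.items() if i < limit))
    return hash((size, head))


dense = [0.0, 3.0, 0.0]
sparse = (3, {1: 3.0})
```

Bounding the hash keeps its cost constant for very long vectors; vectors that differ only after the bounded prefix collide, which is allowed as long as `__eq__` still tells them apart.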
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #8166 from yanboliang/spark-9793.
[SPARK-3382](https://issues.apache.org/jira/browse/SPARK-3382) added a ```convergenceTol``` parameter for GradientDescent-based methods in Scala. We need that parameter in Python; otherwise, Python users will not be able to adjust that behavior (or even reproduce behavior from previous releases since the default changed).
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #8457 from yanboliang/spark-10194.
Adds STDDEV support for DataFrame, using a 1-pass online/parallel algorithm to compute variance. Please review the code change.
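For readers unfamiliar with one-pass variance, here is a sketch assuming the familiar Welford update for single values and the pairwise merge formula for combining partial results. The implementation in this PR may differ in its details; all names below are illustrative.

```python
def update(state, x):
    """Fold one value into (count, mean, m2) using Welford's update."""
    n, mean, m2 = state
    n += 1
    delta = x - mean
    mean += delta / n
    m2 += delta * (x - mean)
    return n, mean, m2


def merge(a, b):
    """Combine two partial states computed on disjoint chunks of data."""
    na, mean_a, m2a = a
    nb, mean_b, m2b = b
    if na == 0:
        return b
    if nb == 0:
        return a
    n = na + nb
    delta = mean_b - mean_a
    mean = mean_a + delta * nb / n
    m2 = m2a + m2b + delta * delta * na * nb / n
    return n, mean, m2


def stddev_samp(state):
    n, _, m2 = state
    return (m2 / (n - 1)) ** 0.5 if n > 1 else float("nan")


def summarize(values):
    state = (0, 0.0, 0.0)
    for x in values:
        state = update(state, x)
    return state


# Two "partitions" summarized independently, then merged.
left = summarize([2.0, 4.0, 4.0, 4.0])
right = summarize([5.0, 5.0, 7.0, 9.0])
combined = merge(left, right)
```

Each partition is scanned once, and only the three-number state has to be exchanged between partitions.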
Author: JihongMa <linlin200605@gmail.com>
Author: Jihong MA <linlin200605@gmail.com>
Author: Jihong MA <jihongma@jihongs-mbp.usca.ibm.com>
Author: Jihong MA <jihongma@Jihongs-MacBook-Pro.local>
Closes #6297 from JihongMA/SPARK-SQL.
Just fixing a typo in exception message, raised when attempting to pickle SparkContext.
Author: Icaro Medeiros <icaro.medeiros@gmail.com>
Closes #8724 from icaromedeiros/master.
Changes:
* Make Scala doc for StringIndexerInverse clearer. Also remove Scala doc from transformSchema, so that the doc is inherited.
* MetadataUtils.scala: “Helper utilities for tree-based algorithms” -> not just trees anymore
CC: holdenk mengxr
Author: Joseph K. Bradley <joseph@databricks.com>
Closes #8679 from jkbradley/doc-fixes-1.5.
LinearRegression and LogisticRegression lack some Params in Python, and some Params are not shared classes, which means we need to write them for each class. These Params are listed here:
```scala
HasElasticNetParam
HasFitIntercept
HasStandardization
HasThresholds
```
Here we implement them as shared params on the Python side and bring the LinearRegression/LogisticRegression parameters to parity with the Scala ones.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #8508 from yanboliang/spark-10026.
Missing methods of ml.feature are listed here:
```StringIndexer``` lacks the parameter ```handleInvalid```.
```StringIndexerModel``` lacks the method ```labels```.
```VectorIndexerModel``` lacks the methods ```numFeatures``` and ```categoryMaps```.
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #8313 from yanboliang/spark-10027.
Modified class-level docstrings to mark all feature transformers in pyspark.ml as experimental.
Author: noelsmith <mail@noelsmith.com>
Closes #8623 from noel-smith/SPARK-10094-mark-pyspark-ml-trans-exp.
- Fixed information around Python API tags in streaming programming guides
- Added missing stuff in python docs
Author: Tathagata Das <tathagata.das1565@gmail.com>
Closes #8595 from tdas/SPARK-10440.
The `pyspark.sql.column.Column` object has a `__getitem__` method, which makes it iterable in Python. It has `__getitem__` to cover the case where the column is a list or dict, so that you can access a particular element of it in the DF API. The ability to iterate over it is just a side effect that might confuse people getting familiar with Spark DF (as you might iterate this way over a Pandas DF, for instance).
Issue reproduction:
```
df = sqlContext.jsonRDD(sc.parallelize(['{"name": "El Magnifico"}']))
for i in df["name"]: print i
```
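Why this happens, in a Spark-free sketch: when a class defines `__getitem__` but not `__iter__`, Python falls back to calling `__getitem__` with 0, 1, 2, ... until an `IndexError` is raised. A `__getitem__` that builds an expression for any key never raises, so the loop never ends. `ColumnLike` and `NonIterableColumn` are hypothetical stand-ins; defining `__iter__` to raise `TypeError` is one way to opt out, assumed here for illustration.

```python
import itertools


class ColumnLike(object):
    """Hypothetical stand-in for a column expression with item access."""

    def __init__(self, name):
        self.name = name

    def __getitem__(self, key):
        # Builds a new expression for any key, so it never raises IndexError.
        return ColumnLike("%s[%r]" % (self.name, key))


class NonIterableColumn(ColumnLike):
    def __iter__(self):
        # Opt out of the legacy __getitem__ iteration protocol explicitly.
        raise TypeError("Column is not iterable")


# Iteration "works" via the fallback protocol and would go on forever;
# islice cuts it off after three items.
first_three = [c.name for c in itertools.islice(iter(ColumnLike("name")), 3)]

try:
    iter(NonIterableColumn("name"))
    iter_error = None
except TypeError as exc:
    iter_error = str(exc)
```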
Author: 0x0FFF <programmerag@gmail.com>
Closes #8574 from 0x0FFF/SPARK-10417.
This PR addresses issue [SPARK-10392](https://issues.apache.org/jira/browse/SPARK-10392)
The problem is that for the "start of epoch" date (01 Jan 1970) the PySpark class DateType returns 0 instead of a `datetime.date`, due to the implementation of its return statement.
Issue reproduction on master:
```
>>> from pyspark.sql.types import *
>>> a = DateType()
>>> a.fromInternal(0)
0
>>> a.fromInternal(1)
datetime.date(1970, 1, 2)
```
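One way a return statement can produce exactly this behaviour is a truthiness-based short circuit. The sketch below assumes that form for illustration; the function names are hypothetical and the real method body may have been written differently.

```python
import datetime

EPOCH_ORDINAL = datetime.date(1970, 1, 1).toordinal()


def from_internal_buggy(v):
    # `v and ...` short-circuits on 0, so day 0 is returned as the integer 0.
    return v and datetime.date.fromordinal(v + EPOCH_ORDINAL)


def from_internal_fixed(v):
    # Only None should be passed through untouched.
    if v is not None:
        return datetime.date.fromordinal(v + EPOCH_ORDINAL)
```

The explicit `is not None` test distinguishes a missing value from day 0, which the truthiness test cannot.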
Author: 0x0FFF <programmerag@gmail.com>
Closes #8556 from 0x0FFF/SPARK-10392.
This PR addresses [SPARK-10162](https://issues.apache.org/jira/browse/SPARK-10162)
The issue is with the DataFrame filter() function when a datetime.datetime is passed to it:
* Timezone information of this datetime is ignored
* This datetime is assumed to be in local timezone, which depends on the OS timezone setting
Fix includes both code change and regression test. Problem reproduction code on master:
```python
import pytz
from datetime import datetime
from pyspark.sql import *
from pyspark.sql.types import *
sqc = SQLContext(sc)
df = sqc.createDataFrame([], StructType([StructField("dt", TimestampType())]))
m1 = pytz.timezone('UTC')
m2 = pytz.timezone('Etc/GMT+3')
df.filter(df.dt > datetime(2000, 01, 01, tzinfo=m1)).explain()
df.filter(df.dt > datetime(2000, 01, 01, tzinfo=m2)).explain()
```
It gives the same timestamp ignoring time zone:
```
>>> df.filter(df.dt > datetime(2000, 01, 01, tzinfo=m1)).explain()
Filter (dt#0 > 946713600000000)
Scan PhysicalRDD[dt#0]
>>> df.filter(df.dt > datetime(2000, 01, 01, tzinfo=m2)).explain()
Filter (dt#0 > 946713600000000)
Scan PhysicalRDD[dt#0]
```
After the fix:
```
>>> df.filter(df.dt > datetime(2000, 01, 01, tzinfo=m1)).explain()
Filter (dt#0 > 946684800000000)
Scan PhysicalRDD[dt#0]
>>> df.filter(df.dt > datetime(2000, 01, 01, tzinfo=m2)).explain()
Filter (dt#0 > 946695600000000)
Scan PhysicalRDD[dt#0]
```
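The difference between the two behaviours can be reproduced without Spark. This sketch swaps `pytz` for the standard library's fixed-offset `timezone` (Python 3), and the conversion helpers are hypothetical, not the code in PySpark.

```python
import calendar
import time
from datetime import datetime, timedelta, timezone


def to_micros_naive_local(dt):
    # Ignores tzinfo: the wall-clock fields are interpreted in the OS time zone.
    return int(time.mktime(dt.timetuple())) * 1000000 + dt.microsecond


def to_micros(dt):
    # Aware datetimes are converted through UTC; naive ones fall back to local time.
    if dt.tzinfo is not None:
        seconds = calendar.timegm(dt.utctimetuple())
    else:
        seconds = time.mktime(dt.timetuple())
    return int(seconds) * 1000000 + dt.microsecond


utc = timezone.utc
gmt_plus_3 = timezone(timedelta(hours=-3))   # same offset as 'Etc/GMT+3' (UTC-3)

a = to_micros(datetime(2000, 1, 1, tzinfo=utc))
b = to_micros(datetime(2000, 1, 1, tzinfo=gmt_plus_3))
```

With the UTC-based conversion, `a` and `b` differ by three hours (10800 seconds), matching the two values in the "After the fix" output above; `to_micros_naive_local` returns the same number for both inputs.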
PR [8536](https://github.com/apache/spark/pull/8536) was accidentally closed by me dropping the repo
Author: 0x0FFF <programmerag@gmail.com>
Closes #8555 from 0x0FFF/SPARK-10162.
* Added isLargerBetter() method to Pyspark Evaluator to match the Scala version.
* JavaEvaluator delegates isLargerBetter() to underlying Scala object.
* Added check for isLargerBetter() in CrossValidator to determine whether to use argmin or argmax.
* Added test cases for where smaller is better (RMSE) and larger is better (R-Squared).
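The selection logic from the list above, in sketch form (`best_index` is a hypothetical helper, not the CrossValidator code itself):

```python
def best_index(metrics, is_larger_better):
    """Pick the index of the best metric, honouring the evaluator's direction."""
    pick = max if is_larger_better else min
    return pick(range(len(metrics)), key=metrics.__getitem__)


rmse_by_model = [0.9, 0.4, 0.7]   # smaller is better
r2_by_model = [0.2, 0.8, 0.5]     # larger is better

best_by_rmse = best_index(rmse_by_model, is_larger_better=False)
best_by_r2 = best_index(r2_by_model, is_larger_better=True)
```

Always taking the maximum would pick model 0 for RMSE, which is the worst of the three.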
(This contribution is my original work, and I license the work to the project under Spark's open source license.)
Author: noelsmith <mail@noelsmith.com>
Closes #8399 from noel-smith/pyspark-rmse-xval-fix.
PySpark DataFrameReader should accept an RDD of Strings (like the Scala version does) for JSON, rather than only taking a path.
If this PR is merged, it should be duplicated to cover the other input types (not just JSON).
Author: Yanbo Liang <ybliang8@gmail.com>
Closes #8444 from yanboliang/spark-9964.