Commit graph

689 commits

Author SHA1 Message Date
Liang-Chi Hsieh fee3438a32 [SPARK-8218][SQL] Add binary log math function
JIRA: https://issues.apache.org/jira/browse/SPARK-8218

Because there is already `log` unary function defined, the binary log function is called `logarithm` for now.

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #6725 from viirya/expr_binary_log and squashes the following commits:

bf96bd9 [Liang-Chi Hsieh] Compare log result in string.
102070d [Liang-Chi Hsieh] Round log result to better comparing in python test.
fd01863 [Liang-Chi Hsieh] For comments.
beed631 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into expr_binary_log
6089d11 [Liang-Chi Hsieh] Remove unnecessary override.
8cf37b7 [Liang-Chi Hsieh] For comments.
bc89597 [Liang-Chi Hsieh] For comments.
db7dc38 [Liang-Chi Hsieh] Use ctor instead of companion object.
0634ef7 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into expr_binary_log
1750034 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into expr_binary_log
3d75bfc [Liang-Chi Hsieh] Fix scala style.
5b39c02 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into expr_binary_log
23c54a3 [Liang-Chi Hsieh] Fix scala style.
ebc9929 [Liang-Chi Hsieh] Let Logarithm accept one parameter too.
605574d [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into expr_binary_log
21c3bfd [Liang-Chi Hsieh] Fix scala style.
c6c187f [Liang-Chi Hsieh] For comments.
c795342 [Liang-Chi Hsieh] Merge remote-tracking branch 'upstream/master' into expr_binary_log
f373bac [Liang-Chi Hsieh] Add binary log expression.
2015-06-17 23:31:30 -07:00
MechCoder 22732e1eca [SPARK-7605] [MLLIB] [PYSPARK] Python API for ElementwiseProduct
Python API for org.apache.spark.mllib.feature.ElementwiseProduct

Author: MechCoder <manojkumarsivaraj334@gmail.com>

Closes #6346 from MechCoder/spark-7605 and squashes the following commits:

79d1ef5 [MechCoder] Consistent and support list / array types
5f81d81 [MechCoder] [SPARK-7605] [MLlib] Python API for ElementwiseProduct
2015-06-17 22:08:38 -07:00
zsxwing 0fc4b96f3e [SPARK-8373] [PYSPARK] Add emptyRDD to pyspark and fix the issue when calling sum on an empty RDD
This PR fixes the sum issue and also adds `emptyRDD` so that it's easy to create a test case.

Author: zsxwing <zsxwing@gmail.com>

Closes #6826 from zsxwing/python-emptyRDD and squashes the following commits:

b36993f [zsxwing] Update the return type to JavaRDD[T]
71df047 [zsxwing] Add emptyRDD to pyspark and fix the issue when calling sum on an empty RDD
2015-06-17 13:59:39 -07:00
MechCoder 6765ef98df [SPARK-6390] [SQL] [MLlib] Port MatrixUDT to PySpark
MatrixUDT was recently coded in scala. This has been ported to PySpark

Author: MechCoder <manojkumarsivaraj334@gmail.com>

Closes #6354 from MechCoder/spark-6390 and squashes the following commits:

fc4dc1e [MechCoder] Better error message
c940a44 [MechCoder] Added test
aa9c391 [MechCoder] Add pyUDT to MatrixUDT
62a2a7d [MechCoder] [SPARK-6390] Port MatrixUDT to PySpark
2015-06-17 11:10:16 -07:00
Yanbo Liang ca998757e8 [SPARK-7916] [MLLIB] MLlib Python doc parity check for classification and regression
Check then make the MLlib Python classification and regression doc to be as complete as the Scala doc.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #6460 from yanboliang/spark-7916 and squashes the following commits:

f8deda4 [Yanbo Liang] trigger jenkins
6dc4d99 [Yanbo Liang] address comments
ce2a43e [Yanbo Liang] truncate too long line and remove extra sparse
3eaf6ad [Yanbo Liang] MLlib Python doc parity check for classification and regression
2015-06-16 14:30:30 -07:00
Davies Liu 424b0075a1 [SPARK-6411] [SQL] [PySpark] support date/datetime with timezone in Python
Spark SQL does not support timezone, and Pyrolite does not support timezone well. This patch will convert datetime into POSIX timestamp (without confusing of timezone), which is used by SQL. If the datetime object does not have timezone, it's treated as local time.

The timezone in RDD will be lost after one round trip, all the datetime from SQL will be local time.

Because of Pyrolite, datetime from SQL only has precision as 1 millisecond.

This PR also drop the timezone in date, convert it to number of days since epoch (used in SQL).

Author: Davies Liu <davies@databricks.com>

Closes #6250 from davies/tzone and squashes the following commits:

44d8497 [Davies Liu] add timezone support for DateType
99d9d9c [Davies Liu] use int for timestamp
10aa7ca [Davies Liu] Merge branch 'master' of github.com:apache/spark into tzone
6a29aa4 [Davies Liu] support datetime with timezone
2015-06-11 01:00:41 -07:00
Davies Liu 37719e0cd0 [SPARK-8189] [SQL] use Long for TimestampType in SQL
This PR change to use Long as internal type for TimestampType for efficiency, which means it will the precision below 100ns.

Author: Davies Liu <davies@databricks.com>

Closes #6733 from davies/timestamp and squashes the following commits:

d9565fa [Davies Liu] remove print
65cf2f1 [Davies Liu] fix Timestamp in SparkR
86fecfb [Davies Liu] disable two timestamp tests
8f77ee0 [Davies Liu] fix scala style
246ee74 [Davies Liu] address comments
309d2e1 [Davies Liu] use Long for TimestampType in SQL
2015-06-10 16:55:39 -07:00
Reynold Xin 1b499993ad [SPARK-7886] Add built-in expressions to FunctionRegistry.
This patch switches to using FunctionRegistry for built-in expressions. It is based on #6463, but with some work to simplify it along with unit tests.

TODOs for future pull requests:
- Use static registration so we don't need to register all functions every time we start a new SQLContext
- Switch to using this in HiveContext

Author: Reynold Xin <rxin@databricks.com>
Author: Santiago M. Mola <santi@mola.io>

Closes #6710 from rxin/udf-registry and squashes the following commits:

6930822 [Reynold Xin] Fixed Python test.
b802c9a [Reynold Xin] Made UDF case insensitive.
e60d815 [Reynold Xin] Made UDF case insensitive.
852f9c0 [Reynold Xin] Fixed style violation.
e76a3c1 [Reynold Xin] Fixed parser.
52ddaba [Reynold Xin] Fixed compilation.
ee7854f [Reynold Xin] Improved error reporting.
ff906f2 [Reynold Xin] More robust constructor calling.
77b46f1 [Reynold Xin] Simplified the code.
2a2a149 [Reynold Xin] Merge pull request #6463 from smola/SPARK-7886
8616924 [Santiago M. Mola] [SPARK-7886] Add built-in expressions to FunctionRegistry.
2015-06-09 16:24:38 +08:00
Liang-Chi Hsieh 7658eb28a2 [SPARK-7990][SQL] Add methods to facilitate equi-join on multiple joining keys
JIRA: https://issues.apache.org/jira/browse/SPARK-7990

Author: Liang-Chi Hsieh <viirya@gmail.com>

Closes #6616 from viirya/multi_keys_equi_join and squashes the following commits:

cd5c888 [Liang-Chi Hsieh] Import reduce in python3.
c43722c [Liang-Chi Hsieh] For comments.
0400e89 [Liang-Chi Hsieh] Fix scala style.
cc90015 [Liang-Chi Hsieh] Add methods to facilitate equi-join on multiple joining keys.
2015-06-08 23:27:05 -07:00
cody koeninger b127ff8a0c [SPARK-2808] [STREAMING] [KAFKA] cleanup tests from
see if requiring producer acks eliminates the need for waitUntilLeaderOffset calls in tests

Author: cody koeninger <cody@koeninger.org>

Closes #5921 from koeninger/kafka-0.8.2-test-cleanup and squashes the following commits:

1e89dc8 [cody koeninger] Merge branch 'master' into kafka-0.8.2-test-cleanup
4662828 [cody koeninger] [Streaming][Kafka] filter mima issue for removal of method from private test class
af1e083 [cody koeninger] Merge branch 'master' into kafka-0.8.2-test-cleanup
4298ac2 [cody koeninger] [Streaming][Kafka] update comment to trigger jenkins attempt
1274afb [cody koeninger] [Streaming][Kafka] see if requiring producer acks eliminates the need for waitUntilLeaderOffset calls in tests
2015-06-07 21:42:45 +01:00
Reynold Xin 0ac47083f7 [SPARK-8146] DataFrame Python API: Alias replace in df.na
Author: Reynold Xin <rxin@databricks.com>

Closes #6688 from rxin/df-alias-replace and squashes the following commits:

774c19c [Reynold Xin] [SPARK-8146] DataFrame Python API: Alias replace in DataFrameNaFunctions.
2015-06-07 01:21:02 -07:00
MechCoder 5aa804f3c6 [SPARK-7639] [PYSPARK] [MLLIB] Python API for KernelDensity
Python API for KernelDensity

Author: MechCoder <manojkumarsivaraj334@gmail.com>

Closes #6387 from MechCoder/spark-7639 and squashes the following commits:

17abc62 [MechCoder] add tests
2de6540 [MechCoder] style tests
bf4acc0 [MechCoder] Added doctests
84359d5 [MechCoder] [SPARK-7639] Python API for KernelDensity
2015-06-06 14:52:14 -07:00
amey 356a4a9b93 [SPARK-7991] [PySpark] Adding support for passing lists to describe.
This is a minor change.

Author: amey <amey@skytree.net>

Closes #6655 from ameyc/JIRA-7991/support-passing-list-to-describe and squashes the following commits:

e8a1dff [amey] Adding support for passing lists to describe.
2015-06-05 13:49:33 -07:00
Ted Blackman e505460599 [SPARK-8116][PYSPARK] Allow sc.range() to take a single argument.
Author: Ted Blackman <ted.blackman@gmail.com>

Closes #6656 from belisarius222/branch-1.4 and squashes the following commits:

747cbc2 [Ted Blackman] [SPARK-8116][PYSPARK] Allow sc.range() to take a single argument.

(cherry picked from commit f02af7c8f7)
Signed-off-by: Reynold Xin <rxin@databricks.com>
2015-06-04 22:22:01 -07:00
Mike Dusenberry df7da07a86 [SPARK-7969] [SQL] Added a DataFrame.drop function that accepts a Column reference.
Added a `DataFrame.drop` function that accepts a `Column` reference rather than a `String`, and added associated unit tests.  Basically iterates through the `DataFrame` to find a column with an expression that is equivalent to that of the `Column` argument supplied to the function.

Author: Mike Dusenberry <dusenberrymw@gmail.com>

Closes #6585 from dusenberrymw/SPARK-7969_Drop_method_on_Dataframes_should_handle_Column and squashes the following commits:

514727a [Mike Dusenberry] Updating the @since tag of the drop(Column) function doc to reflect version 1.4.1 instead of 1.4.0.
2f1bb4e [Mike Dusenberry] Adding an additional assert statement to the 'drop column after join' unit test in order to make sure the correct column was indeed left over.
6bf7c0e [Mike Dusenberry] Minor code formatting change.
e583888 [Mike Dusenberry] Adding more Python doctests for the df.drop with column reference function to test joined datasets that have columns with the same name.
5f74401 [Mike Dusenberry] Updating DataFrame.drop with column reference function to use logicalPlan.output to prevent ambiguities resulting from columns with the same name. Also added associated unit tests for joined datasets with duplicate column names.
4b8bbe8 [Mike Dusenberry] Adding Python support for Dataframe.drop with a Column reference.
986129c [Mike Dusenberry] Added a DataFrame.drop function that accepts a Column reference rather than a String, and added associated unit tests.  Basically iterates through the DataFrame to find a column with an expression that is equivalent to one supplied to the function.
2015-06-04 11:30:07 -07:00
Reynold Xin 2c5a06cafd Update documentation for [SPARK-7980] [SQL] Support SQLContext.range(end) 2015-06-03 14:20:27 -07:00
animesh d053a31be9 [SPARK-7980] [SQL] Support SQLContext.range(end)
1. range() overloaded in SQLContext.scala
2. range() modified in python sql context.py
3. Tests added accordingly in DataFrameSuite.scala and python sql tests.py

Author: animesh <animesh@apache.spark>

Closes #6609 from animeshbaranawal/SPARK-7980 and squashes the following commits:

935899c [animesh] SPARK-7980:python+scala changes
2015-06-03 11:28:18 -07:00
Reynold Xin ce320cb2db [SPARK-8060] Improve DataFrame Python test coverage and documentation.
Author: Reynold Xin <rxin@databricks.com>

Closes #6601 from rxin/python-read-write-test-and-doc and squashes the following commits:

baa8ad5 [Reynold Xin] Code review feedback.
f081d47 [Reynold Xin] More documentation updates.
c9902fa [Reynold Xin] [SPARK-8060] Improve DataFrame Python reader/writer interface doc and testing.
2015-06-03 00:23:34 -07:00
MechCoder 452eb82dd7 [SPARK-8032] [PYSPARK] Make version checking for NumPy in MLlib more robust
The current checking does version `1.x' is less than `1.4' this will fail if x has greater than 1 digit, since x > 4, however `1.x` < `1.4`

It fails in my system since I have version `1.10` :P

Author: MechCoder <manojkumarsivaraj334@gmail.com>

Closes #6579 from MechCoder/np_ver and squashes the following commits:

15430f8 [MechCoder] fix syntax error
893fb7e [MechCoder] remove equal to
e35f0d4 [MechCoder] minor
e89376c [MechCoder] Better checking
22703dd [MechCoder] [SPARK-8032] Make version checking for NumPy in MLlib more robust
2015-06-02 23:24:47 -07:00
Davies Liu 605ddbb27c [SPARK-8038] [SQL] [PYSPARK] fix Column.when() and otherwise()
Thanks ogirardot, closes #6580

cc rxin JoshRosen

Author: Davies Liu <davies@databricks.com>

Closes #6590 from davies/when and squashes the following commits:

c0f2069 [Davies Liu] fix Column.when() and otherwise()
2015-06-02 13:38:06 -07:00
Xiangrui Meng bd97840d5c [SPARK-7432] [MLLIB] fix flaky CrossValidator doctest
The new test uses CV to compare `maxIter=0` and `maxIter=1`, and validate on the evaluation result. jkbradley

Author: Xiangrui Meng <meng@databricks.com>

Closes #6572 from mengxr/SPARK-7432 and squashes the following commits:

c236bb8 [Xiangrui Meng] fix flacky cv doctest
2015-06-02 08:51:00 -07:00
Davies Liu 445647a1a3 [SPARK-8021] [SQL] [PYSPARK] make Python read/write API consistent with Scala
add schema()/format()/options() for reader,  add mode()/format()/options()/partitionBy() for writer

cc rxin yhuai  pwendell

Author: Davies Liu <davies@databricks.com>

Closes #6578 from davies/readwrite and squashes the following commits:

720d293 [Davies Liu] address comments
b65dfa2 [Davies Liu] Update readwriter.py
1299ab6 [Davies Liu] make Python API consistent with Scala
2015-06-02 08:37:18 -07:00
Reynold Xin 4c868b9943 [minor doc] Add exploratory data analysis warning for DataFrame.stat.freqItem API
Author: Reynold Xin <rxin@databricks.com>

Closes #6569 from rxin/freqItemsWarning and squashes the following commits:

7eec145 [Reynold Xin] [minor doc] Add exploratory data analysis warning for DataFrame.stat.freqItem API.
2015-06-01 21:29:39 -07:00
Davies Liu b7ab0299b0 [SPARK-7497] [PYSPARK] [STREAMING] fix streaming flaky tests
Increase the duration and timeout in streaming python tests.

Author: Davies Liu <davies@databricks.com>

Closes #6239 from davies/flaky_tests and squashes the following commits:

d6aee8f [Davies Liu] fix window tests
26317f7 [Davies Liu] Merge branch 'master' of github.com:apache/spark into flaky_tests
7947db6 [Davies Liu] fix streaming flaky tests
2015-06-01 14:40:08 -07:00
Davies Liu 91777a1c3a [SPARK-7978] [SQL] [PYSPARK] DecimalType should not be singleton
Author: Davies Liu <davies@databricks.com>

Closes #6532 from davies/decimal and squashes the following commits:

c7fcbce [Davies Liu] Update tests.py
1425359 [Davies Liu] DecimalType should not be singleton
2015-05-31 19:55:57 -07:00
Yanbo Liang 1617363fbb [SPARK-7918] [MLLIB] MLlib Python doc parity check for evaluation and feature
Check then make the MLlib Python evaluation and feature doc to be as complete as the Scala doc.

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #6461 from yanboliang/spark-7918 and squashes the following commits:

940e3f1 [Yanbo Liang] truncate too long line and remove extra sparse
a80ae58 [Yanbo Liang] MLlib Python doc parity check for evaluation and feature
2015-05-30 16:24:07 -07:00
Michael Nazario 1c5b19827a [SPARK-7899] [PYSPARK] Fix Python 3 pyspark/sql/types module conflict
This PR makes the types module in `pyspark/sql/types` work with pylint static analysis by removing the dynamic naming of the `pyspark/sql/_types` module to `pyspark/sql/types`.

Tests are now loaded using `$PYSPARK_DRIVER_PYTHON -m module` rather than `$PYSPARK_DRIVER_PYTHON module.py`. The old method adds the location of `module.py` to `sys.path`, so this change prevents accidental use of relative paths in Python.

Author: Michael Nazario <mnazario@palantir.com>

Closes #6439 from mnazario/feature/SPARK-7899 and squashes the following commits:

366ef30 [Michael Nazario] Remove hack on random.py
bb8b04d [Michael Nazario] Make doctests consistent with other tests
6ee4f75 [Michael Nazario] Change test scripts to use "-m"
673528f [Michael Nazario] Move _types back to types
2015-05-29 14:13:44 -07:00
Xiangrui Meng 23452be944 [SPARK-7912] [SPARK-7921] [MLLIB] Update OneHotEncoder to handle ML attributes and change includeFirst to dropLast
This PR contains two major changes to `OneHotEncoder`:

1. more robust handling of ML attributes. If the input attribute is unknown, we look at the values to get the max category index
2. change `includeFirst` to `dropLast` and leave the default to `true`. There are couple benefits:

    a. consistent with other tutorials of one-hot encoding (or dummy coding) (e.g., http://www.ats.ucla.edu/stat/mult_pkg/faq/general/dummy.htm)
    b. keep the indices unmodified in the output vector. If we drop the first, all indices will be shifted by 1.
    c. If users use `StringIndex`, the last element is the least frequent one.

Sorry for including two changes in one PR! I'll update the user guide in another PR.

jkbradley sryza

Author: Xiangrui Meng <meng@databricks.com>

Closes #6466 from mengxr/SPARK-7912 and squashes the following commits:

a280dca [Xiangrui Meng] fix tests
d8f234d [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-7912
171b276 [Xiangrui Meng] mention the difference between our impl vs sklearn's
00dfd96 [Xiangrui Meng] update OneHotEncoder in Python
208ddad [Xiangrui Meng] update OneHotEncoder to handle ML attributes and change includeFirst to dropLast
2015-05-29 00:51:12 -07:00
Xiangrui Meng db95137897 [SPARK-7922] [MLLIB] use DataFrames for user/item factors in ALSModel
Expose user/item factors in DataFrames. This is to be more consistent with the pipeline API. It also helps maintain consistent APIs across languages. This PR also removed fitting params from `ALSModel`.

coderxiang

Author: Xiangrui Meng <meng@databricks.com>

Closes #6468 from mengxr/SPARK-7922 and squashes the following commits:

7bfb1d5 [Xiangrui Meng] update ALSModel in PySpark
1ba5607 [Xiangrui Meng] use DataFrames for user/item factors in ALS
2015-05-28 22:38:38 -07:00
Xiangrui Meng 834e699524 [MINOR] fix RegressionEvaluator doc
`make clean html` under `python/doc` returns
~~~
/Users/meng/src/spark/python/pyspark/ml/evaluation.py:docstring of pyspark.ml.evaluation.RegressionEvaluator.setParams:3: WARNING: Definition list ends without a blank line; unexpected unindent.
~~~

harsha2010

Author: Xiangrui Meng <meng@databricks.com>

Closes #6469 from mengxr/fix-regression-evaluator-doc and squashes the following commits:

91e2dad [Xiangrui Meng] fix RegressionEvaluator doc
2015-05-28 21:26:43 -07:00
linweizhong 8948ad3fb5 [SPARK-7339] [PYSPARK] PySpark shuffle spill memory sometimes are not correct
In PySpark we get memory used before and after spill, then use the difference of these two value as memorySpilled, but if the before value is small than after value, then we will get a negative value, but this scenario 0 value may be more reasonable.

Below is the result in HistoryServer we have tested:
Index	ID	Attempt	Status	Locality Level	Executor ID / Host	Launch Time	Duration	GC Time	Input Size / Records	Write Time	Shuffle Write Size / Records	Shuffle Spill (Memory)	Shuffle Spill (Disk)	Errors
0	0	0	SUCCESS	NODE_LOCAL	3 / vm119	2015/05/04 17:31:06	21 s	0.1 s	128.1 MB (hadoop) / 3237	70 ms	10.1 MB / 2529	0.0 B	5.7 MB
2	2	0	SUCCESS	NODE_LOCAL	1 / vm118	2015/05/04 17:31:06	22 s	89 ms	128.1 MB (hadoop) / 3205	0.1 s	10.1 MB / 2529	-1048576.0 B	5.9 MB
1	1	0	SUCCESS	NODE_LOCAL	2 / vm117	2015/05/04 17:31:06	22 s	0.1 s	128.1 MB (hadoop) / 3271	68 ms	10.1 MB / 2529	-1048576.0 B	5.6 MB
4	4	0	SUCCESS	NODE_LOCAL	2 / vm117	2015/05/04 17:31:06	22 s	0.1 s	128.1 MB (hadoop) / 3192	51 ms	10.1 MB / 2529	-1048576.0 B	5.9 MB
3	3	0	SUCCESS	NODE_LOCAL	3 / vm119	2015/05/04 17:31:06	22 s	0.1 s	128.1 MB (hadoop) / 3262	51 ms	10.1 MB / 2529	1024.0 KB	5.8 MB
5	5	0	SUCCESS	NODE_LOCAL	1 / vm118	2015/05/04 17:31:06	22 s	89 ms	128.1 MB (hadoop) / 3256	93 ms	10.1 MB / 2529	-1048576.0 B	5.7 MB

/cc davies

Author: linweizhong <linweizhong@huawei.com>

Closes #5887 from Sephiroth-Lin/spark-7339 and squashes the following commits:

9186c81 [linweizhong] Use max function to get a nonnegative value
d41672b [linweizhong] Update MemoryBytesSpilled when memorySpilled > 0
2015-05-26 08:35:39 -07:00
Ram Sriharsha 65c696ecc0 [SPARK-7833] [ML] Add python wrapper for RegressionEvaluator
Author: Ram Sriharsha <rsriharsha@hw11853.local>

Closes #6365 from harsha2010/SPARK-7833 and squashes the following commits:

923f288 [Ram Sriharsha] cleanup
7623b7d [Ram Sriharsha] python style fix
9743f83 [Ram Sriharsha] [SPARK-7833][ml] Add python wrapper for RegressionEvaluator
2015-05-24 10:36:02 -07:00
Davies Liu be47af1bdb [SPARK-7840] add insertInto() to Writer
Add tests later.

Author: Davies Liu <davies@databricks.com>

Closes #6375 from davies/insertInto and squashes the following commits:

826423e [Davies Liu] add insertInto() to Writer
2015-05-23 09:07:14 -07:00
Davies Liu efe3bfdf49 [SPARK-7322, SPARK-7836, SPARK-7822][SQL] DataFrame window function related updates
1. ntile should take an integer as parameter.
2. Added Python API (based on #6364)
3. Update documentation of various DataFrame Python functions.

Author: Davies Liu <davies@databricks.com>
Author: Reynold Xin <rxin@databricks.com>

Closes #6374 from rxin/window-final and squashes the following commits:

69004c7 [Reynold Xin] Style fix.
288cea9 [Reynold Xin] Update documentaiton.
7cb8985 [Reynold Xin] Merge pull request #6364 from davies/window
66092b4 [Davies Liu] update docs
ed73cb4 [Reynold Xin] [SPARK-7322][SQL] Improve DataFrame window function documentation.
ef55132 [Davies Liu] Merge branch 'master' of github.com:apache/spark into window4
8936ade [Davies Liu] fix maxint in python 3
2649358 [Davies Liu] update docs
778e2c0 [Davies Liu] SPARK-7836 and SPARK-7822: Python API of window functions
2015-05-23 08:30:05 -07:00
Xiangrui Meng 8f11c6116b [SPARK-7535] [.0] [MLLIB] Audit the pipeline APIs for 1.4
Some changes to the pipeilne APIs:

1. Estimator/Transformer/ doesn’t need to extend Params since PipelineStage already does.
1. Move Evaluator to ml.evaluation.
1. Mention larger metric values are better.
1. PipelineModel doc. “compiled” -> “fitted”
1. Hide object PolynomialExpansion.
1. Hide object VectorAssembler.
1. Word2Vec.minCount (and other) -> group param
1. ParamValidators -> DeveloperApi
1. Hide MetadataUtils/SchemaUtils.

jkbradley

Author: Xiangrui Meng <meng@databricks.com>

Closes #6322 from mengxr/SPARK-7535.0 and squashes the following commits:

9e9c7da [Xiangrui Meng] move JavaEvaluator to ml.evaluation as well
e179480 [Xiangrui Meng] move Evaluation to ml.evaluation in PySpark
08ef61f [Xiangrui Meng] update pipieline APIs
2015-05-21 22:57:33 -07:00
Xiangrui Meng f5db4b416c [SPARK-7794] [MLLIB] update RegexTokenizer default settings
The previous default is `{gaps: false, pattern: "\\p{L}+|[^\\p{L}\\s]+"}`. The default pattern is hard to understand. This PR changes the default to `{gaps: true, pattern: "\\s+"}`. jkbradley

Author: Xiangrui Meng <meng@databricks.com>

Closes #6330 from mengxr/SPARK-7794 and squashes the following commits:

5ee7cde [Xiangrui Meng] update RegexTokenizer default settings
2015-05-21 17:59:03 -07:00
Davies Liu 17791a5815 [SPARK-7783] [SQL] [PySpark] add DataFrame.rollup/cube in Python
Author: Davies Liu <davies@databricks.com>

Closes #6311 from davies/rollup and squashes the following commits:

0261db1 [Davies Liu] use @since
a51ca6b [Davies Liu] Merge branch 'master' of github.com:apache/spark into rollup
8ad5af4 [Davies Liu] Update dataframe.py
ade3841 [Davies Liu] add DataFrame.rollup/cube in Python
2015-05-21 17:43:08 -07:00
Holden Karau 6b18cdc1b1 [SPARK-7711] Add a startTime property to match the corresponding one in Scala
Author: Holden Karau <holden@pigscanfly.ca>

Closes #6275 from holdenk/SPARK-771-startTime-is-missing-from-pyspark and squashes the following commits:

06662dc [Holden Karau] add mising blank line for style checks
7a87410 [Holden Karau] add back missing newline
7a7876b [Holden Karau] Add a startTime property to match the corresponding one in the Scala SparkContext
2015-05-21 14:08:57 -07:00
kaka1992 699906e538 [SPARK-7394][SQL] Add Pandas style cast (astype)
Author: kaka1992 <kaka_1992@163.com>

Closes #6313 from kaka1992/astype and squashes the following commits:

73dfd0b [kaka1992] [SPARK-7394] Add Pandas style cast (astype)
ad8feb2 [kaka1992] [SPARK-7394] Add Pandas style cast (astype)
4f328b7 [kaka1992] [SPARK-7394] Add Pandas style cast (astype)
2015-05-21 11:50:39 -07:00
Sean Owen 6e53402696 [SPARK-6416] [DOCS] RDD.fold() requires the operator to be commutative
Document current limitation of rdd.fold.

This does not resolve SPARK-6416 but just documents the issue.
CC JoshRosen

Author: Sean Owen <sowen@cloudera.com>

Closes #6231 from srowen/SPARK-6416 and squashes the following commits:

9fef39f [Sean Owen] Add comment to other languages; reword to highlight the difference from non-distributed collections and to not suggest it is a bug that is to be fixed
da40d84 [Sean Owen] Document current limitation of rdd.fold.
2015-05-21 19:42:51 +01:00
Davies Liu 8ddcb25b39 [SPARK-7606] [SQL] [PySpark] add version to Python SQL API docs
Add version info for public Python SQL API.

cc rxin

Author: Davies Liu <davies@databricks.com>

Closes #6295 from davies/versions and squashes the following commits:

cfd91e6 [Davies Liu] add more version for DataFrame API
600834d [Davies Liu] add version to SQL API docs
2015-05-20 23:05:54 -07:00
Xiangrui Meng c330e52dae [SPARK-7762] [MLLIB] set default value for outputCol
Set a default value for `outputCol` instead of forcing users to name it. This is useful for intermediate transformers in the pipeline. jkbradley

Author: Xiangrui Meng <meng@databricks.com>

Closes #6289 from mengxr/SPARK-7762 and squashes the following commits:

54edebc [Xiangrui Meng] merge master
bff8667 [Xiangrui Meng] update unit test
171246b [Xiangrui Meng] add unit test for outputCol
a4321bd [Xiangrui Meng] set default value for outputCol
2015-05-20 17:26:26 -07:00
Holden Karau 191ee47452 [SPARK-7511] [MLLIB] pyspark ml seed param should be random by default or 42 is quite funny but not very random
Author: Holden Karau <holden@pigscanfly.ca>

Closes #6139 from holdenk/SPARK-7511-pyspark-ml-seed-param-should-be-random-by-default-or-42-is-quite-funny-but-not-very-random and squashes the following commits:

591f8e5 [Holden Karau] specify old seed for doc tests
2470004 [Holden Karau] Fix a bunch of seeds with default values to have None as the default which will then result in using the hash of the class name
cbad96d [Holden Karau] Add the setParams function that is used in the real code
423b8d7 [Holden Karau] Switch the test code to behave slightly more like production code. also don't check the param map value only check for key existence
140d25d [Holden Karau] remove extra space
926165a [Holden Karau] Add some missing newlines for pep8 style
8616751 [Holden Karau] merge in master
58532e6 [Holden Karau] its the __name__ method, also treat None values as not set
56ef24a [Holden Karau] fix test and regenerate base
afdaa5c [Holden Karau] make sure different classes have different results
68eb528 [Holden Karau] switch default seed to hash of type of self
89c4611 [Holden Karau] Merge branch 'master' into SPARK-7511-pyspark-ml-seed-param-should-be-random-by-default-or-42-is-quite-funny-but-not-very-random
31cd96f [Holden Karau] specify the seed to randomforestregressor test
e1b947f [Holden Karau] Style fixes
ce90ec8 [Holden Karau] merge in master
bcdf3c9 [Holden Karau] update docstring seeds to none and some other default seeds from 42
65eba21 [Holden Karau] pep8 fixes
0e3797e [Holden Karau] Make seed default to random in more places
213a543 [Holden Karau] Simplify the generated code to only include set default if there is a default rather than having None is note None in the generated code
1ff17c2 [Holden Karau] Make the seed random for HasSeed in python
2015-05-20 15:16:12 -07:00
Yanbo Liang 98a46f9dff [SPARK-6094] [MLLIB] Add MultilabelMetrics in PySpark/MLlib
Add MultilabelMetrics in PySpark/MLlib

Author: Yanbo Liang <ybliang8@gmail.com>

Closes #6276 from yanboliang/spark-6094 and squashes the following commits:

b8e3343 [Yanbo Liang] Add MultilabelMetrics in PySpark/MLlib
2015-05-20 07:55:51 -07:00
Davies Liu 4de74d2602 [SPARK-7738] [SQL] [PySpark] add reader and writer API in Python
cc rxin, please take a quick look, I'm working on tests.

Author: Davies Liu <davies@databricks.com>

Closes #6238 from davies/readwrite and squashes the following commits:

c7200eb [Davies Liu] update tests
9cbf01b [Davies Liu] Merge branch 'master' of github.com:apache/spark into readwrite
f0c5a04 [Davies Liu] use sqlContext.read.load
5f68bc8 [Davies Liu] update tests
6437e9a [Davies Liu] Merge branch 'master' of github.com:apache/spark into readwrite
bcc6668 [Davies Liu] add reader amd writer API in Python
2015-05-19 14:23:28 -07:00
Daoyuan Wang c2437de189 [SPARK-7150] SparkContext.range() and SQLContext.range()
This PR is based on #6081, thanks adrian-wang.

Closes #6081

Author: Daoyuan Wang <daoyuan.wang@intel.com>
Author: Davies Liu <davies@databricks.com>

Closes #6230 from davies/range and squashes the following commits:

d3ce5fe [Davies Liu] add tests
789eda5 [Davies Liu] add range() in Python
4590208 [Davies Liu] Merge commit 'refs/pull/6081/head' of github.com:apache/spark into range
cbf5200 [Daoyuan Wang] let's add python support in a separate PR
f45e3b2 [Daoyuan Wang] remove redundant toLong
617da76 [Daoyuan Wang] fix safe marge for corner cases
867c417 [Daoyuan Wang] fix
13dbe84 [Daoyuan Wang] update
bd998ba [Daoyuan Wang] update comments
d3a0c1b [Daoyuan Wang] add range api()
2015-05-18 21:43:12 -07:00
Davies Liu 32fbd297dd [SPARK-6216] [PYSPARK] check python version of worker with driver
This PR revert #5404, change to pass the version of python in driver into JVM, check it in worker before deserializing closure, then it can works with different major version of Python.

Author: Davies Liu <davies@databricks.com>

Closes #6203 from davies/py_version and squashes the following commits:

b8fb76e [Davies Liu] fix test
6ce5096 [Davies Liu] use string for version
47c6278 [Davies Liu] check python version of worker with driver
2015-05-18 12:55:13 -07:00
Xiangrui Meng 9c7e802a5a [SPARK-7380] [MLLIB] pipeline stages should be copyable in Python
This PR makes pipeline stages in Python copyable and hence simplifies some implementations. It also includes the following changes:

1. Rename `paramMap` and `defaultParamMap` to `_paramMap` and `_defaultParamMap`, respectively.
2. Accept a list of param maps in `fit`.
3. Use parent uid and name to identify param.

jkbradley

Author: Xiangrui Meng <meng@databricks.com>
Author: Joseph K. Bradley <joseph@databricks.com>

Closes #6088 from mengxr/SPARK-7380 and squashes the following commits:

413c463 [Xiangrui Meng] remove unnecessary doc
4159f35 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-7380
611c719 [Xiangrui Meng] fix python style
68862b8 [Xiangrui Meng] update _java_obj initialization
927ad19 [Xiangrui Meng] fix ml/tests.py
0138fc3 [Xiangrui Meng] update feature transformers and fix a bug in RegexTokenizer
9ca44fb [Xiangrui Meng] simplify Java wrappers and add tests
c7d84ef [Xiangrui Meng] update ml/tests.py to test copy params
7e0d27f [Xiangrui Meng] merge master
46840fb [Xiangrui Meng] update wrappers
b6db1ed [Xiangrui Meng] update all self.paramMap to self._paramMap
46cb6ed [Xiangrui Meng] merge master
a163413 [Xiangrui Meng] fix style
1042e80 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-7380
9630eae [Xiangrui Meng] fix Identifiable._randomUID
13bd70a [Xiangrui Meng] update ml/tests.py
64a536c [Xiangrui Meng] use _fit/_transform/_evaluate to simplify the impl
02abf13 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into copyable-python
66ce18c [Joseph K. Bradley] some cleanups before sending to Xiangrui
7431272 [Joseph K. Bradley] Rebased with master
2015-05-18 12:02:18 -07:00
Xiangrui Meng 1ecfac6e38 [SPARK-6657] [PYSPARK] Fix doc warnings
Fixed the following warnings in `make clean html` under `python/docs`:

~~~
/Users/meng/src/spark/python/pyspark/mllib/evaluation.py:docstring of pyspark.mllib.evaluation.RankingMetrics.ndcgAt:3: ERROR: Unexpected indentation.
/Users/meng/src/spark/python/pyspark/mllib/evaluation.py:docstring of pyspark.mllib.evaluation.RankingMetrics.ndcgAt:4: WARNING: Block quote ends without a blank line; unexpected unindent.
/Users/meng/src/spark/python/pyspark/mllib/fpm.py:docstring of pyspark.mllib.fpm.FPGrowth.train:3: ERROR: Unexpected indentation.
/Users/meng/src/spark/python/pyspark/mllib/fpm.py:docstring of pyspark.mllib.fpm.FPGrowth.train:4: WARNING: Block quote ends without a blank line; unexpected unindent.
/Users/meng/src/spark/python/pyspark/sql/__init__.py:docstring of pyspark.sql.DataFrame.replace:16: WARNING: Field list ends without a blank line; unexpected unindent.
/Users/meng/src/spark/python/pyspark/streaming/kafka.py:docstring of pyspark.streaming.kafka.KafkaUtils.createRDD:8: ERROR: Unexpected indentation.
/Users/meng/src/spark/python/pyspark/streaming/kafka.py:docstring of pyspark.streaming.kafka.KafkaUtils.createRDD:9: WARNING: Block quote ends without a blank line; unexpected unindent.
~~~

davies

Author: Xiangrui Meng <meng@databricks.com>

Closes #6221 from mengxr/SPARK-6657 and squashes the following commits:

e3f83fe [Xiangrui Meng] fix sql and streaming doc warnings
2b4371e [Xiangrui Meng] fix mllib python doc warnings
2015-05-18 08:35:14 -07:00
Davies Liu d7b69946cb [SPARK-7543] [SQL] [PySpark] split dataframe.py into multiple files
dataframe.py is splited into column.py, group.py and dataframe.py:
```
   360 column.py
  1223 dataframe.py
   183 group.py
```

Author: Davies Liu <davies@databricks.com>

Closes #6201 from davies/split_df and squashes the following commits:

fc8f5ab [Davies Liu] split dataframe.py into multiple files
2015-05-15 20:09:15 -07:00