Commit graph

496 commits

Author SHA1 Message Date
Xiangrui Meng 32218307ed [SPARK-4372][MLLIB] Make LR and SVM's default parameters consistent in Scala and Python
The current default regParam is 1.0 and regType is claimed to be none in Python (but actually it is l2), while regParam = 0.0 and regType is L2 in Scala. We should make the default values consistent. This PR sets the default regType to L2 and regParam to 0.01. Note that the default regParam value in LIBLINEAR (and hence scikit-learn) is 1.0. However, we use average loss instead of total loss in our formulation. Hence regParam=1.0 is definitely too heavy.

In LinearRegression, we set regParam=0.0 and regType=None, because we have separate classes for Lasso and Ridge, both of which use regParam=0.01 as the default.

davies atalwalkar

Author: Xiangrui Meng <meng@databricks.com>

Closes #3232 from mengxr/SPARK-4372 and squashes the following commits:

9979837 [Xiangrui Meng] update Ridge/Lasso to use default regParam 0.01 cast input arguments
d3ba096 [Xiangrui Meng] change 'none' back to None
1909a6e [Xiangrui Meng] change default regParam to 0.01 and regType to L2 in LR and SVM
2014-11-13 13:54:16 -08:00
Davies Liu ce0333f9a0 [SPARK-4348] [PySpark] [MLlib] rename random.py to rand.py
This PR rename random.py to rand.py to avoid the side affects of conflict with random module, but still keep the same interface as before.

```
>>> from pyspark.mllib.random import RandomRDDs
```

```
$ pydoc pyspark.mllib.random
Help on module random in pyspark.mllib:
NAME
    random - Python package for random data generation.

FILE
    /Users/davies/work/spark/python/pyspark/mllib/rand.py

CLASSES
    __builtin__.object
        pyspark.mllib.random.RandomRDDs

    class RandomRDDs(__builtin__.object)
     |  Generator methods for creating RDDs comprised of i.i.d samples from
     |  some distribution.
     |
     |  Static methods defined here:
     |
     |  normalRDD(sc, size, numPartitions=None, seed=None)
```

cc mengxr

reference link: http://xion.org.pl/2012/05/06/hacking-python-imports/

Author: Davies Liu <davies@databricks.com>

Closes #3216 from davies/random and squashes the following commits:

7ac4e8b [Davies Liu] rename random.py to rand.py
2014-11-13 10:24:54 -08:00
Davies Liu bd86118c4e [SPARK-4369] [MLLib] fix TreeModel.predict() with RDD
Fix  TreeModel.predict() with RDD, added tests for it.

(Also checked that other models don't have this issue)

Author: Davies Liu <davies@databricks.com>

Closes #3230 from davies/predict and squashes the following commits:

81172aa [Davies Liu] fix predict
2014-11-12 13:56:41 -08:00
Davies Liu 65083e93dd [SPARK-4324] [PySpark] [MLlib] support numpy.array for all MLlib API
This PR check all of the existing Python MLlib API to make sure that numpy.array is supported as Vector (also RDD of numpy.array).

It also improve some docstring and doctest.

cc mateiz mengxr

Author: Davies Liu <davies@databricks.com>

Closes #3189 from davies/numpy and squashes the following commits:

d5057c4 [Davies Liu] fix tests
6987611 [Davies Liu] support numpy.array for all MLlib API
2014-11-10 22:26:16 -08:00
Michelangelo D'Agostino 7e9d975676 [MLLIB] [PYTHON] SPARK-4221: Expose nonnegative ALS in the python API
SPARK-1553 added alternating nonnegative least squares to MLLib, however it's not possible to access it via the python API.  This pull request resolves that.

Author: Michelangelo D'Agostino <mdagostino@civisanalytics.com>

Closes #3095 from mdagost/python_nmf and squashes the following commits:

a6743ad [Michelangelo D'Agostino] Use setters instead of static methods in PythonMLLibAPI.  Remove the new static methods I added.  Set seed in tests.  Change ratings to ratingsRDD in both train and trainImplicit for consistency.
7cffd39 [Michelangelo D'Agostino] Swapped nonnegative and seed in a few more places.
3fdc851 [Michelangelo D'Agostino] Moved seed to the end of the python parameter list.
bdcc154 [Michelangelo D'Agostino] Change seed type to java.lang.Long so that it can handle null.
cedf043 [Michelangelo D'Agostino] Added in ability to set the seed from python and made that play nice with the nonnegative changes.  Also made the python ALS tests more exact.
a72fdc9 [Michelangelo D'Agostino] Expose nonnegative ALS in the python API.
2014-11-07 22:53:01 -08:00
Davies Liu 7779109796 [SPARK-4304] [PySpark] Fix sort on empty RDD
This PR fix sortBy()/sortByKey() on empty RDD.

This should be back ported into 1.1/1.2

Author: Davies Liu <davies@databricks.com>

Closes #3162 from davies/fix_sort and squashes the following commits:

84f64b7 [Davies Liu] add tests
52995b5 [Davies Liu] fix sortByKey() on empty RDD
2014-11-07 20:53:03 -08:00
Davies Liu b41a39e240 [SPARK-4186] add binaryFiles and binaryRecords in Python
add binaryFiles() and binaryRecords() in Python
```
binaryFiles(self, path, minPartitions=None):
    :: Developer API ::

    Read a directory of binary files from HDFS, a local file system
    (available on all nodes), or any Hadoop-supported file system URI
    as a byte array. Each file is read as a single record and returned
    in a key-value pair, where the key is the path of each file, the
    value is the content of each file.

    Note: Small files are preferred, large file is also allowable, but
    may cause bad performance.

binaryRecords(self, path, recordLength):
    Load data from a flat binary file, assuming each record is a set of numbers
    with the specified numerical format (see ByteBuffer), and the number of
    bytes per record is constant.

    :param path: Directory to the input data files
    :param recordLength: The length at which to split the records
```

Author: Davies Liu <davies@databricks.com>

Closes #3078 from davies/binary and squashes the following commits:

cd0bdbd [Davies Liu] Merge branch 'master' of github.com:apache/spark into binary
3aa349b [Davies Liu] add experimental notes
24e84b6 [Davies Liu] Merge branch 'master' of github.com:apache/spark into binary
5ceaa8a [Davies Liu] Merge branch 'master' of github.com:apache/spark into binary
1900085 [Davies Liu] bugfix
bb22442 [Davies Liu] add binaryFiles and binaryRecords in Python
2014-11-06 00:22:19 -08:00
Davies Liu c8abddc516 [SPARK-3964] [MLlib] [PySpark] add Hypothesis test Python API
```
pyspark.mllib.stat.StatisticschiSqTest(observed, expected=None)
    :: Experimental ::

    If `observed` is Vector, conduct Pearson's chi-squared goodness
    of fit test of the observed data against the expected distribution,
    or againt the uniform distribution (by default), with each category
    having an expected frequency of `1 / len(observed)`.
    (Note: `observed` cannot contain negative values)

    If `observed` is matrix, conduct Pearson's independence test on the
    input contingency matrix, which cannot contain negative entries or
    columns or rows that sum up to 0.

    If `observed` is an RDD of LabeledPoint, conduct Pearson's independence
    test for every feature against the label across the input RDD.
    For each feature, the (feature, label) pairs are converted into a
    contingency matrix for which the chi-squared statistic is computed.
    All label and feature values must be categorical.

    :param observed: it could be a vector containing the observed categorical
                     counts/relative frequencies, or the contingency matrix
                     (containing either counts or relative frequencies),
                     or an RDD of LabeledPoint containing the labeled dataset
                     with categorical features. Real-valued features will be
                     treated as categorical for each distinct value.
    :param expected: Vector containing the expected categorical counts/relative
                     frequencies. `expected` is rescaled if the `expected` sum
                     differs from the `observed` sum.
    :return: ChiSquaredTest object containing the test statistic, degrees
             of freedom, p-value, the method used, and the null hypothesis.
```

Author: Davies Liu <davies@databricks.com>

Closes #3091 from davies/his and squashes the following commits:

145d16c [Davies Liu] address comments
0ab0764 [Davies Liu] fix float
5097d54 [Davies Liu] add Hypothesis test Python API
2014-11-04 21:35:52 -08:00
Davies Liu e4f42631a6 [SPARK-3886] [PySpark] simplify serializer, use AutoBatchedSerializer by default.
This PR simplify serializer, always use batched serializer (AutoBatchedSerializer as default), even batch size is 1.

Author: Davies Liu <davies@databricks.com>

This patch had conflicts when merged, resolved by
Committer: Josh Rosen <joshrosen@databricks.com>

Closes #2920 from davies/fix_autobatch and squashes the following commits:

e544ef9 [Davies Liu] revert unrelated change
6880b14 [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_autobatch
1d557fc [Davies Liu] fix tests
8180907 [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_autobatch
76abdce [Davies Liu] clean up
53fa60b [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_autobatch
d7ac751 [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_autobatch
2cc2497 [Davies Liu] Merge branch 'master' of github.com:apache/spark into fix_autobatch
b4292ce [Davies Liu] fix bug in master
d79744c [Davies Liu] recover hive tests
be37ece [Davies Liu] refactor
eb3938d [Davies Liu] refactor serializer in scala
8d77ef2 [Davies Liu] simplify serializer, use AutoBatchedSerializer by default.
2014-11-03 23:56:14 -08:00
Xiangrui Meng 1a9c6cddad [SPARK-3573][MLLIB] Make MLlib's Vector compatible with SQL's SchemaRDD
Register MLlib's Vector as a SQL user-defined type (UDT) in both Scala and Python. With this PR, we can easily map a RDD[LabeledPoint] to a SchemaRDD, and then select columns or save to a Parquet file. Examples in Scala/Python are attached. The Scala code was copied from jkbradley.

~~This PR contains the changes from #3068 . I will rebase after #3068 is merged.~~

marmbrus jkbradley

Author: Xiangrui Meng <meng@databricks.com>

Closes #3070 from mengxr/SPARK-3573 and squashes the following commits:

3a0b6e5 [Xiangrui Meng] organize imports
236f0a0 [Xiangrui Meng] register vector as UDT and provide dataset examples
2014-11-03 22:29:48 -08:00
Xiangrui Meng 04450d1154 [SPARK-4192][SQL] Internal API for Python UDT
Following #2919, this PR adds Python UDT (for internal use only) with tests under "pyspark.tests". Before `SQLContext.applySchema`, we check whether we need to convert user-type instances into SQL recognizable data. In the current implementation, a Python UDT must be paired with a Scala UDT for serialization on the JVM side. A following PR will add VectorUDT in MLlib for both Scala and Python.

marmbrus jkbradley davies

Author: Xiangrui Meng <meng@databricks.com>

Closes #3068 from mengxr/SPARK-4192-sql and squashes the following commits:

acff637 [Xiangrui Meng] merge master
dba5ea7 [Xiangrui Meng] only use pyClass for Python UDT output sqlType as well
2c9d7e4 [Xiangrui Meng] move import to global setup; update needsConversion
7c4a6a9 [Xiangrui Meng] address comments
75223db [Xiangrui Meng] minor update
f740379 [Xiangrui Meng] remove UDT from default imports
e98d9d0 [Xiangrui Meng] fix py style
4e84fce [Xiangrui Meng] remove local hive tests and add more tests
39f19e0 [Xiangrui Meng] add tests
b7f666d [Xiangrui Meng] add Python UDT
2014-11-03 19:29:11 -08:00
Davies Liu 24544fbce0 [SPARK-3594] [PySpark] [SQL] take more rows to infer schema or sampling
This patch will try to infer schema for RDD which has empty value (None, [], {}) in the first row. It will try first 100 rows and merge the types into schema, also merge fields of StructType together. If there is still NullType in schema, then it will show an warning, tell user to try with sampling.

If sampling is presented, it will infer schema from all the rows after sampling.

Also, add samplingRatio for jsonFile() and jsonRDD()

Author: Davies Liu <davies.liu@gmail.com>
Author: Davies Liu <davies@databricks.com>

Closes #2716 from davies/infer and squashes the following commits:

e678f6d [Davies Liu] Merge branch 'master' of github.com:apache/spark into infer
34b5c63 [Davies Liu] Merge branch 'master' of github.com:apache/spark into infer
567dc60 [Davies Liu] update docs
9767b27 [Davies Liu] Merge branch 'master' into infer
e48d7fb [Davies Liu] fix tests
29e94d5 [Davies Liu] let NullType inherit from PrimitiveType
ee5d524 [Davies Liu] Merge branch 'master' of github.com:apache/spark into infer
540d1d5 [Davies Liu] merge fields for StructType
f93fd84 [Davies Liu] add more tests
3603e00 [Davies Liu] take more rows to infer schema, or infer the schema by sampling the RDD
2014-11-03 13:17:09 -08:00
Xiangrui Meng 3cca196220 [SPARK-4148][PySpark] fix seed distribution and add some tests for rdd.sample
The current way of seed distribution makes the random sequences from partition i and i+1 offset by 1.

~~~
In [14]: import random

In [15]: r1 = random.Random(10)

In [16]: r1.randint(0, 1)
Out[16]: 1

In [17]: r1.random()
Out[17]: 0.4288890546751146

In [18]: r1.random()
Out[18]: 0.5780913011344704

In [19]: r2 = random.Random(10)

In [20]: r2.randint(0, 1)
Out[20]: 1

In [21]: r2.randint(0, 1)
Out[21]: 0

In [22]: r2.random()
Out[22]: 0.5780913011344704
~~~

Note: The new tests are not for this bug fix.

Author: Xiangrui Meng <meng@databricks.com>

Closes #3010 from mengxr/SPARK-4148 and squashes the following commits:

869ae4b [Xiangrui Meng] move tests tests.py
c1bacd9 [Xiangrui Meng] fix seed distribution and add some tests for rdd.sample
2014-11-03 12:24:24 -08:00
Matei Zaharia 23f966f475 [SPARK-3930] [SPARK-3933] Support fixed-precision decimal in SQL, and some optimizations
- Adds optional precision and scale to Spark SQL's decimal type, which behave similarly to those in Hive 13 (https://cwiki.apache.org/confluence/download/attachments/27362075/Hive_Decimal_Precision_Scale_Support.pdf)
- Replaces our internal representation of decimals with a Decimal class that can store small values in a mutable Long, saving memory in this situation and letting some operations happen directly on Longs

This is still marked WIP because there are a few TODOs, but I'll remove that tag when done.

Author: Matei Zaharia <matei@databricks.com>

Closes #2983 from mateiz/decimal-1 and squashes the following commits:

35e6b02 [Matei Zaharia] Fix issues after merge
227f24a [Matei Zaharia] Review comments
31f915e [Matei Zaharia] Implement Davies's suggestions in Python
eb84820 [Matei Zaharia] Support reading/writing decimals as fixed-length binary in Parquet
4dc6bae [Matei Zaharia] Fix decimal support in PySpark
d1d9d68 [Matei Zaharia] Fix compile error and test issues after rebase
b28933d [Matei Zaharia] Support decimal precision/scale in Hive metastore
2118c0d [Matei Zaharia] Some test and bug fixes
81db9cb [Matei Zaharia] Added mutable Decimal that will be more efficient for small precisions
7af0c3b [Matei Zaharia] Add optional precision and scale to DecimalType, but use Unlimited for now
ec0a947 [Matei Zaharia] Make the result of AVG on Decimals be Decimal, not Double
2014-11-01 19:29:14 -07:00
Xiangrui Meng 1d4f355203 [SPARK-3569][SQL] Add metadata field to StructField
Add `metadata: Metadata` to `StructField` to store extra information of columns. `Metadata` is a simple wrapper over `Map[String, Any]` with value types restricted to Boolean, Long, Double, String, Metadata, and arrays of those types. SerDe is via JSON.

Metadata is preserved through simple operations like `SELECT`.

marmbrus liancheng

Author: Xiangrui Meng <meng@databricks.com>
Author: Michael Armbrust <michael@databricks.com>

Closes #2701 from mengxr/structfield-metadata and squashes the following commits:

dedda56 [Xiangrui Meng] merge remote
5ef930a [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into structfield-metadata
c35203f [Xiangrui Meng] Merge pull request #1 from marmbrus/pr/2701
886b85c [Michael Armbrust] Expose Metadata and MetadataBuilder through the public scala and java packages.
589f314 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into structfield-metadata
1e2abcf [Xiangrui Meng] change default value of metadata to None in python
611d3c2 [Xiangrui Meng] move metadata from Expr to NamedExpr
ddfcfad [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into structfield-metadata
a438440 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into structfield-metadata
4266f4d [Xiangrui Meng] add StructField.toString back for backward compatibility
3f49aab [Xiangrui Meng] remove StructField.toString
24a9f80 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into structfield-metadata
473a7c5 [Xiangrui Meng] merge master
c9d7301 [Xiangrui Meng] organize imports
1fcbf13 [Xiangrui Meng] change metadata type in StructField for Scala/Java
60cc131 [Xiangrui Meng] add doc and header
60614c7 [Xiangrui Meng] add metadata
e42c452 [Xiangrui Meng] merge master
93518fb [Xiangrui Meng] support metadata in python
905bb89 [Xiangrui Meng] java conversions
618e349 [Xiangrui Meng] make tests work in scala
61b8e0f [Xiangrui Meng] merge master
7e5a322 [Xiangrui Meng] do not output metadata in StructField.toString
c41a664 [Xiangrui Meng] merge master
d8af0ed [Xiangrui Meng] move tests to SQLQuerySuite
67fdebb [Xiangrui Meng] add test on join
d65072e [Xiangrui Meng] remove Map.empty
367d237 [Xiangrui Meng] add test
c194d5e [Xiangrui Meng] add metadata field to StructField and Attribute
2014-11-01 14:37:00 -07:00
Kousuke Saruta 55ab777078 [SPARK-3870] EOL character enforcement
We have shell scripts and Windows batch files, so we should enforce proper EOL character.

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #2726 from sarutak/eol-enforcement and squashes the following commits:

9748c3f [Kousuke Saruta] Fixed make.bat
252de89 [Kousuke Saruta] Removed extra characters from make.bat
5b81c00 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into eol-enforcement
8633ed2 [Kousuke Saruta] merge branch 'master' of git://git.apache.org/spark into eol-enforcement
5d630d8 [Kousuke Saruta] Merged
ba10797 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into eol-enforcement
7407515 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into eol-enforcement
772fd4e [Kousuke Saruta] Normized EOL character in make.bat and compute-classpath.cmd
ac7f873 [Kousuke Saruta] Added an entry for .gitattributes to .rat-excludes
1570e77 [Kousuke Saruta] Added .gitattributes
2014-10-31 12:39:52 -07:00
Xiangrui Meng f1e7361f66 [SPARK-4150][PySpark] return self in rdd.setName
Then we can do `rdd.setName('abc').cache().count()`.

Author: Xiangrui Meng <meng@databricks.com>

Closes #3011 from mengxr/rdd-setname and squashes the following commits:

10d0d60 [Xiangrui Meng] update test
4ac3bbd [Xiangrui Meng] return self in rdd.setName
2014-10-31 12:07:48 -07:00
wangfei 7c41d13570 [SPARK-3826][SQL]enable hive-thriftserver to support hive-0.13.1
In #2241 hive-thriftserver is not enabled. This patch enable hive-thriftserver to support hive-0.13.1 by using a shim layer refer to #2241.

 1 A light shim layer(code in sql/hive-thriftserver/hive-version) for each different hive version to handle api compatibility

 2 New pom profiles "hive-default" and "hive-versions"(copy from #2241) to activate different hive version

 3 SBT cmd for different version as follows:
   hive-0.12.0 --- sbt/sbt -Phive,hadoop-2.3 -Phive-0.12.0 assembly
   hive-0.13.1 --- sbt/sbt -Phive,hadoop-2.3 -Phive-0.13.1 assembly

 4 Since hive-thriftserver depend on hive subproject, this patch should be merged with #2241 to enable hive-0.13.1 for hive-thriftserver

Author: wangfei <wangfei1@huawei.com>
Author: scwf <wangfei1@huawei.com>

Closes #2685 from scwf/shim-thriftserver1 and squashes the following commits:

f26f3be [wangfei] remove clean to save time
f5cac74 [wangfei] remove local hivecontext test
578234d [wangfei] use new shaded hive
18fb1ff [wangfei] exclude kryo in hive pom
fa21d09 [wangfei] clean package assembly/assembly
8a4daf2 [wangfei] minor fix
0d7f6cf [wangfei] address comments
f7c93ae [wangfei] adding build with hive 0.13 before running tests
bcf943f [wangfei] Merge branch 'master' of https://github.com/apache/spark into shim-thriftserver1
c359822 [wangfei] reuse getCommandProcessor in hiveshim
52674a4 [scwf] sql/hive included since examples depend on it
3529e98 [scwf] move hive module to hive profile
f51ff4e [wangfei] update and fix conflicts
f48d3a5 [scwf] Merge branch 'master' of https://github.com/apache/spark into shim-thriftserver1
41f727b [scwf] revert pom changes
13afde0 [scwf] fix small bug
4b681f4 [scwf] enable thriftserver in profile hive-0.13.1
0bc53aa [scwf] fixed when result filed is null
dfd1c63 [scwf] update run-tests to run hive-0.12.0 default now
c6da3ce [scwf] Merge branch 'master' of https://github.com/apache/spark into shim-thriftserver
7c66b8e [scwf] update pom according spark-2706
ae47489 [scwf] update and fix conflicts
2014-10-31 11:27:59 -07:00
Davies Liu 872fc669b4 [SPARK-4124] [MLlib] [PySpark] simplify serialization in MLlib Python API
Create several helper functions to call MLlib Java API, convert the arguments to Java type and convert return value to Python object automatically, this simplify serialization in MLlib Python API very much.

After this, the MLlib Python API does not need to deal with serialization details anymore, it's easier to add new API.

cc mengxr

Author: Davies Liu <davies@databricks.com>

Closes #2995 from davies/cleanup and squashes the following commits:

8fa6ec6 [Davies Liu] address comments
16b85a0 [Davies Liu] Merge branch 'master' of github.com:apache/spark into cleanup
43743e5 [Davies Liu] bugfix
731331f [Davies Liu] simplify serialization in MLlib Python API
2014-10-30 22:25:18 -07:00
Davies Liu 8c0bfd08fc [SPARK-4133] [SQL] [PySpark] type conversionfor python udf
Call Python UDF on ArrayType/MapType/PrimitiveType, the returnType can also be ArrayType/MapType/PrimitiveType.

For StructType, it will act as tuple (without attributes). If returnType is StructType, it also should be tuple.

Author: Davies Liu <davies@databricks.com>

Closes #2973 from davies/udf_array and squashes the following commits:

306956e [Davies Liu] Merge branch 'master' of github.com:apache/spark into udf_array
2c00e43 [Davies Liu] fix merge
11395fa [Davies Liu] Merge branch 'master' of github.com:apache/spark into udf_array
9df50a2 [Davies Liu] address comments
79afb4e [Davies Liu] type conversionfor python udf
2014-10-28 19:38:16 -07:00
Daoyuan Wang 47a40f60d6 [SPARK-3988][SQL] add public API for date type
Add json and python api for date type.
By using Pickle, `java.sql.Date` was serialized as calendar, and recognized in python as `datetime.datetime`.

Author: Daoyuan Wang <daoyuan.wang@intel.com>

Closes #2901 from adrian-wang/spark3988 and squashes the following commits:

c51a24d [Daoyuan Wang] convert datetime to date
5670626 [Daoyuan Wang] minor line combine
f760d8e [Daoyuan Wang] fix indent
444f100 [Daoyuan Wang] fix a typo
1d74448 [Daoyuan Wang] fix scala style
8d7dd22 [Daoyuan Wang] add json and python api for date type
2014-10-28 13:43:25 -07:00
Kousuke Saruta 6c1b981c3f [SPARK-4058] [PySpark] Log file name is hard coded even though there is a variable '$LOG_FILE '
In a script 'python/run-tests', log file name is represented by a variable 'LOG_FILE' and it is used in run-tests. But, there are some hard-coded log file name in the script.

Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #2905 from sarutak/SPARK-4058 and squashes the following commits:

7710490 [Kousuke Saruta] Fixed python/run-tests not to use hard-coded log file name
2014-10-28 12:58:25 -07:00
Davies Liu fae095bc7c [SPARK-3961] [MLlib] [PySpark] Python API for mllib.feature
Added completed Python API for MLlib.feature

Normalizer
StandardScalerModel
StandardScaler
HashTF
IDFModel
IDF

cc mengxr

Author: Davies Liu <davies@databricks.com>
Author: Davies Liu <davies.liu@gmail.com>

Closes #2819 from davies/feature and squashes the following commits:

4f48f48 [Davies Liu] add a note for HashingTF
67f6d21 [Davies Liu] address comments
b628693 [Davies Liu] rollback changes in Word2Vec
efb4f4f [Davies Liu] Merge branch 'master' into feature
806c7c2 [Davies Liu] address comments
3abb8c2 [Davies Liu] address comments
59781b9 [Davies Liu] Merge branch 'master' of github.com:apache/spark into feature
a405ae7 [Davies Liu] fix tests
7a1891a [Davies Liu] fix tests
486795f [Davies Liu] update programming guide, HashTF -> HashingTF
8a50584 [Davies Liu] Python API for mllib.feature
2014-10-28 03:50:22 -07:00
Sean Owen bfa614b127 SPARK-4022 [CORE] [MLLIB] Replace colt dependency (LGPL) with commons-math
This change replaces usages of colt with commons-math3 equivalents, and makes some minor necessary adjustments to related code and tests to match.

Author: Sean Owen <sowen@cloudera.com>

Closes #2928 from srowen/SPARK-4022 and squashes the following commits:

61a232f [Sean Owen] Fix failure due to different sampling in JavaAPISuite.sample()
16d66b8 [Sean Owen] Simplify seeding with call to reseedRandomGenerator
a1a78e0 [Sean Owen] Use Well19937c
31c7641 [Sean Owen] Fix Python Poisson test by choosing a different seed; about 88% of seeds should work but 1 didn't, it seems
5c9c67f [Sean Owen] Additional test fixes from review
d8f88e0 [Sean Owen] Replace colt with commons-math3. Some tests do not pass yet.
2014-10-27 10:53:15 -07:00
Davies Liu e41786c774 [SPARK-4088] [PySpark] Python worker should exit after socket is closed by JVM
In case of take() or exception in Python, python worker may exit before JVM read() all the response, then the write thread may raise "Connection reset" exception.

Python should always wait JVM to close the socket first.

cc JoshRosen This is a warm fix, or the tests will be flaky, sorry for that.

Author: Davies Liu <davies@databricks.com>

Closes #2941 from davies/fix_exit and squashes the following commits:

9d4d21e [Davies Liu] fix race
2014-10-25 01:20:39 -07:00
Davies Liu d60a9d440b [SPARK-4051] [SQL] [PySpark] Convert Row into dictionary
Added a method to Row to turn row into dict:

```
>>> row = Row(a=1)
>>> row.asDict()
{'a': 1}
```

Author: Davies Liu <davies@databricks.com>

Closes #2896 from davies/dict and squashes the following commits:

8d97366 [Davies Liu] convert Row into dict
2014-10-24 10:48:03 -07:00
Davies Liu 809c785bcc [SPARK-2652] [PySpark] donot use KyroSerializer as default serializer
KyroSerializer can not serialize customized class without registered explicitly, use it as default serializer in PySpark will introduce some regression in MLlib.

cc mengxr

Author: Davies Liu <davies@databricks.com>

Closes #2916 from davies/revert and squashes the following commits:

43eb6d3 [Davies Liu] donot use KyroSerializer as default serializer
2014-10-23 23:58:00 -07:00
Davies Liu e595c8d08a [SPARK-3993] [PySpark] fix bug while reuse worker after take()
After take(), maybe there are some garbage left in the socket, then next task assigned to this worker will hang because of corrupted data.

We should make sure the socket is clean before reuse it, write END_OF_STREAM at the end, and check it after read out all result from python.

Author: Davies Liu <davies.liu@gmail.com>
Author: Davies Liu <davies@databricks.com>

Closes #2838 from davies/fix_reuse and squashes the following commits:

8872914 [Davies Liu] fix tests
660875b [Davies Liu] fix bug while reuse worker after take()
2014-10-23 17:20:00 -07:00
freeman 97cf19f64e Fix for sampling error in NumPy v1.9 [SPARK-3995][PYSPARK]
Change maximum value for default seed during RDD sampling so that it is strictly less than 2 ** 32. This prevents a bug in the most recent version of NumPy, which cannot accept random seeds above this bound.

Adds an extra test that uses the default seed (instead of setting it manually, as in the docstrings).

mengxr

Author: freeman <the.freeman.lab@gmail.com>

Closes #2889 from freeman-lab/pyspark-sampling and squashes the following commits:

dc385ef [freeman] Change maximum value for default seed
2014-10-22 09:33:12 -07:00
Michelangelo D'Agostino 1a623b2e16 SPARK-3770: Make userFeatures accessible from python
https://issues.apache.org/jira/browse/SPARK-3770

We need access to the underlying latent user features from python. However, the userFeatures RDD from the MatrixFactorizationModel isn't accessible from the python bindings. I've added a method to the underlying scala class to turn the RDD[(Int, Array[Double])] to an RDD[String]. This is then accessed from the python recommendation.py

Author: Michelangelo D'Agostino <mdagostino@civisanalytics.com>

Closes #2636 from mdagost/mf_user_features and squashes the following commits:

c98f9e2 [Michelangelo D'Agostino] Added unit tests for userFeatures and productFeatures and merged master.
d5eadf8 [Michelangelo D'Agostino] Merge branch 'master' into mf_user_features
2481a2a [Michelangelo D'Agostino] Merged master and resolved conflict.
a6ffb96 [Michelangelo D'Agostino] Eliminated a function from our first approach to this problem that is no longer needed now that we added the fromTuple2RDD function.
2aa1bf8 [Michelangelo D'Agostino] Implemented a function called fromTuple2RDD in PythonMLLibAPI and used it to expose the MF userFeatures and productFeatures in python.
34cb2a2 [Michelangelo D'Agostino] A couple of lint cleanups and a comment.
cdd98e3 [Michelangelo D'Agostino] It's working now.
e1fbe5e [Michelangelo D'Agostino] Added scala function to stringify userFeatures for access in python.
2014-10-21 11:49:39 -07:00
Holden Karau 2aeb84bc79 replace awaitTransformation with awaitTermination in scaladoc/javadoc
Author: Holden Karau <holden@pigscanfly.ca>

Closes #2861 from holdenk/SPARK-4015-Documentation-in-the-streaming-context-references-non-existent-function and squashes the following commits:

081db8a [Holden Karau] fix pyspark streaming doc too
0e03863 [Holden Karau] replace awaitTransformation with awaitTermination
2014-10-21 09:37:17 -07:00
Davies Liu 8570816834 [SPARK-4023] [MLlib] [PySpark] convert rdd into RDD of Vector
Convert the input rdd to RDD of Vector.

cc mengxr

Author: Davies Liu <davies@databricks.com>

Closes #2870 from davies/fix4023 and squashes the following commits:

1eac767 [Davies Liu] address comments
0871576 [Davies Liu] convert rdd into RDD of Vector
2014-10-21 09:29:45 -07:00
Qiping Li eadc4c590e [SPARK-3207][MLLIB]Choose splits for continuous features in DecisionTree more adaptively
DecisionTree splits on continuous features by choosing an array of values from a subsample of the data.
Currently, it does not check for identical values in the subsample, so it could end up having multiple copies of the same split. In this PR, we choose splits for a continuous feature in 3 steps:

1. Sort sample values for this feature
2. Get number of occurrence of each distinct value
3. Iterate the value count array computed in step 2 to choose splits.

After find splits, `numSplits` and `numBins` in metadata will be updated.

CC: mengxr manishamde jkbradley, please help me review this, thanks.

Author: Qiping Li <liqiping1991@gmail.com>
Author: chouqin <liqiping1991@gmail.com>
Author: liqi <liqiping1991@gmail.com>
Author: qiping.lqp <qiping.lqp@alibaba-inc.com>

Closes #2780 from chouqin/dt-findsplits and squashes the following commits:

18d0301 [Qiping Li] check explicitly findsplits return distinct splits
8dc28ab [chouqin] remove blank lines
ffc920f [chouqin] adjust code based on comments and add more test cases
9857039 [chouqin] Merge branch 'master' of https://github.com/apache/spark into dt-findsplits
d353596 [qiping.lqp] fix pyspark doc test
9e64699 [Qiping Li] fix random forest unit test
3c72913 [Qiping Li] fix random forest unit test
092efcb [Qiping Li] fix bug
f69f47f [Qiping Li] fix bug
ab303a4 [Qiping Li] fix bug
af6dc97 [Qiping Li] fix bug
2a8267a [Qiping Li] fix bug
c339a61 [Qiping Li] fix bug
369f812 [Qiping Li] fix style
8f46af6 [Qiping Li] add comments and unit test
9e7138e [Qiping Li] Merge branch 'dt-findsplits' of https://github.com/chouqin/spark into dt-findsplits
1b25a35 [Qiping Li] Merge branch 'master' of https://github.com/apache/spark into dt-findsplits
0cd744a [liqi] fix bug
3652823 [Qiping Li] fix bug
af7cb79 [Qiping Li] Choose splits for continuous features in DecisionTree more adaptively
2014-10-20 13:12:26 -07:00
Davies Liu 05db2da7dc [SPARK-3952] [Streaming] [PySpark] add Python examples in Streaming Programming Guide
Having Python examples in Streaming Programming Guide.

Also add RecoverableNetworkWordCount example.

Author: Davies Liu <davies.liu@gmail.com>
Author: Davies Liu <davies@databricks.com>

Closes #2808 from davies/pyguide and squashes the following commits:

8d4bec4 [Davies Liu] update readme
26a7e37 [Davies Liu] fix format
3821c4d [Davies Liu] address comments, add missing file
7e4bb8a [Davies Liu] add Python examples in Streaming Programming Guide
2014-10-18 19:14:48 -07:00
Michael Armbrust adcb7d3350 [SPARK-3855][SQL] Preserve the result attribute of python UDFs though transformations
In the current implementation it was possible for the reference to change after analysis.

Author: Michael Armbrust <michael@databricks.com>

Closes #2717 from marmbrus/pythonUdfResults and squashes the following commits:

da14879 [Michael Armbrust] Fix test
6343bcb [Michael Armbrust] add test
9533286 [Michael Armbrust] Correctly preserve the result attribute of python UDFs though transformations
2014-10-17 14:12:07 -07:00
Davies Liu 091d32c52e [SPARK-3971] [MLLib] [PySpark] hotfix: Customized pickler should work in cluster mode
Customized pickler should be registered before unpickling, but in executor, there is no way to register the picklers before run the tasks.

So, we need to register the picklers in the tasks itself, duplicate the javaToPython() and pythonToJava() in MLlib, call SerDe.initialize() before pickling or unpickling.

Author: Davies Liu <davies.liu@gmail.com>

Closes #2830 from davies/fix_pickle and squashes the following commits:

0c85fb9 [Davies Liu] revert the privacy change
6b94e15 [Davies Liu] use JavaConverters instead of JavaConversions
0f02050 [Davies Liu] hotfix: Customized pickler does not work in cluster
2014-10-16 14:56:50 -07:00
Masayoshi TSUZUKI 66af8e2508 [SPARK-3943] Some scripts bin\*.cmd pollutes environment variables in Windows
Modified not to pollute environment variables.
Just moved the main logic into `XXX2.cmd` from `XXX.cmd`, and call `XXX2.cmd` with cmd command in `XXX.cmd`.
`pyspark.cmd` and `spark-class.cmd` are already using the same way, but `spark-shell.cmd`, `spark-submit.cmd` and `/python/docs/make.bat` are not.

Author: Masayoshi TSUZUKI <tsudukim@oss.nttdata.co.jp>

Closes #2797 from tsudukim/feature/SPARK-3943 and squashes the following commits:

b397a7d [Masayoshi TSUZUKI] [SPARK-3943] Some scripts bin\*.cmd pollutes environment variables in Windows
2014-10-14 18:50:14 -07:00
Masayoshi TSUZUKI 7ced88b0d6 [SPARK-3946] gitignore in /python includes wrong directory
Modified to ignore not the docs/ directory, but only the docs/_build/ which is the output directory of sphinx build.

Author: Masayoshi TSUZUKI <tsudukim@oss.nttdata.co.jp>

Closes #2796 from tsudukim/feature/SPARK-3946 and squashes the following commits:

2bea6a9 [Masayoshi TSUZUKI] [SPARK-3946] gitignore in /python includes wrong directory
2014-10-14 14:09:39 -07:00
yingjieMiao 49bbdcb660 [Spark] RDD take() method: overestimate too much
In the comment (Line 1083), it says: "Otherwise, interpolate the number of partitions we need to try, but overestimate it by 50%."

`(1.5 * num * partsScanned / buf.size).toInt` is the guess of "num of total partitions needed". In every iteration, we should consider the increment `(1.5 * num * partsScanned / buf.size).toInt - partsScanned`
Existing implementation 'exponentially' grows `partsScanned ` ( roughly: `x_{n+1} >= (1.5 + 1) x_n`)

This could be a performance problem. (unless this is the intended behavior)

Author: yingjieMiao <yingjie@42go.com>

Closes #2648 from yingjieMiao/rdd_take and squashes the following commits:

d758218 [yingjieMiao] scala style fix
a8e74bb [yingjieMiao] python style fix
4b6e777 [yingjieMiao] infix operator style fix
4391d3b [yingjieMiao] typo fix.
692f4e6 [yingjieMiao] cap numPartsToTry
c4483dc [yingjieMiao] style fix
1d2c410 [yingjieMiao] also change in rdd.py and AsyncRDD
d31ff7e [yingjieMiao] handle the edge case after 1 iteration
a2aa36b [yingjieMiao] RDD take method: overestimate too much
2014-10-13 13:11:55 -07:00
Ken Takagiwa d8b8c21078 Add echo "Run streaming tests ..."
Author: Ken Takagiwa <ugw.gi.world@gmail.com>

Closes #2778 from giwa/patch-2 and squashes the following commits:

a59f9a1 [Ken Takagiwa] Add echo "Run streaming tests ..."
2014-10-12 23:05:14 -07:00
giwa 69c67abaa9 [SPARK-2377] Python API for Streaming
This patch brings Python API for Streaming.

This patch is based on work from @giwa

Author: giwa <ugw.gi.world@gmail.com>
Author: Ken Takagiwa <ken@Kens-MacBook-Pro.local>
Author: Davies Liu <davies.liu@gmail.com>
Author: Ken Takagiwa <ken@kens-mbp.gateway.sonic.net>
Author: Tathagata Das <tathagata.das1565@gmail.com>
Author: Ken <ugw.gi.world@gmail.com>
Author: Ken Takagiwa <ugw.gi.world@gmail.com>
Author: Matthew Farrellee <matt@redhat.com>

Closes #2538 from davies/streaming and squashes the following commits:

64561e4 [Davies Liu] fix tests
331ecce [Davies Liu] fix example
3e2492b [Davies Liu] change updateStateByKey() to easy API
182be73 [Davies Liu] Merge branch 'master' of github.com:apache/spark into streaming
02d0575 [Davies Liu] add wrapper for foreachRDD()
bebeb4a [Davies Liu] address all comments
6db00da [Davies Liu] Merge branch 'master' of github.com:apache/spark into streaming
8380064 [Davies Liu] Merge branch 'master' of github.com:apache/spark into streaming
52c535b [Davies Liu] remove fix for sum()
e108ec1 [Davies Liu]  address comments
37fe06f [Davies Liu] use random port for callback server
d05871e [Davies Liu] remove reuse of PythonRDD
be5e5ff [Davies Liu] merge branch of env, make tests stable.
8071541 [Davies Liu] Merge branch 'env' into streaming
c7bbbce [Davies Liu] fix sphinx docs
6bb9d91 [Davies Liu] Merge branch 'master' of github.com:apache/spark into streaming
4d0ea8b [Davies Liu] clear reference of SparkEnv after stop
54bd92b [Davies Liu] improve tests
c2b31cb [Davies Liu] Merge branch 'master' of github.com:apache/spark into streaming
7a88f9f [Davies Liu] rollback RDD.setContext(), use textFileStream() to test checkpointing
bd8a4c2 [Davies Liu] fix scala style
7797c70 [Davies Liu] refactor
ff88bec [Davies Liu] rename RDDFunction to TransformFunction
d328aca [Davies Liu] fix serializer in queueStream
6f0da2f [Davies Liu] recover from checkpoint
fa7261b [Davies Liu] refactor
a13ff34 [Davies Liu] address comments
8466916 [Davies Liu] support checkpoint
9a16bd1 [Davies Liu] change number of partitions during tests
b98d63f [Davies Liu] change private[spark] to private[python]
eed6e2a [Davies Liu] rollback not needed changes
e00136b [Davies Liu] address comments
069a94c [Davies Liu] fix the number of partitions during window()
338580a [Davies Liu] change _first(), _take(), _collect() as private API
19797f9 [Davies Liu] clean up
6ebceca [Davies Liu] add more tests
c40c52d [Davies Liu] change first(), take(n) to has the same behavior as RDD
98ac6c2 [Davies Liu] support ssc.transform()
b983f0f [Davies Liu] address comments
847f9b9 [Davies Liu] add more docs, add first(), take()
e059ca2 [Davies Liu] move check of window into Python
fce0ef5 [Davies Liu] rafactor of foreachRDD()
7001b51 [Davies Liu] refactor of queueStream()
26ea396 [Davies Liu] refactor
74df565 [Davies Liu] fix print and docs
b32774c [Davies Liu] move java_import into streaming
604323f [Davies Liu] enable streaming tests
c499ba0 [Davies Liu] remove Time and Duration
3f0fb4b [Davies Liu] refactor fix tests
c28f520 [Davies Liu] support updateStateByKey
d357b70 [Davies Liu] support windowed dstream
bd13026 [Davies Liu] fix examples
eec401e [Davies Liu] refactor, combine TransformedRDD, fix reuse PythonRDD, fix union
9a57685 [Davies Liu] fix python style
bd27874 [Davies Liu] fix scala style
7339be0 [Davies Liu] delete tests
7f53086 [Davies Liu] support transform(), refactor and cleanup
df098fc [Davies Liu] Merge branch 'master' into giwa
550dfd9 [giwa] WIP fixing 1.1 merge
5cdb6fa [giwa] changed for SCCallSiteSync
e685853 [giwa] meged with rebased 1.1 branch
2d32a74 [giwa] added some StreamingContextTestSuite
4a59e1e [giwa] WIP:added more test for StreamingContext
8ffdbf1 [giwa] added atexit to handle callback server
d5f5fcb [giwa] added comment for StreamingContext.sparkContext
63c881a [giwa] added StreamingContext.sparkContext
d39f102 [giwa] added StreamingContext.remember
d542743 [giwa] clean up code
2fdf0de [Matthew Farrellee] Fix scalastyle errors
c0a06bc [giwa] delete not implemented functions
f385976 [giwa] delete inproper comments
b0f2015 [giwa] added comment in dstream._test_output
bebb3f3 [giwa] remove the last brank line
fbed8da [giwa] revert pom.xml
8ed93af [giwa] fixed explanaiton
066ba90 [giwa] revert pom.xml
fa4af88 [giwa] remove duplicated import
6ae3caa [giwa] revert pom.xml
7dc7391 [giwa] fixed typo
62dc7a3 [giwa] clean up exmples
f04882c [giwa] clen up examples
b171ec3 [giwa] fixed pep8 violation
f198d14 [giwa] clean up code
3166d31 [giwa] clean up
c00e091 [giwa] change test case not to use awaitTermination
e80647e [giwa] adopted the latest compression way of python command
58e41ff [giwa] merge with master
455e5af [giwa] removed wasted print in DStream
af336b7 [giwa] add comments
ddd4ee1 [giwa] added TODO coments
99ce042 [giwa] added saveAsTextFiles and saveAsPickledFiles
2a06cdb [giwa] remove waste duplicated code
c5ecfc1 [giwa] basic function test cases are passed
8dcda84 [giwa] all tests are passed if numSlice is 2 and the numver of each input is over 4
795b2cd [giwa] broke something
1e126bf [giwa] WIP: solved partitioned and None is not recognized
f67cf57 [giwa] added mapValues and flatMapVaules WIP for glom and mapPartitions test
953deb0 [giwa] edited the comment to add more precise description
af610d3 [giwa] removed unnesessary changes
c1d546e [giwa] fixed PEP-008 violation
99410be [giwa] delete waste file
b3b0362 [giwa] added basic operation test cases
9cde7c9 [giwa] WIP added test case
bd3ba53 [giwa] WIP
5c04a5f [giwa] WIP: added PythonTestInputStream
019ef38 [giwa] WIP
1934726 [giwa] update comment
376e3ac [giwa] WIP
932372a [giwa] clean up dstream.py
0b09cff [giwa] added stop in StreamingContext
92e333e [giwa] implemented reduce and count function in Dstream
1b83354 [giwa] Removed the waste line
88f7506 [Ken Takagiwa] Kill py4j callback server properly
54b5358 [Ken Takagiwa] tried to restart callback server
4f07163 [Tathagata Das] Implemented DStream.foreachRDD in the Python API using Py4J callback server.
fe02547 [Ken Takagiwa] remove waste file
2ad7bd3 [Ken Takagiwa] clean up codes
6197a11 [Ken Takagiwa] clean up code
eb4bf48 [Ken Takagiwa] fix map function
98c2a00 [Ken Takagiwa] added count operation but this implementation need double check
58591d2 [Ken Takagiwa] reduceByKey is working
0df7111 [Ken Takagiwa] delete old file
f485b1d [Ken Takagiwa] fied input of socketTextDStream
dd6de81 [Ken Takagiwa] initial commit for socketTextStream
247fd74 [Ken Takagiwa] modified the code base on comment in https://github.com/tdas/spark/pull/10
4bcb318 [Ken Takagiwa] implementing transform function in Python
38adf95 [Ken Takagiwa] added reducedByKey not working yet
66fcfff [Ken Takagiwa] modify dstream.py to fix indent error
41886c2 [Ken Takagiwa] comment PythonDStream.PairwiseDStream
0b99bec [Ken] initial commit for pySparkStreaming
c214199 [giwa] added testcase for combineByKey
5625bdc [giwa] added gorupByKey testcase
10ab87b [giwa] added sparkContext as input parameter in StreamingContext
10b5b04 [giwa] removed wasted print in DStream
e54f986 [giwa] add comments
16aa64f [giwa] added TODO coments
74535d4 [giwa] added saveAsTextFiles and saveAsPickledFiles
f76c182 [giwa] remove waste duplicated code
18c8723 [giwa] modified streaming test case to add coment
13fb44c [giwa] basic function test cases are passed
3000b2b [giwa] all tests are passed if numSlice is 2 and the numver of each input is over 4
ff14070 [giwa] broke something
bcdec33 [giwa] WIP: solved partitioned and None is not recognized
270a9e1 [giwa] added mapValues and flatMapVaules WIP for glom and mapPartitions test
bb10956 [giwa] edited the comment to add more precise description
253a863 [giwa] removed unnesessary changes
3d37822 [giwa] fixed PEP-008 violation
f21cab3 [giwa] delete waste file
878bad7 [giwa] added basic operation test cases
ce2acd2 [giwa] WIP added test case
9ad6855 [giwa] WIP
1df77f5 [giwa] WIP: added PythonTestInputStream
1523b66 [giwa] WIP
8a0fbbc [giwa] update comment
fe648e3 [giwa] WIP
29c2bc5 [giwa] initial commit for testcase
4d40d63 [giwa] clean up dstream.py
c462bb3 [giwa] added stop in StreamingContext
d2c01ba [giwa] clean up examples
3c45cd2 [giwa] implemented reduce and count function in Dstream
b349649 [giwa] Removed the waste line
3b498e1 [Ken Takagiwa] Kill py4j callback server properly
84a9668 [Ken Takagiwa] tried to restart callback server
9ab8952 [Tathagata Das] Added extra line.
05e991b [Tathagata Das] Added missing file
b1d2a30 [Tathagata Das] Implemented DStream.foreachRDD in the Python API using Py4J callback server.
678e854 [Ken Takagiwa] remove waste file
0a8bbbb [Ken Takagiwa] clean up codes
bab31c1 [Ken Takagiwa] clean up code
72b9738 [Ken Takagiwa] fix map function
d3ee86a [Ken Takagiwa] added count operation but this implementation need double check
15feea9 [Ken Takagiwa] edit python sparkstreaming example
6f98e50 [Ken Takagiwa] reduceByKey is working
c455c8d [Ken Takagiwa] added reducedByKey not working yet
dc6995d [Ken Takagiwa] delete old file
b31446a [Ken Takagiwa] fixed typo of network_workdcount.py
ccfd214 [Ken Takagiwa] added doctest for pyspark.streaming.duration
0d1b954 [Ken Takagiwa] fied input of socketTextDStream
f746109 [Ken Takagiwa] initial commit for socketTextStream
bb7ccf3 [Ken Takagiwa] remove unused import in python
224fc5e [Ken Takagiwa] add empty line
d2099d8 [Ken Takagiwa] sorted the import following Spark coding convention
5bac7ec [Ken Takagiwa] revert streaming/pom.xml
e1df940 [Ken Takagiwa] revert pom.xml
494cae5 [Ken Takagiwa] remove not implemented DStream functions in python
17a74c6 [Ken Takagiwa] modified the code base on comment in https://github.com/tdas/spark/pull/10
1a0f065 [Ken Takagiwa] implementing transform function in Python
d7b4d6f [Ken Takagiwa] added reducedByKey not working yet
87438e2 [Ken Takagiwa] modify dstream.py to fix indent error
b406252 [Ken Takagiwa] comment PythonDStream.PairwiseDStream
454981d [Ken] initial commit for pySparkStreaming
150b94c [giwa] added some StreamingContextTestSuite
f7bc8f9 [giwa] WIP:added more test for StreamingContext
ee50c5a [giwa] added atexit to handle callback server
fdc9125 [giwa] added comment for StreamingContext.sparkContext
f5bfb70 [giwa] added StreamingContext.sparkContext
da09768 [giwa] added StreamingContext.remember
d68b568 [giwa] clean up code
4afa390 [giwa] clean up code
1fd6bc7 [Ken Takagiwa] Merge pull request #2 from mattf/giwa-master
d9d59fe [Matthew Farrellee] Fix scalastyle errors
67473a9 [giwa] delete not implemented functions
c97377c [giwa] delete inproper comments
2ea769e [giwa] added comment in dstream._test_output
3b27bd4 [giwa] remove the last brank line
acfcaeb [giwa] revert pom.xml
93f7637 [giwa] fixed explanaiton
50fd6f9 [giwa] revert pom.xml
4f82c89 [giwa] remove duplicated import
9d1de23 [giwa] revert pom.xml
7339df2 [giwa] fixed typo
9c85e48 [giwa] clean up exmples
24f95db [giwa] clen up examples
0d30109 [giwa] fixed pep8 violation
b7dab85 [giwa] improve test case
583e66d [giwa] move tests for streaming inside streaming directory
1d84142 [giwa] remove unimplement test
f0ea311 [giwa] clean up code
171edeb [giwa] clean up
4dedd2d [giwa] change test case not to use awaitTermination
268a6a5 [giwa] Changed awaitTermination not to call awaitTermincation in Scala. Just use time.sleep instread
09a28bf [giwa] improve testcases
58150f5 [giwa] Changed the test case to focus the test operation
199e37f [giwa] adopted the latest compression way of python command
185fdbf [giwa] merge with master
f1798c4 [giwa] merge with master
e70f706 [giwa] added testcase for combineByKey
e162822 [giwa] added gorupByKey testcase
97742fe [giwa] added sparkContext as input parameter in StreamingContext
14d4c0e [giwa] removed wasted print in DStream
6d8190a [giwa] add comments
4aa99e4 [giwa] added TODO coments
e9fab72 [giwa] added saveAsTextFiles and saveAsPickledFiles
94f2b65 [giwa] remove waste duplicated code
580fbc2 [giwa] modified streaming test case to add coment
99e4bb3 [giwa] basic function test cases are passed
7051a84 [giwa] all tests are passed if numSlice is 2 and the numver of each input is over 4
35933e1 [giwa] broke something
9767712 [giwa] WIP: solved partitioned and None is not recognized
4f2d7e6 [giwa] added mapValues and flatMapVaules WIP for glom and mapPartitions test
33c0f94d [giwa] edited the comment to add more precise description
774f18d [giwa] removed unnesessary changes
3a671cc [giwa] remove export PYSPARK_PYTHON in spark submit
8efa266 [giwa] fixed PEP-008 violation
fa75d71 [giwa] delete waste file
7f96294 [giwa] added basic operation test cases
3dda31a [giwa] WIP added test case
1f68b78 [giwa] WIP
c05922c [giwa] WIP: added PythonTestInputStream
1fd12ae [giwa] WIP
c880a33 [giwa] update comment
5d22c92 [giwa] WIP
ea4b06b [giwa] initial commit for testcase
5a9b525 [giwa] clean up dstream.py
79c5809 [giwa] added stop in StreamingContext
189dcea [giwa] clean up examples
b8d7d24 [giwa] implemented reduce and count function in Dstream
b6468e6 [giwa] Removed the waste line
b47b5fd [Ken Takagiwa] Kill py4j callback server properly
19ddcdd [Ken Takagiwa] tried to restart callback server
c9fc124 [Tathagata Das] Added extra line.
4caae3f [Tathagata Das] Added missing file
4eff053 [Tathagata Das] Implemented DStream.foreachRDD in the Python API using Py4J callback server.
5e822d4 [Ken Takagiwa] remove waste file
aeaf8a5 [Ken Takagiwa] clean up codes
9fa249b [Ken Takagiwa] clean up code
05459c6 [Ken Takagiwa] fix map function
a9f4ecb [Ken Takagiwa] added count operation but this implementation need double check
d1ee6ca [Ken Takagiwa] edit python sparkstreaming example
0b8b7d0 [Ken Takagiwa] reduceByKey is working
d25d5cf [Ken Takagiwa] added reducedByKey not working yet
7f7c5d1 [Ken Takagiwa] delete old file
967dc26 [Ken Takagiwa] fixed typo of network_workdcount.py
57fb740 [Ken Takagiwa] added doctest for pyspark.streaming.duration
4b69fb1 [Ken Takagiwa] fied input of socketTextDStream
02f618a [Ken Takagiwa] initial commit for socketTextStream
4ce4058 [Ken Takagiwa] remove unused import in python
856d98e [Ken Takagiwa] add empty line
490e338 [Ken Takagiwa] sorted the import following Spark coding convention
5594bd4 [Ken Takagiwa] revert pom.xml
2adca84 [Ken Takagiwa] remove not implemented DStream functions in python
e551e13 [Ken Takagiwa] add coment for hack why PYSPARK_PYTHON is needed in spark-submit
3758175 [Ken Takagiwa] add coment for hack why PYSPARK_PYTHON is needed in spark-submit
c5518b4 [Ken Takagiwa] modified the code base on comment in https://github.com/tdas/spark/pull/10
dcf243f [Ken Takagiwa] implementing transform function in Python
9af03f4 [Ken Takagiwa] added reducedByKey not working yet
6e0d9c7 [Ken Takagiwa] modify dstream.py to fix indent error
e497b9b [Ken Takagiwa] comment PythonDStream.PairwiseDStream
5c3a683 [Ken] initial commit for pySparkStreaming
665bfdb [giwa] added testcase for combineByKey
a3d2379 [giwa] added gorupByKey testcase
636090a [giwa] added sparkContext as input parameter in StreamingContext
e7ebb08 [giwa] removed wasted print in DStream
d8b593b [giwa] add comments
ea9c873 [giwa] added TODO coments
89ae38a [giwa] added saveAsTextFiles and saveAsPickledFiles
e3033fc [giwa] remove waste duplicated code
a14c7e1 [giwa] modified streaming test case to add coment
536def4 [giwa] basic function test cases are passed
2112638 [giwa] all tests are passed if numSlice is 2 and the numver of each input is over 4
080541a [giwa] broke something
0704b86 [giwa] WIP: solved partitioned and None is not recognized
90a6484 [giwa] added mapValues and flatMapVaules WIP for glom and mapPartitions test
a65f302 [giwa] edited the comment to add more precise description
bdde697 [giwa] removed unnesessary changes
e8c7bfc [giwa] remove export PYSPARK_PYTHON in spark submit
3334169 [giwa] fixed PEP-008 violation
db0a303 [giwa] delete waste file
2cfd3a0 [giwa] added basic operation test cases
90ae568 [giwa] WIP added test case
a120d07 [giwa] WIP
f671cdb [giwa] WIP: added PythonTestInputStream
56fae45 [giwa] WIP
e35e101 [giwa] Merge branch 'master' into testcase
ba5112d [giwa] update comment
28aa56d [giwa] WIP
fb08559 [giwa] initial commit for testcase
a613b85 [giwa] clean up dstream.py
c40c0ef [giwa] added stop in StreamingContext
31e4260 [giwa] clean up examples
d2127d6 [giwa] implemented reduce and count function in Dstream
48f7746 [giwa] Removed the waste line
0f83eaa [Ken Takagiwa] delete py4j 0.8.1
1679808 [Ken Takagiwa] Kill py4j callback server properly
f96cd4e [Ken Takagiwa] tried to restart callback server
fe86198 [Ken Takagiwa] add py4j 0.8.2.1 but server is not launched
1064fe0 [Ken Takagiwa] Merge branch 'master' of https://github.com/giwa/spark
28c6620 [Ken Takagiwa] Implemented DStream.foreachRDD in the Python API using Py4J callback server
85b0fe1 [Ken Takagiwa] Merge pull request #1 from tdas/python-foreach
54e2e8c [Tathagata Das] Added extra line.
e185338 [Tathagata Das] Added missing file
a778d4b [Tathagata Das] Implemented DStream.foreachRDD in the Python API using Py4J callback server.
cc2092b [Ken Takagiwa] remove waste file
d042ac6 [Ken Takagiwa] clean up codes
84a021f [Ken Takagiwa] clean up code
bd20e17 [Ken Takagiwa] fix map function
d01a125 [Ken Takagiwa] added count operation but this implementation need double check
7d05109 [Ken Takagiwa] merge with remote branch
ae464e0 [Ken Takagiwa] edit python sparkstreaming example
04af046 [Ken Takagiwa] reduceByKey is working
3b6d7b0 [Ken Takagiwa] implementing transform function in Python
571d52d [Ken Takagiwa] added reducedByKey not working yet
5720979 [Ken Takagiwa] delete old file
e604fcb [Ken Takagiwa] fixed typo of network_workdcount.py
4b7c08b [Ken Takagiwa] Merge branch 'master' of https://github.com/giwa/spark
ce7d426 [Ken Takagiwa] added doctest for pyspark.streaming.duration
a8c9fd5 [Ken Takagiwa] fixed for socketTextStream
a61fa9e [Ken Takagiwa] fied input of socketTextDStream
1e84f41 [Ken Takagiwa] initial commit for socketTextStream
6d012f7 [Ken Takagiwa] remove unused import in python
25d30d5 [Ken Takagiwa] add empty line
6e0a64a [Ken Takagiwa] sorted the import following Spark coding convention
fa4a7fc [Ken Takagiwa] revert streaming/pom.xml
8f8202b [Ken Takagiwa] revert streaming pom.xml
c9d79dd [Ken Takagiwa] revert pom.xml
57e3e52 [Ken Takagiwa] remove not implemented DStream functions in python
0a516f5 [Ken Takagiwa] add coment for hack why PYSPARK_PYTHON is needed in spark-submit
a7a0b5c [Ken Takagiwa] add coment for hack why PYSPARK_PYTHON is needed in spark-submit
72bfc66 [Ken Takagiwa] modified the code base on comment in https://github.com/tdas/spark/pull/10
69e9cd3 [Ken Takagiwa] implementing transform function in Python
94a0787 [Ken Takagiwa] added reducedByKey not working yet
88068cf [Ken Takagiwa] modify dstream.py to fix indent error
1367be5 [Ken Takagiwa] comment PythonDStream.PairwiseDStream
eb2b3ba [Ken] Merge remote-tracking branch 'upstream/master'
d8e51f9 [Ken] initial commit for pySparkStreaming
2014-10-12 02:46:56 -07:00
cocoatomo 7a3f589ef8 [SPARK-3909][PySpark][Doc] A corrupted format in Sphinx documents and building warnings
Sphinx documents contains a corrupted ReST format and have some warnings.

The purpose of this issue is same as https://issues.apache.org/jira/browse/SPARK-3773.

commit: 0e8203f4fb

output
```
$ cd ./python/docs
$ make clean html
rm -rf _build/*
sphinx-build -b html -d _build/doctrees   . _build/html
Making output directory...
Running Sphinx v1.2.3
loading pickled environment... not yet created
building [html]: targets for 4 source files that are out of date
updating environment: 4 added, 0 changed, 0 removed
reading sources... [100%] pyspark.sql
/Users/<user>/MyRepos/Scala/spark/python/pyspark/mllib/feature.py:docstring of pyspark.mllib.feature.Word2VecModel.findSynonyms:4: WARNING: Field list ends without a blank line; unexpected unindent.
/Users/<user>/MyRepos/Scala/spark/python/pyspark/mllib/feature.py:docstring of pyspark.mllib.feature.Word2VecModel.transform:3: WARNING: Field list ends without a blank line; unexpected unindent.
/Users/<user>/MyRepos/Scala/spark/python/pyspark/sql.py:docstring of pyspark.sql:4: WARNING: Bullet list ends without a blank line; unexpected unindent.
looking for now-outdated files... none found
pickling environment... done
checking consistency... done
preparing documents... done
writing output... [100%] pyspark.sql
writing additional files... (12 module code pages) _modules/index search
copying static files... WARNING: html_static_path entry u'/Users/<user>/MyRepos/Scala/spark/python/docs/_static' does not exist
done
copying extra files... done
dumping search index... done
dumping object inventory... done
build succeeded, 4 warnings.

Build finished. The HTML pages are in _build/html.
```

Author: cocoatomo <cocoatomo77@gmail.com>

Closes #2766 from cocoatomo/issues/3909-sphinx-build-warnings and squashes the following commits:

2c7faa8 [cocoatomo] [SPARK-3909][PySpark][Doc] A corrupted format in Sphinx documents and building warnings
2014-10-11 11:51:59 -07:00
cocoatomo 81015a2ba4 [SPARK-3867][PySpark] ./python/run-tests failed when it run with Python 2.6 and unittest2 is not installed
./python/run-tests search a Python 2.6 executable on PATH and use it if available.
When using Python 2.6, it is going to import unittest2 module which is not a standard library in Python 2.6, so it fails with ImportError.

Author: cocoatomo <cocoatomo77@gmail.com>

Closes #2759 from cocoatomo/issues/3867-unittest2-import-error and squashes the following commits:

f068eb5 [cocoatomo] [SPARK-3867] ./python/run-tests failed when it run with Python 2.6 and unittest2 is not installed
2014-10-11 11:26:17 -07:00
Davies Liu 72f36ee571 [SPARK-3886] [PySpark] use AutoBatchedSerializer by default
Use AutoBatchedSerializer by default, which will choose the proper batch size based on size of serialized objects, let the size of serialized batch fall in into  [64k - 640k].

In JVM, the serializer will also track the objects in batch to figure out duplicated objects, larger batch may cause OOM in JVM.

Author: Davies Liu <davies.liu@gmail.com>

Closes #2740 from davies/batchsize and squashes the following commits:

52cdb88 [Davies Liu] update docs
185f2b9 [Davies Liu] use AutoBatchedSerializer by default
2014-10-10 14:14:05 -07:00
cocoatomo e7edb723d2 [SPARK-3868][PySpark] Hard to recognize which module is tested from unit-tests.log
./python/run-tests script display messages about which test it is running currently on stdout but not write them on unit-tests.log.
It is harder for us to recognize what test programs were executed and which test was failed.

Author: cocoatomo <cocoatomo77@gmail.com>

Closes #2724 from cocoatomo/issues/3868-display-testing-module-name and squashes the following commits:

c63d9fa [cocoatomo] [SPARK-3868][PySpark] Hard to recognize which module is tested from unit-tests.log
2014-10-09 13:46:26 -07:00
Cheng Lian a42cc08d21 [SPARK-3713][SQL] Uses JSON to serialize DataType objects
This PR uses JSON instead of `toString` to serialize `DataType`s. The latter is not only hard to parse but also flaky in many cases.

Since we already write schema information to Parquet metadata in the old style, we have to reserve the old `DataType` parser and ensure downward compatibility. The old parser is now renamed to `CaseClassStringParser` and moved into `object DataType`.

JoshRosen davies Please help review PySpark related changes, thanks!

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #2563 from liancheng/datatype-to-json and squashes the following commits:

fc92eb3 [Cheng Lian] Reverts debugging code, simplifies primitive type JSON representation
438c75f [Cheng Lian] Refactors PySpark DataType JSON SerDe per comments
6b6387b [Cheng Lian] Removes debugging code
6a3ee3a [Cheng Lian] Addresses per review comments
dc158b5 [Cheng Lian] Addresses PEP8 issues
99ab4ee [Cheng Lian] Adds compatibility est case for Parquet type conversion
a983a6c [Cheng Lian] Adds PySpark support
f608c6e [Cheng Lian] De/serializes DataType objects from/to JSON
2014-10-08 17:04:49 -07:00
Davies Liu 798ed22c28 [SPARK-3412] [PySpark] Replace Epydoc with Sphinx to generate Python API docs
Retire Epydoc, use Sphinx to generate API docs.

Refine Sphinx docs, also convert some docstrings into Sphinx style.

It looks like:
![api doc](https://cloud.githubusercontent.com/assets/40902/4538272/9e2d4f10-4dec-11e4-8d96-6e45a8fe51f9.png)

Author: Davies Liu <davies.liu@gmail.com>

Closes #2689 from davies/docs and squashes the following commits:

bf4a0a5 [Davies Liu] fix links
3fb1572 [Davies Liu] fix _static in jekyll
65a287e [Davies Liu] fix scripts and logo
8524042 [Davies Liu] Merge branch 'master' of github.com:apache/spark into docs
d5b874a [Davies Liu] Merge branch 'master' of github.com:apache/spark into docs
4bc1c3c [Davies Liu] refactor
746d0b6 [Davies Liu] @param -> :param
240b393 [Davies Liu] replace epydoc with sphinx doc
2014-10-07 18:09:27 -07:00
Liquan Pei 098c7344e6 [SPARK-3486][MLlib][PySpark] PySpark support for Word2Vec
mengxr
Added PySpark support for Word2Vec
Change list
(1) PySpark support for Word2Vec
(2) SerDe support of string sequence both on python side and JVM side
(3) Test for SerDe of string sequence on JVM side

Author: Liquan Pei <liquanpei@gmail.com>

Closes #2356 from Ishiihara/Word2Vec-python and squashes the following commits:

476ea34 [Liquan Pei] style fixes
b13a0b9 [Liquan Pei] resolve merge conflicts and minor fixes
8671eba [Liquan Pei] Merge remote-tracking branch 'upstream/master' into Word2Vec-python
daf88a6 [Liquan Pei] modification according to feedback
a73fa19 [Liquan Pei] clean up
3d8007b [Liquan Pei] fix findSynonyms for vector
1bdcd2e [Liquan Pei] minor fixes
cdef9f4 [Liquan Pei] add missing comments
b7447eb [Liquan Pei] modify according to feedback
b9a7383 [Liquan Pei] cache words RDD in fit
89490bf [Liquan Pei] add tests and Word2VecModelWrapper
78bbb53 [Liquan Pei] use pickle for seq string SerDe
a264b08 [Liquan Pei] Merge remote-tracking branch 'upstream/master' into Word2Vec-python
ca1e5ff [Liquan Pei] fix test
68e7276 [Liquan Pei] minor style fixes
48d5e72 [Liquan Pei] Functionality improvement
0ad3ac1 [Liquan Pei] minor fix
c867fdf [Liquan Pei] add Word2Vec to pyspark
2014-10-07 16:43:34 -07:00
cocoatomo 2300eb58ae [SPARK-3773][PySpark][Doc] Sphinx build warning
When building Sphinx documents for PySpark, we have 12 warnings.
Their causes are almost docstrings in broken ReST format.

To reproduce this issue, we should run following commands on the commit: 6e27cb630d.

```bash
$ cd ./python/docs
$ make clean html
...
/Users/<user>/MyRepos/Scala/spark/python/pyspark/__init__.py:docstring of pyspark.SparkContext.sequenceFile:4: ERROR: Unexpected indentation.
/Users/<user>/MyRepos/Scala/spark/python/pyspark/__init__.py:docstring of pyspark.RDD.saveAsSequenceFile:4: ERROR: Unexpected indentation.
/Users/<user>/MyRepos/Scala/spark/python/pyspark/mllib/classification.py:docstring of pyspark.mllib.classification.LogisticRegressionWithSGD.train:14: ERROR: Unexpected indentation.
/Users/<user>/MyRepos/Scala/spark/python/pyspark/mllib/classification.py:docstring of pyspark.mllib.classification.LogisticRegressionWithSGD.train:16: WARNING: Definition list ends without a blank line; unexpected unindent.
/Users/<user>/MyRepos/Scala/spark/python/pyspark/mllib/classification.py:docstring of pyspark.mllib.classification.LogisticRegressionWithSGD.train:17: WARNING: Block quote ends without a blank line; unexpected unindent.
/Users/<user>/MyRepos/Scala/spark/python/pyspark/mllib/classification.py:docstring of pyspark.mllib.classification.SVMWithSGD.train:14: ERROR: Unexpected indentation.
/Users/<user>/MyRepos/Scala/spark/python/pyspark/mllib/classification.py:docstring of pyspark.mllib.classification.SVMWithSGD.train:16: WARNING: Definition list ends without a blank line; unexpected unindent.
/Users/<user>/MyRepos/Scala/spark/python/pyspark/mllib/classification.py:docstring of pyspark.mllib.classification.SVMWithSGD.train:17: WARNING: Block quote ends without a blank line; unexpected unindent.
/Users/<user>/MyRepos/Scala/spark/python/docs/pyspark.mllib.rst:50: WARNING: missing attribute mentioned in :members: or __all__: module pyspark.mllib.regression, attribute RidgeRegressionModelLinearRegressionWithSGD
/Users/<user>/MyRepos/Scala/spark/python/pyspark/mllib/tree.py:docstring of pyspark.mllib.tree.DecisionTreeModel.predict:3: ERROR: Unexpected indentation.
...
checking consistency... /Users/<user>/MyRepos/Scala/spark/python/docs/modules.rst:: WARNING: document isn't included in any toctree
...
copying static files... WARNING: html_static_path entry u'/Users/<user>/MyRepos/Scala/spark/python/docs/_static' does not exist
...
build succeeded, 12 warnings.
```

Author: cocoatomo <cocoatomo77@gmail.com>

Closes #2653 from cocoatomo/issues/3773-sphinx-build-warnings and squashes the following commits:

6f65661 [cocoatomo] [SPARK-3773][PySpark][Doc] Sphinx build warning
2014-10-06 14:08:40 -07:00
Davies Liu 4f01265f7d [SPARK-3786] [PySpark] speedup tests
This patch try to speed up tests of PySpark, re-use the SparkContext in tests.py and mllib/tests.py to reduce the overhead of create SparkContext, remove some test cases, which did not make sense. It also improve the performance of some cases, such as MergerTests and SortTests.

before this patch:

real	21m27.320s
user	4m42.967s
sys	0m17.343s

after this patch:

real	9m47.541s
user	2m12.947s
sys	0m14.543s

It almost cut the time by half.

Author: Davies Liu <davies.liu@gmail.com>

Closes #2646 from davies/tests and squashes the following commits:

c54de60 [Davies Liu] revert change about memory limit
6a2a4b0 [Davies Liu] refactor of tests, speedup 100%
2014-10-06 14:07:53 -07:00
Sandy Ryza 20ea54cc7a [SPARK-2461] [PySpark] Add a toString method to GeneralizedLinearModel
Add a toString method to GeneralizedLinearModel, also change `__str__` to `__repr__` for some classes, to provide better message in repr.

This PR is based on #1388, thanks to sryza!

closes #1388

Author: Sandy Ryza <sandy@cloudera.com>
Author: Davies Liu <davies.liu@gmail.com>

Closes #2625 from davies/string and squashes the following commits:

3544aad [Davies Liu] fix LinearModel
0bcd642 [Davies Liu] Merge branch 'sandy-spark-2461' of github.com:sryza/spark
1ce5c2d [Sandy Ryza] __repr__ back to __str__ in a couple places
aa9e962 [Sandy Ryza] Switch __str__ to __repr__
a0c5041 [Sandy Ryza] Add labels back in
1aa17f5 [Sandy Ryza] Match existing conventions
fac1bc4 [Sandy Ryza] Fix PEP8 error
f7b58ed [Sandy Ryza] SPARK-2461. Add a toString method to GeneralizedLinearModel
2014-10-06 14:05:45 -07:00
Davies Liu abf588f47a [SPARK-3749] [PySpark] fix bugs in broadcast large closure of RDD
1. broadcast is triggle unexpected
2. fd is leaked in JVM (also leak in parallelize())
3. broadcast is not unpersisted in JVM after RDD is not be used any more.

cc JoshRosen , sorry for these stupid bugs.

Author: Davies Liu <davies.liu@gmail.com>

Closes #2603 from davies/fix_broadcast and squashes the following commits:

080a743 [Davies Liu] fix bugs in broadcast large closure of RDD
2014-10-01 11:21:34 -07:00
Joseph K. Bradley 7bf6cc9701 [SPARK-3751] [mllib] DecisionTree: example update + print options
DecisionTreeRunner functionality additions:
* Allow user to pass in a test dataset
* Do not print full model if the model is too large.

As part of this, modify DecisionTreeModel and RandomForestModel to allow printing less info.  Proposed updates:
* toString: prints model summary
* toDebugString: prints full model (named after RDD.toDebugString)

Similar update to Python API:
* __repr__() now prints a model summary
* toDebugString() now prints the full model

CC: mengxr  chouqin manishamde codedeft  Small update (whomever can take a look).  Thanks!

Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>

Closes #2604 from jkbradley/dtrunner-update and squashes the following commits:

b2b3c60 [Joseph K. Bradley] re-added python sql doc test, temporarily removed before
07b1fae [Joseph K. Bradley] repr() now prints a model summary toDebugString() now prints the full model
1d0d93d [Joseph K. Bradley] Updated DT and RF to print less when toString is called. Added toDebugString for verbose printing.
22eac8c [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into dtrunner-update
e007a95 [Joseph K. Bradley] Updated DecisionTreeRunner to accept a test dataset.
2014-10-01 01:03:24 -07:00
Davies Liu c5414b6818 [SPARK-3478] [PySpark] Profile the Python tasks
This patch add profiling support for PySpark, it will show the profiling results
before the driver exits, here is one example:

```
============================================================
Profile of RDD<id=3>
============================================================
         5146507 function calls (5146487 primitive calls) in 71.094 seconds

   Ordered by: internal time, cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
  5144576   68.331    0.000   68.331    0.000 statcounter.py:44(merge)
       20    2.735    0.137   71.071    3.554 statcounter.py:33(__init__)
       20    0.017    0.001    0.017    0.001 {cPickle.dumps}
     1024    0.003    0.000    0.003    0.000 t.py:16(<lambda>)
       20    0.001    0.000    0.001    0.000 {reduce}
       21    0.001    0.000    0.001    0.000 {cPickle.loads}
       20    0.001    0.000    0.001    0.000 copy_reg.py:95(_slotnames)
       41    0.001    0.000    0.001    0.000 serializers.py:461(read_int)
       40    0.001    0.000    0.002    0.000 serializers.py:179(_batched)
       62    0.000    0.000    0.000    0.000 {method 'read' of 'file' objects}
       20    0.000    0.000   71.072    3.554 rdd.py:863(<lambda>)
       20    0.000    0.000    0.001    0.000 serializers.py:198(load_stream)
    40/20    0.000    0.000   71.072    3.554 rdd.py:2093(pipeline_func)
       41    0.000    0.000    0.002    0.000 serializers.py:130(load_stream)
       40    0.000    0.000   71.072    1.777 rdd.py:304(func)
       20    0.000    0.000   71.094    3.555 worker.py:82(process)
```

Also, use can show profile result manually by `sc.show_profiles()` or dump it into disk
by `sc.dump_profiles(path)`, such as

```python
>>> sc._conf.set("spark.python.profile", "true")
>>> rdd = sc.parallelize(range(100)).map(str)
>>> rdd.count()
100
>>> sc.show_profiles()
============================================================
Profile of RDD<id=1>
============================================================
         284 function calls (276 primitive calls) in 0.001 seconds

   Ordered by: internal time, cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        4    0.000    0.000    0.000    0.000 serializers.py:198(load_stream)
        4    0.000    0.000    0.000    0.000 {reduce}
     12/4    0.000    0.000    0.001    0.000 rdd.py:2092(pipeline_func)
        4    0.000    0.000    0.000    0.000 {cPickle.loads}
        4    0.000    0.000    0.000    0.000 {cPickle.dumps}
      104    0.000    0.000    0.000    0.000 rdd.py:852(<genexpr>)
        8    0.000    0.000    0.000    0.000 serializers.py:461(read_int)
       12    0.000    0.000    0.000    0.000 rdd.py:303(func)
```
The profiling is disabled by default, can be enabled by "spark.python.profile=true".

Also, users can dump the results into disks automatically for future analysis, by "spark.python.profile.dump=path_to_dump"

This is bugfix of #2351 cc JoshRosen

Author: Davies Liu <davies.liu@gmail.com>

Closes #2556 from davies/profiler and squashes the following commits:

e68df5a [Davies Liu] Merge branch 'master' of github.com:apache/spark into profiler
858e74c [Davies Liu] compatitable with python 2.6
7ef2aa0 [Davies Liu] bugfix, add tests for show_profiles and dump_profiles()
2b0daf2 [Davies Liu] fix docs
7a56c24 [Davies Liu] bugfix
cba9463 [Davies Liu] move show_profiles and dump_profiles to SparkContext
fb9565b [Davies Liu] Merge branch 'master' of github.com:apache/spark into profiler
116d52a [Davies Liu] Merge branch 'master' of github.com:apache/spark into profiler
09d02c3 [Davies Liu] Merge branch 'master' into profiler
c23865c [Davies Liu] Merge branch 'master' into profiler
15d6f18 [Davies Liu] add docs for two configs
dadee1a [Davies Liu] add docs string and clear profiles after show or dump
4f8309d [Davies Liu] address comment, add tests
0a5b6eb [Davies Liu] fix Python UDF
4b20494 [Davies Liu] add profile for python
2014-09-30 18:24:57 -07:00
Xiangrui Meng d75496b189 [SPARK-3701][MLLIB] update python linalg api and small fixes
1. doc updates
2. simple checks on vector dimensions
3. use column major for matrices

davies jkbradley

Author: Xiangrui Meng <meng@databricks.com>

Closes #2548 from mengxr/mllib-py-clean and squashes the following commits:

6dce2df [Xiangrui Meng] address comments
116b5db [Xiangrui Meng] use np.dot instead of array.dot
75f2fcc [Xiangrui Meng] fix python style
fefce00 [Xiangrui Meng] better check of vector size with more tests
067ef71 [Xiangrui Meng] majored -> major
ef853f9 [Xiangrui Meng] update python linalg api and small fixes
2014-09-30 17:10:36 -07:00
Davies Liu 0d8cdf0ede [SPARK-3681] [SQL] [PySpark] fix serialization of List and Map in SchemaRDD
Currently, the schema of object in ArrayType or MapType is attached lazily, it will have better performance but introduce issues while serialization or accessing nested objects.

This patch will apply schema to the objects of ArrayType or MapType immediately when accessing them, will be a little bit slower, but much robust.

Author: Davies Liu <davies.liu@gmail.com>

Closes #2526 from davies/nested and squashes the following commits:

2399ae5 [Davies Liu] fix serialization of List and Map in SchemaRDD
2014-09-27 12:21:37 -07:00
Josh Rosen f872e4fb80 Revert "[SPARK-3478] [PySpark] Profile the Python tasks"
This reverts commit 1aa549ba98.
2014-09-26 14:47:14 -07:00
Davies Liu 1aa549ba98 [SPARK-3478] [PySpark] Profile the Python tasks
This patch add profiling support for PySpark, it will show the profiling results
before the driver exits, here is one example:

```
============================================================
Profile of RDD<id=3>
============================================================
         5146507 function calls (5146487 primitive calls) in 71.094 seconds

   Ordered by: internal time, cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
  5144576   68.331    0.000   68.331    0.000 statcounter.py:44(merge)
       20    2.735    0.137   71.071    3.554 statcounter.py:33(__init__)
       20    0.017    0.001    0.017    0.001 {cPickle.dumps}
     1024    0.003    0.000    0.003    0.000 t.py:16(<lambda>)
       20    0.001    0.000    0.001    0.000 {reduce}
       21    0.001    0.000    0.001    0.000 {cPickle.loads}
       20    0.001    0.000    0.001    0.000 copy_reg.py:95(_slotnames)
       41    0.001    0.000    0.001    0.000 serializers.py:461(read_int)
       40    0.001    0.000    0.002    0.000 serializers.py:179(_batched)
       62    0.000    0.000    0.000    0.000 {method 'read' of 'file' objects}
       20    0.000    0.000   71.072    3.554 rdd.py:863(<lambda>)
       20    0.000    0.000    0.001    0.000 serializers.py:198(load_stream)
    40/20    0.000    0.000   71.072    3.554 rdd.py:2093(pipeline_func)
       41    0.000    0.000    0.002    0.000 serializers.py:130(load_stream)
       40    0.000    0.000   71.072    1.777 rdd.py:304(func)
       20    0.000    0.000   71.094    3.555 worker.py:82(process)
```

Also, use can show profile result manually by `sc.show_profiles()` or dump it into disk
by `sc.dump_profiles(path)`, such as

```python
>>> sc._conf.set("spark.python.profile", "true")
>>> rdd = sc.parallelize(range(100)).map(str)
>>> rdd.count()
100
>>> sc.show_profiles()
============================================================
Profile of RDD<id=1>
============================================================
         284 function calls (276 primitive calls) in 0.001 seconds

   Ordered by: internal time, cumulative time

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        4    0.000    0.000    0.000    0.000 serializers.py:198(load_stream)
        4    0.000    0.000    0.000    0.000 {reduce}
     12/4    0.000    0.000    0.001    0.000 rdd.py:2092(pipeline_func)
        4    0.000    0.000    0.000    0.000 {cPickle.loads}
        4    0.000    0.000    0.000    0.000 {cPickle.dumps}
      104    0.000    0.000    0.000    0.000 rdd.py:852(<genexpr>)
        8    0.000    0.000    0.000    0.000 serializers.py:461(read_int)
       12    0.000    0.000    0.000    0.000 rdd.py:303(func)
```
The profiling is disabled by default, can be enabled by "spark.python.profile=true".

Also, users can dump the results into disks automatically for future analysis, by "spark.python.profile.dump=path_to_dump"

Author: Davies Liu <davies.liu@gmail.com>

Closes #2351 from davies/profiler and squashes the following commits:

7ef2aa0 [Davies Liu] bugfix, add tests for show_profiles and dump_profiles()
2b0daf2 [Davies Liu] fix docs
7a56c24 [Davies Liu] bugfix
cba9463 [Davies Liu] move show_profiles and dump_profiles to SparkContext
fb9565b [Davies Liu] Merge branch 'master' of github.com:apache/spark into profiler
116d52a [Davies Liu] Merge branch 'master' of github.com:apache/spark into profiler
09d02c3 [Davies Liu] Merge branch 'master' into profiler
c23865c [Davies Liu] Merge branch 'master' into profiler
15d6f18 [Davies Liu] add docs for two configs
dadee1a [Davies Liu] add docs string and clear profiles after show or dump
4f8309d [Davies Liu] address comment, add tests
0a5b6eb [Davies Liu] fix Python UDF
4b20494 [Davies Liu] add profile for python
2014-09-26 09:27:42 -07:00
Aaron Staple 8ca4ecb6a5 [SPARK-546] Add full outer join to RDD and DStream.
leftOuterJoin and rightOuterJoin are already implemented.  This patch adds fullOuterJoin.

Author: Aaron Staple <aaron.staple@gmail.com>

Closes #1395 from staple/SPARK-546 and squashes the following commits:

1f5595c [Aaron Staple] Fix python style
7ac0aa9 [Aaron Staple] [SPARK-546] Add full outer join to RDD and DStream.
3b5d137 [Aaron Staple] In JavaPairDStream, make class tag specification in rightOuterJoin consistent with other functions.
31f2956 [Aaron Staple] Fix left outer join documentation comments.
2014-09-24 20:39:09 -07:00
Davies Liu bb96012b73 [SPARK-3679] [PySpark] pickle the exact globals of functions
function.func_code.co_names has all the names used in the function, including name of attributes. It will pickle some unnecessary globals if there is a global having the same name with attribute (in co_names).

There is a regression introduced by #2144, revert part of changes in that PR.

cc JoshRosen

Author: Davies Liu <davies.liu@gmail.com>

Closes #2522 from davies/globals and squashes the following commits:

dfbccf5 [Davies Liu] fix bug while pickle globals of function
2014-09-24 13:00:05 -07:00
Davies Liu c854b9fcb5 [SPARK-3634] [PySpark] User's module should take precedence over system modules
Python modules added through addPyFile should take precedence over system modules.

This patch put the path for user added module in the front of sys.path (just after '').

Author: Davies Liu <davies.liu@gmail.com>

Closes #2492 from davies/path and squashes the following commits:

4a2af78 [Davies Liu] fix tests
f7ff4da [Davies Liu] ad license header
6b0002f [Davies Liu] add tests
c16c392 [Davies Liu] put addPyFile in front of sys.path
2014-09-24 12:10:09 -07:00
Matthew Farrellee 5f8833c672 [PySpark] remove unnecessary use of numSlices from pyspark tests
Author: Matthew Farrellee <matt@redhat.com>

Closes #2467 from mattf/master-pyspark-remove-numslices-from-tests and squashes the following commits:

c49a87b [Matthew Farrellee] [PySpark] remove unnecessary use of numSlices from pyspark tests
2014-09-20 15:09:35 -07:00
Davies Liu a95ad99e31 [SPARK-3592] [SQL] [PySpark] support applySchema to RDD of Row
Fix the issue when applySchema() to an RDD of Row.

Also add type mapping for BinaryType.

Author: Davies Liu <davies.liu@gmail.com>

Closes #2448 from davies/row and squashes the following commits:

dd220cf [Davies Liu] fix test
3f3f188 [Davies Liu] add more test
f559746 [Davies Liu] add tests, fix serialization
9688fd2 [Davies Liu] support applySchema to RDD of Row
2014-09-19 15:33:42 -07:00
Davies Liu fce5e251d6 [SPARK-3491] [MLlib] [PySpark] use pickle to serialize data in MLlib
Currently, we serialize the data between JVM and Python case by case manually, this cannot scale to support so many APIs in MLlib.

This patch will try to address this problem by serialize the data using pickle protocol, using Pyrolite library to serialize/deserialize in JVM. Pickle protocol can be easily extended to support customized class.

All the modules are refactored to use this protocol.

Known issues: There will be some performance regression (both CPU and memory, the serialized data increased)

Author: Davies Liu <davies.liu@gmail.com>

Closes #2378 from davies/pickle_mllib and squashes the following commits:

dffbba2 [Davies Liu] Merge branch 'master' of github.com:apache/spark into pickle_mllib
810f97f [Davies Liu] fix equal of matrix
032cd62 [Davies Liu] add more type check and conversion for user_product
bd738ab [Davies Liu] address comments
e431377 [Davies Liu] fix cache of rdd, refactor
19d0967 [Davies Liu] refactor Picklers
2511e76 [Davies Liu] cleanup
1fccf1a [Davies Liu] address comments
a2cc855 [Davies Liu] fix tests
9ceff73 [Davies Liu] test size of serialized Rating
44e0551 [Davies Liu] fix cache
a379a81 [Davies Liu] fix pickle array in python2.7
df625c7 [Davies Liu] Merge commit '154d141' into pickle_mllib
154d141 [Davies Liu] fix autobatchedpickler
44736d7 [Davies Liu] speed up pickling array in Python 2.7
e1d1bfc [Davies Liu] refactor
708dc02 [Davies Liu] fix tests
9dcfb63 [Davies Liu] fix style
88034f0 [Davies Liu] rafactor, address comments
46a501e [Davies Liu] choose batch size automatically
df19464 [Davies Liu] memorize the module and class name during pickleing
f3506c5 [Davies Liu] Merge branch 'master' into pickle_mllib
722dd96 [Davies Liu] cleanup _common.py
0ee1525 [Davies Liu] remove outdated tests
b02e34f [Davies Liu] remove _common.py
84c721d [Davies Liu] Merge branch 'master' into pickle_mllib
4d7963e [Davies Liu] remove muanlly serialization
6d26b03 [Davies Liu] fix tests
c383544 [Davies Liu] classification
f2a0856 [Davies Liu] mllib/regression
d9f691f [Davies Liu] mllib/util
cccb8b1 [Davies Liu] mllib/tree
8fe166a [Davies Liu] Merge branch 'pickle' into pickle_mllib
aa2287e [Davies Liu] random
f1544c4 [Davies Liu] refactor clustering
52d1350 [Davies Liu] use new protocol in mllib/stat
b30ef35 [Davies Liu] use pickle to serialize data for mllib/recommendation
f44f771 [Davies Liu] enable tests about array
3908f5c [Davies Liu] Merge branch 'master' into pickle
c77c87b [Davies Liu] cleanup debugging code
60e4e2f [Davies Liu] support unpickle array.array for Python 2.6
2014-09-19 15:01:11 -07:00
Davies Liu e77fa81a61 [SPARK-3554] [PySpark] use broadcast automatically for large closure
Py4j can not handle large string efficiently, so we should use broadcast for large closure automatically. (Broadcast use local filesystem to pass through data).

Author: Davies Liu <davies.liu@gmail.com>

Closes #2417 from davies/command and squashes the following commits:

fbf4e97 [Davies Liu] bugfix
aefd508 [Davies Liu] use broadcast automatically for large closure
2014-09-18 18:11:48 -07:00
Davies Liu ec1adecbb7 [SPARK-3430] [PySpark] [Doc] generate PySpark API docs using Sphinx
Using Sphinx to generate API docs for PySpark.

requirement: Sphinx

```
$ cd python/docs/
$ make html
```

The generated API docs will be located at python/docs/_build/html/index.html

It can co-exists with those generated by Epydoc.

This is the first working version, after merging in, then we can continue to improve it and replace the epydoc finally.

Author: Davies Liu <davies.liu@gmail.com>

Closes #2292 from davies/sphinx and squashes the following commits:

425a3b1 [Davies Liu] cleanup
1573298 [Davies Liu] move docs to python/docs/
5fe3903 [Davies Liu] Merge branch 'master' into sphinx
9468ab0 [Davies Liu] fix makefile
b408f38 [Davies Liu] address all comments
e2ccb1b [Davies Liu] update name and version
9081ead [Davies Liu] generate PySpark API docs using Sphinx
2014-09-16 12:51:58 -07:00
Aaron Staple 8e7ae477ba [SPARK-2314][SQL] Override collect and take in python library, and count in java library, with optimized versions.
SchemaRDD overrides RDD functions, including collect, count, and take, with optimized versions making use of the query optimizer.  The java and python interface classes wrapping SchemaRDD need to ensure the optimized versions are called as well.  This patch overrides relevant calls in the python and java interfaces with optimized versions.

Adds a new Row serialization pathway between python and java, based on JList[Array[Byte]] versus the existing RDD[Array[Byte]]. I wasn’t overjoyed about doing this, but I noticed that some QueryPlans implement optimizations in executeCollect(), which outputs an Array[Row] rather than the typical RDD[Row] that can be shipped to python using the existing serialization code. To me it made sense to ship the Array[Row] over to python directly instead of converting it back to an RDD[Row] just for the purpose of sending the Rows to python using the existing serialization code.

Author: Aaron Staple <aaron.staple@gmail.com>

Closes #1592 from staple/SPARK-2314 and squashes the following commits:

89ff550 [Aaron Staple] Merge with master.
6bb7b6c [Aaron Staple] Fix typo.
b56d0ac [Aaron Staple] [SPARK-2314][SQL] Override count in JavaSchemaRDD, forwarding to SchemaRDD's count.
0fc9d40 [Aaron Staple] Fix comment typos.
f03cdfa [Aaron Staple] [SPARK-2314][SQL] Override collect and take in sql.py, forwarding to SchemaRDD's collect.
2014-09-16 11:45:35 -07:00
Matthew Farrellee 9d5fa763d8 [SPARK-3519] add distinct(n) to PySpark
Added missing rdd.distinct(numPartitions) and associated tests

Author: Matthew Farrellee <matt@redhat.com>

Closes #2383 from mattf/SPARK-3519 and squashes the following commits:

30b837a [Matthew Farrellee] Combine test cases to save on JVM startups
6bc4a2c [Matthew Farrellee] [SPARK-3519] add distinct(n) to SchemaRDD in PySpark
7a17f2b [Matthew Farrellee] [SPARK-3519] add distinct(n) to PySpark
2014-09-16 11:39:57 -07:00
Aaron Staple 60050f4288 [SPARK-1087] Move python traceback utilities into new traceback_utils.py file.
Also made some cosmetic cleanups.

Author: Aaron Staple <aaron.staple@gmail.com>

Closes #2385 from staple/SPARK-1087 and squashes the following commits:

7b3bb13 [Aaron Staple] Address review comments, cosmetic cleanups.
10ba6e1 [Aaron Staple] [SPARK-1087] Move python traceback utilities into new traceback_utils.py file.
2014-09-15 19:28:17 -07:00
Davies Liu da33acb8b6 [SPARK-2951] [PySpark] support unpickle array.array for Python 2.6
Pyrolite can not unpickle array.array which pickled by Python 2.6, this patch fix it by extend Pyrolite.

There is a bug in Pyrolite when unpickle array of float/double, this patch workaround it by reverse the endianness for float/double. This workaround should be removed after Pyrolite have a new release to fix this issue.

I had send an PR to Pyrolite to fix it:  https://github.com/irmen/Pyrolite/pull/11

Author: Davies Liu <davies.liu@gmail.com>

Closes #2365 from davies/pickle and squashes the following commits:

f44f771 [Davies Liu] enable tests about array
3908f5c [Davies Liu] Merge branch 'master' into pickle
c77c87b [Davies Liu] cleanup debugging code
60e4e2f [Davies Liu] support unpickle array.array for Python 2.6
2014-09-15 18:57:25 -07:00
qiping.lqp fdb302f49c [SPARK-3516] [mllib] DecisionTree: Add minInstancesPerNode, minInfoGain params to example and Python API
Added minInstancesPerNode, minInfoGain params to:
* DecisionTreeRunner.scala example
* Python API (tree.py)

Also:
* Fixed typo in tree suite test "do not choose split that does not satisfy min instance per node requirements"
* small style fixes

CC: mengxr

Author: qiping.lqp <qiping.lqp@alibaba-inc.com>
Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>
Author: chouqin <liqiping1991@gmail.com>

Closes #2349 from jkbradley/chouqin-dt-preprune and squashes the following commits:

61b2e72 [Joseph K. Bradley] Added max of 10GB for maxMemoryInMB in Strategy.
a95e7c8 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into chouqin-dt-preprune
95c479d [Joseph K. Bradley] * Fixed typo in tree suite test "do not choose split that does not satisfy min instance per node requirements" * small style fixes
e2628b6 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into chouqin-dt-preprune
19b01af [Joseph K. Bradley] Merge remote-tracking branch 'chouqin/dt-preprune' into chouqin-dt-preprune
f1d11d1 [chouqin] fix typo
c7ebaf1 [chouqin] fix typo
39f9b60 [chouqin] change edge `minInstancesPerNode` to 2 and add one more test
c6e2dfc [Joseph K. Bradley] Added minInstancesPerNode and minInfoGain parameters to DecisionTreeRunner.scala and to Python API in tree.py
0278a11 [chouqin] remove `noSplit` and set `Predict` private to tree
d593ec7 [chouqin] fix docs and change minInstancesPerNode to 1
efcc736 [qiping.lqp] fix bug
10b8012 [qiping.lqp] fix style
6728fad [qiping.lqp] minor fix: remove empty lines
bb465ca [qiping.lqp] Merge branch 'master' of https://github.com/apache/spark into dt-preprune
cadd569 [qiping.lqp] add api docs
46b891f [qiping.lqp] fix bug
e72c7e4 [qiping.lqp] add comments
845c6fa [qiping.lqp] fix style
f195e83 [qiping.lqp] fix style
987cbf4 [qiping.lqp] fix bug
ff34845 [qiping.lqp] separate calculation of predict of node from calculation of info gain
ac42378 [qiping.lqp] add min info gain and min instances per node parameters in decision tree
2014-09-15 17:43:26 -07:00
Davies Liu 4e3fbe8cdb [SPARK-3463] [PySpark] aggregate and show spilled bytes in Python
Aggregate the number of bytes spilled into disks during aggregation or sorting, show them in Web UI.

![spilled](https://cloud.githubusercontent.com/assets/40902/4209758/4b995562-386d-11e4-97c1-8e838ee1d4e3.png)

This patch is blocked by SPARK-3465. (It includes a fix for that).

Author: Davies Liu <davies.liu@gmail.com>

Closes #2336 from davies/metrics and squashes the following commits:

e37df38 [Davies Liu] remove outdated comments
1245eb7 [Davies Liu] remove the temporary fix
ebd2f43 [Davies Liu] Merge branch 'master' into metrics
7e4ad04 [Davies Liu] Merge branch 'master' into metrics
fbe9029 [Davies Liu] show spilled bytes in Python in web ui
2014-09-13 22:31:21 -07:00
Davies Liu 2aea0da84c [SPARK-3030] [PySpark] Reuse Python worker
Reuse Python worker to avoid the overhead of fork() Python process for each tasks. It also tracks the broadcasts for each worker, avoid sending repeated broadcasts.

This can reduce the time for dummy task from 22ms to 13ms (-40%). It can help to reduce the latency for Spark Streaming.

For a job with broadcast (43M after compress):
```
    b = sc.broadcast(set(range(30000000)))
    print sc.parallelize(range(24000), 100).filter(lambda x: x in b.value).count()
```
It will finish in 281s without reused worker, and it will finish in 65s with reused worker(4 CPUs). After reusing the worker, it can save about 9 seconds for transfer and deserialize the broadcast for each tasks.

It's enabled by default, could be disabled by `spark.python.worker.reuse = false`.

Author: Davies Liu <davies.liu@gmail.com>

Closes #2259 from davies/reuse-worker and squashes the following commits:

f11f617 [Davies Liu] Merge branch 'master' into reuse-worker
3939f20 [Davies Liu] fix bug in serializer in mllib
cf1c55e [Davies Liu] address comments
3133a60 [Davies Liu] fix accumulator with reused worker
760ab1f [Davies Liu] do not reuse worker if there are any exceptions
7abb224 [Davies Liu] refactor: sychronized with itself
ac3206e [Davies Liu] renaming
8911f44 [Davies Liu] synchronized getWorkerBroadcasts()
6325fc1 [Davies Liu] bugfix: bid >= 0
e0131a2 [Davies Liu] fix name of config
583716e [Davies Liu] only reuse completed and not interrupted worker
ace2917 [Davies Liu] kill python worker after timeout
6123d0f [Davies Liu] track broadcasts for each worker
8d2f08c [Davies Liu] reuse python worker
2014-09-13 16:22:04 -07:00
Davies Liu 885d1621bc [SPARK-3500] [SQL] use JavaSchemaRDD as SchemaRDD._jschema_rdd
Currently, SchemaRDD._jschema_rdd is SchemaRDD, the Scala API (coalesce(), repartition()) can not been called in Python easily, there is no way to specify the implicit parameter `ord`. The _jrdd is an JavaRDD, so _jschema_rdd should also be JavaSchemaRDD.

In this patch, change _schema_rdd to JavaSchemaRDD, also added an assert for it. If some methods are missing from JavaSchemaRDD, then it's called by _schema_rdd.baseSchemaRDD().xxx().

BTW, Do we need JavaSQLContext?

Author: Davies Liu <davies.liu@gmail.com>

Closes #2369 from davies/fix_schemardd and squashes the following commits:

abee159 [Davies Liu] use JavaSchemaRDD as SchemaRDD._jschema_rdd
2014-09-12 19:05:39 -07:00
Davies Liu 71af030b46 [SPARK-3094] [PySpark] compatitable with PyPy
After this patch, we can run PySpark in PyPy (testing with PyPy 2.3.1 in Mac 10.9), for example:

```
PYSPARK_PYTHON=pypy ./bin/spark-submit wordcount.py
```

The performance speed up will depend on work load (from 20% to 3000%). Here are some benchmarks:

 Job | CPython 2.7 | PyPy 2.3.1  | Speed up
 ------- | ------------ | ------------- | -------
 Word Count | 41s   | 15s  | 2.7x
 Sort | 46s |  44s | 1.05x
 Stats | 174s | 3.6s | 48x

Here is the code used for benchmark:

```python
rdd = sc.textFile("text")
def wordcount():
    rdd.flatMap(lambda x:x.split('/'))\
        .map(lambda x:(x,1)).reduceByKey(lambda x,y:x+y).collectAsMap()
def sort():
    rdd.sortBy(lambda x:x, 1).count()
def stats():
    sc.parallelize(range(1024), 20).flatMap(lambda x: xrange(5024)).stats()
```

Author: Davies Liu <davies.liu@gmail.com>

Closes #2144 from davies/pypy and squashes the following commits:

9aed6c5 [Davies Liu] use protocol 2 in CloudPickle
4bc1f04 [Davies Liu] refactor
b20ab3a [Davies Liu] pickle sys.stdout and stderr in portable way
3ca2351 [Davies Liu] Merge branch 'master' into pypy
fae8b19 [Davies Liu] improve attrgetter, add tests
591f830 [Davies Liu] try to run tests with PyPy in run-tests
c8d62ba [Davies Liu] cleanup
f651fd0 [Davies Liu] fix tests using array with PyPy
1b98fb3 [Davies Liu] serialize itemgetter/attrgetter in portable ways
3c1dbfe [Davies Liu] Merge branch 'master' into pypy
42fb5fa [Davies Liu] Merge branch 'master' into pypy
cb2d724 [Davies Liu] fix tests
9986692 [Davies Liu] Merge branch 'master' into pypy
25b4ca7 [Davies Liu] support PyPy
2014-09-12 18:42:50 -07:00
RJ Nowling 533377621f [PySpark] Add blank line so that Python RDD.top() docstring renders correctly
Author: RJ Nowling <rnowling@gmail.com>

Closes #2370 from rnowling/python_rdd_docstrings and squashes the following commits:

5230574 [RJ Nowling] Add blank line so that Python RDD.top() docstring renders correctly
2014-09-12 09:46:21 -07:00
Davies Liu 1ef656ea85 [SPARK-3047] [PySpark] add an option to use str in textFileRDD
str is much efficient than unicode (both CPU and memory), it'e better to use str in textFileRDD. In order to keep compatibility, use unicode by default. (Maybe change it in the future).

use_unicode=True:

daviesliudm:~/work/spark$ time python wc.py
(u'./universe/spark/sql/core/target/java/org/apache/spark/sql/execution/ExplainCommand$.java', 7776)

real	2m8.298s
user	0m0.185s
sys	0m0.064s

use_unicode=False

daviesliudm:~/work/spark$ time python wc.py
('./universe/spark/sql/core/target/java/org/apache/spark/sql/execution/ExplainCommand$.java', 7776)

real	1m26.402s
user	0m0.182s
sys	0m0.062s

We can see that it got 32% improvement!

Author: Davies Liu <davies.liu@gmail.com>

Closes #1951 from davies/unicode and squashes the following commits:

8352d57 [Davies Liu] update version number
a286f2f [Davies Liu] rollback loads()
85246e5 [Davies Liu] add docs for use_unicode
a0295e1 [Davies Liu] add an option to use str in textFile()
2014-09-11 11:50:36 -07:00
Matthew Farrellee 25b5b867d5 [SPARK-3458] enable python "with" statements for SparkContext
allow for best practice code,

```
try:
  sc = SparkContext()
  app(sc)
finally:
  sc.stop()
```

to be written using a "with" statement,

```
with SparkContext() as sc:
  app(sc)
```

Author: Matthew Farrellee <matt@redhat.com>

Closes #2335 from mattf/SPARK-3458 and squashes the following commits:

5b4e37c [Matthew Farrellee] [SPARK-3458] enable python "with" statements for SparkContext
2014-09-09 18:54:54 -07:00
Xiangrui Meng 50a4fa774a [SPARK-3443][MLLIB] update default values of tree:
Adjust the default values of decision tree, based on the memory requirement discussed in https://github.com/apache/spark/pull/2125 :

1. maxMemoryInMB: 128 -> 256
2. maxBins: 100 -> 32
3. maxDepth: 4 -> 5 (in some example code)

jkbradley

Author: Xiangrui Meng <meng@databricks.com>

Closes #2322 from mengxr/tree-defaults and squashes the following commits:

cda453a [Xiangrui Meng] fix tests
5900445 [Xiangrui Meng] update comments
8c81831 [Xiangrui Meng] update default values of tree:
2014-09-08 18:59:57 -07:00
Matthew Rocklin 939a322c85 [SPARK-3417] Use new-style classes in PySpark
Tiny PR making SQLContext a new-style class.  This allows various type logic to work more effectively

```Python
In [1]: import pyspark

In [2]: pyspark.sql.SQLContext.mro()
Out[2]: [pyspark.sql.SQLContext, object]
```

Author: Matthew Rocklin <mrocklin@gmail.com>

Closes #2288 from mrocklin/sqlcontext-new-style-class and squashes the following commits:

4aadab6 [Matthew Rocklin] update other old-style classes
a2dc02f [Matthew Rocklin] pyspark.sql.SQLContext is new-style class
2014-09-08 15:45:36 -07:00
Matthew Farrellee 386bc24ebe Provide a default PYSPARK_PYTHON for python/run_tests
Without this the version of python used in the test is not
recorded. The error is,

   Testing with Python version:
   ./run-tests: line 57: --version: command not found

Author: Matthew Farrellee <matt@redhat.com>

Closes #2300 from mattf/master-fix-python-run-tests and squashes the following commits:

65a09f5 [Matthew Farrellee] Provide a default PYSPARK_PYTHON for python/run_tests
2014-09-08 12:37:52 -07:00
Sandy Ryza 16a73c2473 SPARK-2978. Transformation with MR shuffle semantics
I didn't add this to the transformations list in the docs because it's kind of obscure, but would be happy to do so if others think it would be helpful.

Author: Sandy Ryza <sandy@cloudera.com>

Closes #2274 from sryza/sandy-spark-2978 and squashes the following commits:

4a5332a [Sandy Ryza] Fix Java test
c04b447 [Sandy Ryza] Fix Python doc and add back deleted code
433ad5b [Sandy Ryza] Add Java test
4c25a54 [Sandy Ryza] Add s at the end and a couple other fixes
9b0ba99 [Sandy Ryza] Fix compilation
36e0571 [Sandy Ryza] Fix import ordering
48c12c2 [Sandy Ryza] Add Java version and additional doc
e5381cd [Sandy Ryza] Fix python style warnings
f147634 [Sandy Ryza] SPARK-2978. Transformation with MR shuffle semantics
2014-09-08 11:20:00 -07:00
Prashant Sharma e16a8e7db5 SPARK-3337 Paranoid quoting in shell to allow install dirs with spaces within.
...

Tested ! TBH, it isn't a great idea to have directory with spaces within. Because emacs doesn't like it then hadoop doesn't like it. and so on...

Author: Prashant Sharma <prashant.s@imaginea.com>

Closes #2229 from ScrapCodes/SPARK-3337/quoting-shell-scripts and squashes the following commits:

d4ad660 [Prashant Sharma] SPARK-3337 Paranoid quoting in shell to allow install dirs with spaces within.
2014-09-08 10:24:15 -07:00
Ward Viaene ecfa76cdfe [SPARK-3415] [PySpark] removes SerializingAdapter code
This code removes the SerializingAdapter code that was copied from PiCloud

Author: Ward Viaene <ward.viaene@bigdatapartnership.com>

Closes #2287 from wardviaene/feature/pythonsys and squashes the following commits:

5f0d426 [Ward Viaene] SPARK-3415: modified test class to do dump and load
5f5d559 [Ward Viaene] SPARK-3415: modified test class name and call cloudpickle.dumps instead using StringIO
afc4a9a [Ward Viaene] SPARK-3415: added newlines to pass lint
aaf10b7 [Ward Viaene] SPARK-3415: removed references to SerializingAdapter and rewrote test
65ffeff [Ward Viaene] removed duplicate test
a958866 [Ward Viaene] SPARK-3415: test script
e263bf5 [Ward Viaene] SPARK-3415: removes legacy SerializingAdapter code
2014-09-07 18:54:36 -07:00
Davies Liu 110fb8b24d [SPARK-2334] fix AttributeError when call PipelineRDD.id()
The underline JavaRDD for PipelineRDD is created lazily, it's delayed until call _jrdd.

The id of JavaRDD is cached as `_id`, it saves a RPC call in py4j for later calls.

closes #1276

Author: Davies Liu <davies.liu@gmail.com>

Closes #2296 from davies/id and squashes the following commits:

e197958 [Davies Liu] fix style
9721716 [Davies Liu] fix id of PipelineRDD
2014-09-06 16:12:29 -07:00
GuoQiang Li 21a1e1bb89 [SPARK-3273][SPARK-3301]We should read the version information from the same place
Author: GuoQiang Li <witgo@qq.com>

Closes #2175 from witgo/SPARK-3273 and squashes the following commits:

cf9c65a [GuoQiang Li] We should read the version information from the same place
2a44e2f [GuoQiang Li] The spark version in the welcome message of pyspark is not correct
2014-09-06 15:08:43 -07:00
Holden Karau da35330e83 Spark-3406 add a default storage level to python RDD persist API
Author: Holden Karau <holden@pigscanfly.ca>

Closes #2280 from holdenk/SPARK-3406-Python-RDD-persist-api-does-not-have-default-storage-level and squashes the following commits:

33eaade [Holden Karau] As Josh pointed out, sql also override persist. Make persist behave the same as in the underlying RDD as well
e658227 [Holden Karau] Fix the test I added
e95a6c5 [Holden Karau] The Python persist function did not have a default storageLevel unlike the Scala API. Noticed this issue because we got a bug report back from the book where we had documented it as if it was the same as the Scala API
2014-09-06 14:49:25 -07:00
Andrew Ash ba5bcaddec SPARK-3211 .take() is OOM-prone with empty partitions
Instead of jumping straight from 1 partition to all partitions, do exponential
growth and double the number of partitions to attempt each time instead.

Fix proposed by Paul Nepywoda

Author: Andrew Ash <andrew@andrewash.com>

Closes #2117 from ash211/SPARK-3211 and squashes the following commits:

8b2299a [Andrew Ash] Quadruple instead of double for a minor speedup
e5f7e4d [Andrew Ash] Update comment to better reflect what we're doing
09a27f7 [Andrew Ash] Update PySpark to be less OOM-prone as well
3a156b8 [Andrew Ash] SPARK-3211 .take() is OOM-prone with empty partitions
2014-09-05 18:52:05 -07:00
Kousuke Saruta dc1ba9e9fc [SPARK-3378] [DOCS] Replace the word "SparkSQL" with right word "Spark SQL"
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #2251 from sarutak/SPARK-3378 and squashes the following commits:

0bfe234 [Kousuke Saruta] Merge branch 'master' of git://git.apache.org/spark into SPARK-3378
bb5938f [Kousuke Saruta] Replaced rest of "SparkSQL" with "Spark SQL"
6df66de [Kousuke Saruta] Replaced "SparkSQL" with "Spark SQL"
2014-09-04 15:06:08 -07:00
Kousuke Saruta 4feb46c5fe [SPARK-3401][PySpark] Wrong usage of tee command in python/run-tests
Author: Kousuke Saruta <sarutak@oss.nttdata.co.jp>

Closes #2272 from sarutak/SPARK-3401 and squashes the following commits:

2b35a59 [Kousuke Saruta] Modified wrong usage of tee command in python/run-tests
2014-09-04 10:29:11 -07:00
Matthew Farrellee 7c6e71f05f [SPARK-2435] Add shutdown hook to pyspark
Author: Matthew Farrellee <matt@redhat.com>

Closes #2183 from mattf/SPARK-2435 and squashes the following commits:

ee0ee99 [Matthew Farrellee] [SPARK-2435] Add shutdown hook to pyspark
2014-09-03 19:37:37 -07:00
Davies Liu c5cbc49233 [SPARK-3335] [SQL] [PySpark] support broadcast in Python UDF
After this patch, broadcast can be used in Python UDF.

Author: Davies Liu <davies.liu@gmail.com>

Closes #2243 from davies/udf_broadcast and squashes the following commits:

7b88861 [Davies Liu] support broadcast in UDF
2014-09-03 19:08:39 -07:00
Davies Liu 6481d27425 [SPARK-3309] [PySpark] Put all public API in __all__
Put all public API in __all__, also put them all in pyspark.__init__.py, then we can got all the documents for public API by `pydoc pyspark`. It also can be used by other programs (such as Sphinx or Epydoc) to generate only documents for public APIs.

Author: Davies Liu <davies.liu@gmail.com>

Closes #2205 from davies/public and squashes the following commits:

c6c5567 [Davies Liu] fix message
f7b35be [Davies Liu] put SchemeRDD, Row in pyspark.sql module
7e3016a [Davies Liu] add __all__ in mllib
6281b48 [Davies Liu] fix doc for SchemaRDD
6caab21 [Davies Liu] add public interfaces into pyspark.__init__.py
2014-09-03 11:49:45 -07:00
Davies Liu e2c901b4c7 [SPARK-2871] [PySpark] add countApproxDistinct() API
RDD.countApproxDistinct(relativeSD=0.05):

        :: Experimental ::
        Return approximate number of distinct elements in the RDD.

        The algorithm used is based on streamlib's implementation of
        "HyperLogLog in Practice: Algorithmic Engineering of a State
        of The Art Cardinality Estimation Algorithm", available
        <a href="http://dx.doi.org/10.1145/2452376.2452456">here</a>.

        This support all the types of objects, which is supported by
        Pyrolite, nearly all builtin types.

        param relativeSD Relative accuracy. Smaller values create
                           counters that require more space.
                           It must be greater than 0.000017.

        >>> n = sc.parallelize(range(1000)).map(str).countApproxDistinct()
        >>> 950 < n < 1050
        True
        >>> n = sc.parallelize([i % 20 for i in range(1000)]).countApproxDistinct()
        >>> 18 < n < 22
        True

Author: Davies Liu <davies.liu@gmail.com>

Closes #2142 from davies/countApproxDistinct and squashes the following commits:

e20da47 [Davies Liu] remove the correction in Python
c38c4e4 [Davies Liu] fix doc tests
2ab157c [Davies Liu] fix doc tests
9d2565f [Davies Liu] add commments and link for hash collision correction
d306492 [Davies Liu] change range of hash of tuple to [0, maxint]
ded624f [Davies Liu] calculate hash in Python
4cba98f [Davies Liu] add more tests
a85a8c6 [Davies Liu] Merge branch 'master' into countApproxDistinct
e97e342 [Davies Liu] add countApproxDistinct()
2014-09-02 15:47:47 -07:00
Holden Karau ba78383bac SPARK-3318: Documentation update in addFile on how to use SparkFiles.get
Rather than specifying the path to SparkFiles we need to use the filename.

Author: Holden Karau <holden@pigscanfly.ca>

Closes #2210 from holdenk/SPARK-3318-documentation-for-addfiles-should-say-to-use-file-not-path and squashes the following commits:

a25d27a [Holden Karau] Update the JavaSparkContext addFile method to be clear about using fileName with SparkFiles as well
0ebcb05 [Holden Karau] Documentation update in addFile on how to use SparkFiles.get to specify filename rather than path
2014-08-30 16:58:17 -07:00
Davies Liu e248328b39 [SPARK-3307] [PySpark] Fix doc string of SparkContext.broadcast()
remove invalid docs

Author: Davies Liu <davies.liu@gmail.com>

Closes #2202 from davies/keep and squashes the following commits:

aa3b44f [Davies Liu] remove invalid docs
2014-08-29 11:47:49 -07:00
Davies Liu 4fa2fda88f [SPARK-2871] [PySpark] add RDD.lookup(key)
RDD.lookup(key)

        Return the list of values in the RDD for key `key`. This operation
        is done efficiently if the RDD has a known partitioner by only
        searching the partition that the key maps to.

        >>> l = range(1000)
        >>> rdd = sc.parallelize(zip(l, l), 10)
        >>> rdd.lookup(42)  # slow
        [42]
        >>> sorted = rdd.sortByKey()
        >>> sorted.lookup(42)  # fast
        [42]

It also clean up the code in RDD.py, and fix several bugs (related to preservesPartitioning).

Author: Davies Liu <davies.liu@gmail.com>

Closes #2093 from davies/lookup and squashes the following commits:

1789cd4 [Davies Liu] `f` in foreach could be generator or not.
2871b80 [Davies Liu] Merge branch 'master' into lookup
c6390ea [Davies Liu] address all comments
0f1bce8 [Davies Liu] add test case for lookup()
be0e8ba [Davies Liu] fix preservesPartitioning
eb1305d [Davies Liu] add RDD.lookup(key)
2014-08-27 13:18:33 -07:00
Andrew Or 7557c4cfef [SPARK-3167] Handle special driver configs in Windows
This is an effort to bring the Windows scripts up to speed after recent splashing changes in #1845.

Author: Andrew Or <andrewor14@gmail.com>

Closes #2129 from andrewor14/windows-config and squashes the following commits:

881a8f0 [Andrew Or] Add reference to Windows taskkill
92e6047 [Andrew Or] Update a few comments (minor)
22b1acd [Andrew Or] Fix style again (minor)
afcffea [Andrew Or] Fix style (minor)
72004c2 [Andrew Or] Actually respect --driver-java-options
803218b [Andrew Or] Actually respect SPARK_*_CLASSPATH
eeb34a0 [Andrew Or] Update outdated comment (minor)
35caecc [Andrew Or] In Windows, actually kill Java processes on exit
f97daa2 [Andrew Or] Fix Windows spark shell stdin issue
83ebe60 [Andrew Or] Parse special driver configs in Windows (broken)
2014-08-26 22:52:16 -07:00
Davies Liu f1e71d4c3b [SPARK-3073] [PySpark] use external sort in sortBy() and sortByKey()
Using external sort to support sort large datasets in reduce stage.

Author: Davies Liu <davies.liu@gmail.com>

Closes #1978 from davies/sort and squashes the following commits:

bbcd9ba [Davies Liu] check spilled bytes in tests
b125d2f [Davies Liu] add test for external sort in rdd
eae0176 [Davies Liu] choose different disks from different processes and instances
1f075ed [Davies Liu] Merge branch 'master' into sort
eb53ca6 [Davies Liu] Merge branch 'master' into sort
644abaf [Davies Liu] add license in LICENSE
19f7873 [Davies Liu] improve tests
55602ee [Davies Liu] use external sort in sortBy() and sortByKey()
2014-08-26 16:57:40 -07:00
Takuya UESHIN 98c2bb0bbd [SPARK-2969][SQL] Make ScalaReflection be able to handle ArrayType.containsNull and MapType.valueContainsNull.
Make `ScalaReflection` be able to handle like:

- `Seq[Int]` as `ArrayType(IntegerType, containsNull = false)`
- `Seq[java.lang.Integer]` as `ArrayType(IntegerType, containsNull = true)`
- `Map[Int, Long]` as `MapType(IntegerType, LongType, valueContainsNull = false)`
- `Map[Int, java.lang.Long]` as `MapType(IntegerType, LongType, valueContainsNull = true)`

Author: Takuya UESHIN <ueshin@happy-camper.st>

Closes #1889 from ueshin/issues/SPARK-2969 and squashes the following commits:

24f1c5c [Takuya UESHIN] Change the default value of ArrayType.containsNull to true in Python API.
79f5b65 [Takuya UESHIN] Change the default value of ArrayType.containsNull to true in Java API.
7cd1a7a [Takuya UESHIN] Fix json test failures.
2cfb862 [Takuya UESHIN] Change the default value of ArrayType.containsNull to true.
2f38e61 [Takuya UESHIN] Revert the default value of MapTypes.valueContainsNull.
9fa02f5 [Takuya UESHIN] Fix a test failure.
1a9a96b [Takuya UESHIN] Modify ScalaReflection to handle ArrayType.containsNull and MapType.valueContainsNull.
2014-08-26 13:22:55 -07:00