Commit graph

215 commits

Author SHA1 Message Date
Michael Armbrust 273c2fd08d [SQL] SPARK-1424 Generalize insertIntoTable functions on SchemaRDDs
This makes it possible to create tables and insert into them using the DSL and SQL for the scala and java apis.

Author: Michael Armbrust <michael@databricks.com>

Closes #354 from marmbrus/insertIntoTable and squashes the following commits:

6c6f227 [Michael Armbrust] Create random temporary files in python parquet unit tests.
f5e6d5c [Michael Armbrust] Merge remote-tracking branch 'origin/master' into insertIntoTable
765c506 [Michael Armbrust] Add to JavaAPI.
77b512c [Michael Armbrust] typos.
5c3ef95 [Michael Armbrust] use names for boolean args.
882afdf [Michael Armbrust] Change createTableAs to saveAsTable.  Clean up api annotations.
d07d94b [Michael Armbrust] Add tests, support for creating parquet files and hive tables.
fa3fe81 [Michael Armbrust] Make insertInto available on JavaSchemaRDD as well.  Add createTableAs function.
2014-04-15 20:40:40 -07:00
Matei Zaharia 63ca581d9c [WIP] SPARK-1430: Support sparse data in Python MLlib
This PR adds a SparseVector class in PySpark and updates all the regression, classification and clustering algorithms and models to support sparse data, similar to MLlib. I chose to add this class because SciPy is quite difficult to install in many environments (more so than NumPy), but I plan to add support for SciPy sparse vectors later too, and make the methods work transparently on objects of either type.

On the Scala side, we keep Python sparse vectors sparse and pass them to MLlib. We always return dense vectors from our models.

Some to-do items left:
- [x] Support SciPy's scipy.sparse matrix objects when SciPy is available. We can easily add a function to convert these to our own SparseVector.
- [x] MLlib currently uses a vector with one extra column on the left to represent what we call LabeledPoint in Scala. Do we really want this? It may get annoying once you deal with sparse data since you must add/subtract 1 to each feature index when training. We can remove this API in 1.0 and use tuples for labeling.
- [x] Explain how to use these in the Python MLlib docs.

CC @mengxr, @joshrosen

Author: Matei Zaharia <matei@databricks.com>

Closes #341 from mateiz/py-ml-update and squashes the following commits:

d52e763 [Matei Zaharia] Remove no-longer-needed slice code and handle review comments
ea5a25a [Matei Zaharia] Fix remaining uses of copyto() after merge
b9f97a3 [Matei Zaharia] Fix test
1e1bd0f [Matei Zaharia] Add MLlib logistic regression example in Python
88bc01f [Matei Zaharia] Clean up inheritance of LinearModel in Python, and expose its parametrs
37ab747 [Matei Zaharia] Fix some examples and docs due to changes in MLlib API
da0f27e [Matei Zaharia] Added a MLlib K-means example and updated docs to discuss sparse data
c48e85a [Matei Zaharia] Added some tests for passing lists as input, and added mllib/tests.py to run-tests script.
a07ba10 [Matei Zaharia] Fix some typos and calculation of initial weights
74eefe7 [Matei Zaharia] Added LabeledPoint class in Python
889dde8 [Matei Zaharia] Support scipy.sparse matrices in all our algorithms and models
ab244d1 [Matei Zaharia] Allow SparseVectors to be initialized using a dict
a5d6426 [Matei Zaharia] Add linalg.py to run-tests script
0e7a3d8 [Matei Zaharia] Keep vectors sparse in Java when reading LabeledPoints
eaee759 [Matei Zaharia] Update regression, classification and clustering models for sparse data
2abbb44 [Matei Zaharia] Further work to get linear models working with sparse data
154f45d [Matei Zaharia] Update docs, name some magic values
881fef7 [Matei Zaharia] Added a sparse vector in Python and made Java-Python format more compact
2014-04-15 20:33:24 -07:00
Sandeep df36091799 SPARK-1426: Make MLlib work with NumPy versions older than 1.7
Currently it requires NumPy 1.7 due to using the copyto method (http://docs.scipy.org/doc/numpy/reference/generated/numpy.copyto.html) for extracting data out of an array.
Replace it with a fallback

Author: Sandeep <sandeep@techaddict.me>

Closes #391 from techaddict/1426 and squashes the following commits:

d365962 [Sandeep] SPARK-1426: Make MLlib work with NumPy versions older than 1.7 Currently it requires NumPy 1.7 due to using the copyto method (http://docs.scipy.org/doc/numpy/reference/generated/numpy.copyto.html) for extracting data out of an array. Replace it with a fallback
2014-04-15 00:19:43 -07:00
Ahir Reddy c99bcb7fea SPARK-1374: PySpark API for SparkSQL
An initial API that exposes SparkSQL functionality in PySpark. A PythonRDD composed of dictionaries, with string keys and primitive values (boolean, float, int, long, string) can be converted into a SchemaRDD that supports sql queries.

```
from pyspark.context import SQLContext
sqlCtx = SQLContext(sc)
rdd = sc.parallelize([{"field1" : 1, "field2" : "row1"}, {"field1" : 2, "field2": "row2"}, {"field1" : 3, "field2": "row3"}])
srdd = sqlCtx.applySchema(rdd)
sqlCtx.registerRDDAsTable(srdd, "table1")
srdd2 = sqlCtx.sql("SELECT field1 AS f1, field2 as f2 from table1")
srdd2.collect()
```
The last line yields ```[{"f1" : 1, "f2" : "row1"}, {"f1" : 2, "f2": "row2"}, {"f1" : 3, "f2": "row3"}]```

Author: Ahir Reddy <ahirreddy@gmail.com>
Author: Michael Armbrust <michael@databricks.com>

Closes #363 from ahirreddy/pysql and squashes the following commits:

0294497 [Ahir Reddy] Updated log4j properties to supress Hive Warns
307d6e0 [Ahir Reddy] Style fix
6f7b8f6 [Ahir Reddy] Temporary fix MIMA checker. Since we now assemble Spark jar with Hive, we don't want to check the interfaces of all of our hive dependencies
3ef074a [Ahir Reddy] Updated documentation because classes moved to sql.py
29245bf [Ahir Reddy] Cache underlying SchemaRDD instead of generating and caching PythonRDD
f2312c7 [Ahir Reddy] Moved everything into sql.py
a19afe4 [Ahir Reddy] Doc fixes
6d658ba [Ahir Reddy] Remove the metastore directory created by the HiveContext tests in SparkSQL
521ff6d [Ahir Reddy] Trying to get spark to build with hive
ab95eba [Ahir Reddy] Set SPARK_HIVE=true on jenkins
ded03e7 [Ahir Reddy] Added doc test for HiveContext
22de1d4 [Ahir Reddy] Fixed maven pyrolite dependency
e4da06c [Ahir Reddy] Display message if hive is not built into spark
227a0be [Michael Armbrust] Update API links. Fix Hive example.
58e2aa9 [Michael Armbrust] Build Docs for pyspark SQL Api.  Minor fixes.
4285340 [Michael Armbrust] Fix building of Hive API Docs.
38a92b0 [Michael Armbrust] Add note to future non-python developers about python docs.
337b201 [Ahir Reddy] Changed com.clearspring.analytics stream version from 2.4.0 to 2.5.1 to match SBT build, and added pyrolite to maven build
40491c9 [Ahir Reddy] PR Changes + Method Visibility
1836944 [Michael Armbrust] Fix comments.
e00980f [Michael Armbrust] First draft of python sql programming guide.
b0192d3 [Ahir Reddy] Added Long, Double and Boolean as usable types + unit test
f98a422 [Ahir Reddy] HiveContexts
79621cf [Ahir Reddy] cleaning up cruft
b406ba0 [Ahir Reddy] doctest formatting
20936a5 [Ahir Reddy] Added tests and documentation
e4d21b4 [Ahir Reddy] Added pyrolite dependency
79f739d [Ahir Reddy] added more tests
7515ba0 [Ahir Reddy] added more tests :)
d26ec5e [Ahir Reddy] added test
e9f5b8d [Ahir Reddy] adding tests
906d180 [Ahir Reddy] added todo explaining cost of creating Row object in python
251f99d [Ahir Reddy] for now only allow dictionaries as input
09b9980 [Ahir Reddy] made jrdd explicitly lazy
c608947 [Ahir Reddy] SchemaRDD now has all RDD operations
725c91e [Ahir Reddy] awesome row objects
55d1c76 [Ahir Reddy] return row objects
4fe1319 [Ahir Reddy] output dictionaries correctly
be079de [Ahir Reddy] returning dictionaries works
cd5f79f [Ahir Reddy] Switched to using Scala SQLContext
e948bd9 [Ahir Reddy] yippie
4886052 [Ahir Reddy] even better
c0fb1c6 [Ahir Reddy] more working
043ca85 [Ahir Reddy] working
5496f9f [Ahir Reddy] doesn't crash
b8b904b [Ahir Reddy] Added schema rdd class
67ba875 [Ahir Reddy] java to python, and python to java
bcc0f23 [Ahir Reddy] Java to python
ab6025d [Ahir Reddy] compiling
2014-04-15 00:07:55 -07:00
Ivan Wick 5cd11d51c1 Set spark.executor.uri from environment variable (needed by Mesos)
The Mesos backend uses this property when setting up a slave process.  It is similarly set in the Scala repl (org.apache.spark.repl.SparkILoop), but I couldn't find any analogous for pyspark.

Author: Ivan Wick <ivanwick+github@gmail.com>

This patch had conflicts when merged, resolved by
Committer: Matei Zaharia <matei@databricks.com>

Closes #311 from ivanwick/master and squashes the following commits:

da0c3e4 [Ivan Wick] Set spark.executor.uri from environment variable (needed by Mesos)
2014-04-10 17:49:30 -07:00
Sandeep 3bd312940e SPARK-1428: MLlib should convert non-float64 NumPy arrays to float64 instead of complaining
Author: Sandeep <sandeep@techaddict.me>

Closes #356 from techaddict/1428 and squashes the following commits:

3bdf5f6 [Sandeep] SPARK-1428: MLlib should convert non-float64 NumPy arrays to float64 instead of complaining
2014-04-10 11:17:41 -07:00
Holden Karau ce8ec54561 Spark 1271: Co-Group and Group-By should pass Iterable[X]
Author: Holden Karau <holden@pigscanfly.ca>

Closes #242 from holdenk/spark-1320-cogroupandgroupshouldpassiterator and squashes the following commits:

f289536 [Holden Karau] Fix bad merge, should have been Iterable rather than Iterator
77048f8 [Holden Karau] Fix merge up to master
d3fe909 [Holden Karau] use toSeq instead
7a092a3 [Holden Karau] switch resultitr to resultiterable
eb06216 [Holden Karau] maybe I should have had a coffee first. use correct import for guava iterables
c5075aa [Holden Karau] If guava 14 had iterables
2d06e10 [Holden Karau] Fix Java 8 cogroup tests for the new API
11e730c [Holden Karau] Fix streaming tests
66b583d [Holden Karau] Fix the core test suite to compile
4ed579b [Holden Karau] Refactor from iterator to iterable
d052c07 [Holden Karau] Python tests now pass with iterator pandas
3bcd81d [Holden Karau] Revert "Try and make pickling list iterators work"
cd1e81c [Holden Karau] Try and make pickling list iterators work
c60233a [Holden Karau] Start investigating moving to iterators for python API like the Java/Scala one. tl;dr: We will have to write our own iterator since the default one doesn't pickle well
88a5cef [Holden Karau] Fix cogroup test in JavaAPISuite for streaming
a5ee714 [Holden Karau] oops, was checking wrong iterator
e687f21 [Holden Karau] Fix groupbykey test in JavaAPISuite of streaming
ec8cc3e [Holden Karau] Fix test issues\!
4b0eeb9 [Holden Karau] Switch cast in PairDStreamFunctions
fa395c9 [Holden Karau] Revert "Add a join based on the problem in SVD"
ec99e32 [Holden Karau] Revert "Revert this but for now put things in list pandas"
b692868 [Holden Karau] Revert
7e533f7 [Holden Karau] Fix the bug
8a5153a [Holden Karau] Revert me, but we have some stuff to debug
b4e86a9 [Holden Karau] Add a join based on the problem in SVD
c4510e2 [Holden Karau] Revert this but for now put things in list pandas
b4e0b1d [Holden Karau] Fix style issues
71e8b9f [Holden Karau] I really need to stop calling size on iterators, it is the path of sadness.
b1ae51a [Holden Karau] Fix some of the types in the streaming JavaAPI suite. Probably still needs more work
37888ec [Holden Karau] core/tests now pass
249abde [Holden Karau] org.apache.spark.rdd.PairRDDFunctionsSuite passes
6698186 [Holden Karau] Revert "I think this might be a bad rabbit hole. Started work to make CoGroupedRDD use iterator and then went crazy"
fe992fe [Holden Karau] hmmm try and fix up basic operation suite
172705c [Holden Karau] Fix Java API suite
caafa63 [Holden Karau] I think this might be a bad rabbit hole. Started work to make CoGroupedRDD use iterator and then went crazy
88b3329 [Holden Karau] Fix groupbykey to actually give back an iterator
4991af6 [Holden Karau] Fix some tests
be50246 [Holden Karau] Calling size on an iterator is not so good if we want to use it after
687ffbc [Holden Karau] This is the it compiles point of replacing Seq with Iterator and JList with JIterator in the groupby and cogroup signatures
2014-04-08 18:15:59 -07:00
Aaron Davidson 0307db0f55 SPARK-1099: Introduce local[*] mode to infer number of cores
This is the default mode for running spark-shell and pyspark, intended to allow users running spark for the first time to see the performance benefits of using multiple cores, while not breaking backwards compatibility for users who use "local" mode and expect exactly 1 core.

Author: Aaron Davidson <aaron@databricks.com>

Closes #182 from aarondav/110 and squashes the following commits:

a88294c [Aaron Davidson] Rebased changes for new spark-shell
a9f393e [Aaron Davidson] SPARK-1099: Introduce local[*] mode to infer number of cores
2014-04-07 13:06:30 -07:00
Matei Zaharia 0b85516781 SPARK-1421. Make MLlib work on Python 2.6
The reason it wasn't working was passing a bytearray to stream.write(), which is not supported in Python 2.6 but is in 2.7. (This array came from NumPy when we converted data to send it over to Java). Now we just convert those bytearrays to strings of bytes, which preserves nonprintable characters as well.

Author: Matei Zaharia <matei@databricks.com>

Closes #335 from mateiz/mllib-python-2.6 and squashes the following commits:

f26c59f [Matei Zaharia] Update docs to no longer say we need Python 2.7
a84d6af [Matei Zaharia] SPARK-1421. Make MLlib work on Python 2.6
2014-04-05 20:52:05 -07:00
Haoyuan Li b50ddfde03 SPARK-1305: Support persisting RDD's directly to Tachyon
Move the PR#468 of apache-incubator-spark to the apache-spark
"Adding an option to persist Spark RDD blocks into Tachyon."

Author: Haoyuan Li <haoyuan@cs.berkeley.edu>
Author: RongGu <gurongwalker@gmail.com>

Closes #158 from RongGu/master and squashes the following commits:

72b7768 [Haoyuan Li] merge master
9f7fa1b [Haoyuan Li] fix code style
ae7834b [Haoyuan Li] minor cleanup
a8b3ec6 [Haoyuan Li] merge master branch
e0f4891 [Haoyuan Li] better check offheap.
55b5918 [RongGu] address matei's comment on the replication of offHeap storagelevel
7cd4600 [RongGu] remove some logic code for tachyonstore's replication
51149e7 [RongGu] address aaron's comment on returning value of the remove() function in tachyonstore
8adfcfa [RongGu] address arron's comment on inTachyonSize
120e48a [RongGu] changed the root-level dir name in Tachyon
5cc041c [Haoyuan Li] address aaron's comments
9b97935 [Haoyuan Li] address aaron's comments
d9a6438 [Haoyuan Li] fix for pspark
77d2703 [Haoyuan Li] change python api.git status
3dcace4 [Haoyuan Li] address matei's comments
91fa09d [Haoyuan Li] address patrick's comments
589eafe [Haoyuan Li] use TRY_CACHE instead of MUST_CACHE
64348b2 [Haoyuan Li] update conf docs.
ed73e19 [Haoyuan Li] Merge branch 'master' of github.com:RongGu/spark-1
619a9a8 [RongGu] set number of directories in TachyonStore back to 64; added a TODO tag for duplicated code from the DiskStore
be79d77 [RongGu] find a way to clean up some unnecessay metods and classed to make the code simpler
49cc724 [Haoyuan Li] update docs with off_headp option
4572f9f [RongGu] reserving the old apply function API of StorageLevel
04301d3 [RongGu] rename StorageLevel.TACHYON to Storage.OFF_HEAP
c9aeabf [RongGu] rename the StorgeLevel.TACHYON as StorageLevel.OFF_HEAP
76805aa [RongGu] unifies the config properties name prefix; add the configs into docs/configuration.md
e700d9c [RongGu] add the SparkTachyonHdfsLR example and some comments
fd84156 [RongGu] use randomUUID to generate sparkapp directory name on tachyon;minor code style fix
939e467 [Haoyuan Li] 0.4.1-thrift from maven central
86a2eab [Haoyuan Li] tachyon 0.4.1-thrift is in the staging repo. but jenkins failed to download it. temporarily revert it back to 0.4.1
16c5798 [RongGu] make the dependency on tachyon as tachyon-0.4.1-thrift
eacb2e8 [RongGu] Merge branch 'master' of https://github.com/RongGu/spark-1
bbeb4de [RongGu] fix the JsonProtocolSuite test failure problem
6adb58f [RongGu] Merge branch 'master' of https://github.com/RongGu/spark-1
d827250 [RongGu] fix JsonProtocolSuie test failure
716e93b [Haoyuan Li] revert the version
ca14469 [Haoyuan Li] bump tachyon version to 0.4.1-thrift
2825a13 [RongGu] up-merging to the current master branch of the apache spark
6a22c1a [Haoyuan Li] fix scalastyle
8968b67 [Haoyuan Li] exclude more libraries from tachyon dependency to be the same as referencing tachyon-client.
77be7e8 [RongGu] address mateiz's comment about the temp folder name problem. The implementation followed mateiz's advice.
1dcadf9 [Haoyuan Li] typo
bf278fa [Haoyuan Li] fix python tests
e82909c [Haoyuan Li] minor cleanup
776a56c [Haoyuan Li] address patrick's and ali's comments from the previous PR
8859371 [Haoyuan Li] various minor fixes and clean up
e3ddbba [Haoyuan Li] add doc to use Tachyon cache mode.
fcaeab2 [Haoyuan Li] address Aaron's comment
e554b1e [Haoyuan Li] add python code
47304b3 [Haoyuan Li] make tachyonStore in BlockMananger lazy val; add more comments StorageLevels.
dc8ef24 [Haoyuan Li] add old storelevel constructor
e01a271 [Haoyuan Li] update tachyon 0.4.1
8011a96 [RongGu] fix a brought-in mistake in StorageLevel
70ca182 [RongGu] a bit change in comment
556978b [RongGu] fix the scalastyle errors
791189b [RongGu] "Adding an option to persist Spark RDD blocks into Tachyon." move the PR#468 of apache-incubator-spark to the apache-spark
2014-04-04 20:38:20 -07:00
Matei Zaharia 60e18ce7dd SPARK-1414. Python API for SparkContext.wholeTextFiles
Also clarified comment on each file having to fit in memory

Author: Matei Zaharia <matei@databricks.com>

Closes #327 from mateiz/py-whole-files and squashes the following commits:

9ad64a5 [Matei Zaharia] SPARK-1414. Python API for SparkContext.wholeTextFiles
2014-04-04 17:29:29 -07:00
Prashant Sharma c1ea3afb51 Spark 1162 Implemented takeOrdered in pyspark.
Since python does not have a library for max heap and usual tricks like inverting values etc.. does not work for all cases.

We have our own implementation of max heap.

Author: Prashant Sharma <prashant.s@imaginea.com>

Closes #97 from ScrapCodes/SPARK-1162/pyspark-top-takeOrdered2 and squashes the following commits:

35f86ba [Prashant Sharma] code review
2b1124d [Prashant Sharma] fixed tests
e8a08e2 [Prashant Sharma] Code review comments.
49e6ba7 [Prashant Sharma] SPARK-1162 added takeOrdered to pyspark
2014-04-03 15:42:17 -07:00
Xiangrui Meng 9c65fa76f9 [SPARK-1212, Part II] Support sparse data in MLlib
In PR https://github.com/apache/spark/pull/117, we added dense/sparse vector data model and updated KMeans to support sparse input. This PR is to replace all other `Array[Double]` usage by `Vector` in generalized linear models (GLMs) and Naive Bayes. Major changes:

1. `LabeledPoint` becomes `LabeledPoint(Double, Vector)`.
2. Methods that accept `RDD[Array[Double]]` now accept `RDD[Vector]`. We cannot support both in an elegant way because of type erasure.
3. Mark 'createModel' and 'predictPoint' protected because they are not for end users.
4. Add libSVMFile to MLContext.
5. NaiveBayes can accept arbitrary labels (introducing a breaking change to Python's `NaiveBayesModel`).
6. Gradient computation no longer creates temp vectors.
7. Column normalization and centering are removed from Lasso and Ridge because the operation will densify the data. Simple feature transformation can be done before training.

TODO:
1. ~~Use axpy when possible.~~
2. ~~Optimize Naive Bayes.~~

Author: Xiangrui Meng <meng@databricks.com>

Closes #245 from mengxr/vector and squashes the following commits:

eb6e793 [Xiangrui Meng] move libSVMFile to MLUtils and rename to loadLibSVMData
c26c4fc [Xiangrui Meng] update DecisionTree to use RDD[Vector]
11999c7 [Xiangrui Meng] Merge branch 'master' into vector
f7da54b [Xiangrui Meng] add minSplits to libSVMFile
da25e24 [Xiangrui Meng] revert the change to default addIntercept because it might change the behavior of existing code without warning
493f26f [Xiangrui Meng] Merge branch 'master' into vector
7c1bc01 [Xiangrui Meng] add a TODO to NB
b9b7ef7 [Xiangrui Meng] change default value of addIntercept to false
b01df54 [Xiangrui Meng] allow to change or clear threshold in LR and SVM
4addc50 [Xiangrui Meng] merge master
4ca5b1b [Xiangrui Meng] remove normalization from Lasso and update tests
f04fe8a [Xiangrui Meng] remove normalization from RidgeRegression and update tests
d088552 [Xiangrui Meng] use static constructor for MLContext
6f59eed [Xiangrui Meng] update libSVMFile to determine number of features automatically
3432e84 [Xiangrui Meng] update NaiveBayes to support sparse data
0f8759b [Xiangrui Meng] minor updates to NB
b11659c [Xiangrui Meng] style update
78c4671 [Xiangrui Meng] add libSVMFile to MLContext
f0fe616 [Xiangrui Meng] add a test for sparse linear regression
44733e1 [Xiangrui Meng] use in-place gradient computation
e981396 [Xiangrui Meng] use axpy in Updater
db808a1 [Xiangrui Meng] update JavaLR example
befa592 [Xiangrui Meng] passed scala/java tests
75c83a4 [Xiangrui Meng] passed test compile
1859701 [Xiangrui Meng] passed compile
834ada2 [Xiangrui Meng] optimized MLUtils.computeStats update some ml algorithms to use Vector (cont.)
135ab72 [Xiangrui Meng] merge glm
0e57aa4 [Xiangrui Meng] update Lasso and RidgeRegression to parse the weights correctly from GLM mark createModel protected mark predictPoint protected
d7f629f [Xiangrui Meng] fix a bug in GLM when intercept is not used
3f346ba [Xiangrui Meng] update some ml algorithms to use Vector
2014-04-02 14:01:12 -07:00
Prashant Sharma df1b9f7b1a SPARK-1336 Reducing the output of run-tests script.
Author: Prashant Sharma <prashant.s@imaginea.com>
Author: Prashant Sharma <scrapcodes@gmail.com>

Closes #262 from ScrapCodes/SPARK-1336/ReduceVerbosity and squashes the following commits:

87dfa54 [Prashant Sharma] Further reduction in noise and made pyspark tests to fail fast.
811170f [Prashant Sharma] Reducing the ouput of run-tests script.
2014-03-29 23:03:03 -07:00
Prashant Sharma a0853a39e1 SPARK-1322, top in pyspark should sort result in descending order.
Author: Prashant Sharma <prashant.s@imaginea.com>

Closes #235 from ScrapCodes/SPARK-1322/top-rev-sort and squashes the following commits:

f316266 [Prashant Sharma] Minor change in comment.
58e58c6 [Prashant Sharma] SPARK-1322, top in pyspark should sort result in descending order.
2014-03-26 09:16:37 -07:00
Jyotiska NK 67fa71cba2 Added doctest for map function in rdd.py
Doctest added for map in rdd.py

Author: Jyotiska NK <jyotiska123@gmail.com>

Closes #177 from jyotiska/pyspark_rdd_map_doctest and squashes the following commits:

a38527f [Jyotiska NK] Added doctest for map function in rdd.py
2014-03-19 14:04:45 -07:00
Dan McClary e3681f26fa Spark 1246 add min max to stat counter
Here's the addition of min and max to statscounter.py and min and max methods to rdd.py.

Author: Dan McClary <dan.mcclary@gmail.com>

Closes #144 from dwmclary/SPARK-1246-add-min-max-to-stat-counter and squashes the following commits:

fd3fd4b [Dan McClary] fixed  error, updated test
82cde0e [Dan McClary] flipped incorrectly assigned inf values in StatCounter
5d96799 [Dan McClary] added max and min to StatCounter repr for pyspark
21dd366 [Dan McClary] added max and min to StatCounter output, updated doc
1a97558 [Dan McClary] added max and min to StatCounter output, updated doc
a5c13b0 [Dan McClary] Added min and max to Scala and Java RDD, added min and max to StatCounter
ed67136 [Dan McClary] broke min/max out into separate transaction, added to rdd.py
1e7056d [Dan McClary] added underscore to getBucket
37a7dea [Dan McClary] cleaned up boundaries for histogram -- uses real min/max when buckets are derived
29981f2 [Dan McClary] fixed indentation on doctest comment
eaf89d9 [Dan McClary] added correct doctest for histogram
4916016 [Dan McClary] added histogram method, added max and min to statscounter
2014-03-18 00:45:47 -07:00
CodingCat dc9654638f SPARK-1240: handle the case of empty RDD when takeSample
https://spark-project.atlassian.net/browse/SPARK-1240

It seems that the current implementation does not handle the empty RDD case when run takeSample

In this patch, before calling sample() inside takeSample API, I add a checker for this case and returns an empty Array when it's a empty RDD; also in sample(), I add a checker for the invalid fraction value

In the test case, I also add several lines for this case

Author: CodingCat <zhunansjtu@gmail.com>

Closes #135 from CodingCat/SPARK-1240 and squashes the following commits:

fef57d4 [CodingCat] fix the same problem in PySpark
36db06b [CodingCat] create new test cases for takeSample from an empty red
810948d [CodingCat] further fix
a40e8fb [CodingCat] replace if with require
ad483fd [CodingCat] handle the case with empty RDD when take sample
2014-03-16 22:14:59 -07:00
Prashant Sharma b8afe30520 SPARK-1162 Added top in python.
Author: Prashant Sharma <prashant.s@imaginea.com>

Closes #93 from ScrapCodes/SPARK-1162/pyspark-top-takeOrdered and squashes the following commits:

ece1fa4 [Prashant Sharma] Added top in python.
2014-03-12 15:57:44 -07:00
prabinb af7f2f1090 Spark-1163, Added missing Python RDD functions
Author: prabinb <prabin.banka@imaginea.com>

Closes #92 from prabinb/python-api-rdd and squashes the following commits:

51129ca [prabinb] Added missing Python RDD functions Added __repr__ function to StorageLevel class. Added doctest for RDD.getStorageLevel().
2014-03-11 23:57:05 -07:00
Prashant Sharma a59419c27e SPARK-1168, Added foldByKey to pyspark.
Author: Prashant Sharma <prashant.s@imaginea.com>

Closes #115 from ScrapCodes/SPARK-1168/pyspark-foldByKey and squashes the following commits:

db6f67e [Prashant Sharma] SPARK-1168, Added foldByKey to pyspark.
2014-03-10 13:37:11 -07:00
jyotiska f5518989b6 [SPARK-972] Added detailed callsite info for ValueError in context.py (resubmitted)
Author: jyotiska <jyotiska123@gmail.com>

Closes #34 from jyotiska/pyspark_code and squashes the following commits:

c9439be [jyotiska] replaced dict with namedtuple
a6bf4cd [jyotiska] added callsite info for context.py
2014-03-10 13:34:49 -07:00
Prabin Banka e1e09e0ef6 SPARK-977 Added Python RDD.zip function
was raised earlier as a part of  apache/incubator-spark#486

Author: Prabin Banka <prabin.banka@imaginea.com>

Closes #76 from prabinb/python-api-zip and squashes the following commits:

b1a31a0 [Prabin Banka] Added Python RDD.zip function
2014-03-10 13:27:00 -07:00
Aaron Davidson 52834d761b SPARK-929: Fully deprecate usage of SPARK_MEM
(Continued from old repo, prior discussion at https://github.com/apache/incubator-spark/pull/615)

This patch cements our deprecation of the SPARK_MEM environment variable by replacing it with three more specialized variables:
SPARK_DAEMON_MEMORY, SPARK_EXECUTOR_MEMORY, and SPARK_DRIVER_MEMORY

The creation of the latter two variables means that we can safely set driver/job memory without accidentally setting the executor memory. Neither is public.

SPARK_EXECUTOR_MEMORY is only used by the Mesos scheduler (and set within SparkContext). The proper way of configuring executor memory is through the "spark.executor.memory" property.

SPARK_DRIVER_MEMORY is the new way of specifying the amount of memory run by jobs launched by spark-class, without possibly affecting executor memory.

Other memory considerations:
- The repl's memory can be set through the "--drivermem" command-line option, which really just sets SPARK_DRIVER_MEMORY.
- run-example doesn't use spark-class, so the only way to modify examples' memory is actually an unusual use of SPARK_JAVA_OPTS (which is normally overriden in all cases by spark-class).

This patch also fixes a lurking bug where spark-shell misused spark-class (the first argument is supposed to be the main class name, not java options), as well as a bug in the Windows spark-class2.cmd. I have not yet tested this patch on either Windows or Mesos, however.

Author: Aaron Davidson <aaron@databricks.com>

Closes #99 from aarondav/sparkmem and squashes the following commits:

9df4c68 [Aaron Davidson] SPARK-929: Fully deprecate usage of SPARK_MEM
2014-03-09 11:08:39 -07:00
Prashant Sharma 6e730edcde Spark 1165 rdd.intersection in python and java
Author: Prashant Sharma <prashant.s@imaginea.com>
Author: Prashant Sharma <scrapcodes@gmail.com>

Closes #80 from ScrapCodes/SPARK-1165/RDD.intersection and squashes the following commits:

9b015e9 [Prashant Sharma] Added a note, shuffle is required for intersection.
1fea813 [Prashant Sharma] correct the lines wrapping
d0c71f3 [Prashant Sharma] SPARK-1165 RDD.intersection in java
d6effee [Prashant Sharma] SPARK-1165 Implemented RDD.intersection in python.
2014-03-07 18:48:07 -08:00
Prabin Banka 3d3acef047 SPARK-1187, Added missing Python APIs
The following Python APIs are added,
RDD.id()
SparkContext.setJobGroup()
SparkContext.setLocalProperty()
SparkContext.getLocalProperty()
SparkContext.sparkUser()

was raised earlier as a part of  apache/incubator-spark#486

Author: Prabin Banka <prabin.banka@imaginea.com>

Closes #75 from prabinb/python-api-backup and squashes the following commits:

cc3c6cd [Prabin Banka] Added missing Python APIs
2014-03-06 12:45:27 -08:00
Prashant Sharma 02836657cf SPARK-1109 wrong API docs for pyspark map function
Author: Prashant Sharma <prashant.s@imaginea.com>

Closes #73 from ScrapCodes/SPARK-1109/wrong-API-docs and squashes the following commits:

1a55b58 [Prashant Sharma] SPARK-1109 wrong API docs for pyspark map function
2014-03-04 15:32:43 -08:00
Bouke van der Bijl 12738c1aec SPARK-1115: Catch depickling errors
This surroungs the complete worker code in a try/except block so we catch any error that arrives. An example would be the depickling failing for some reason

@JoshRosen

Author: Bouke van der Bijl <boukevanderbijl@gmail.com>

Closes #644 from bouk/catch-depickling-errors and squashes the following commits:

f0f67cc [Bouke van der Bijl] Lol indentation
0e4d504 [Bouke van der Bijl] Surround the complete python worker with the try block
2014-02-26 14:51:21 -08:00
jyotiska 722199fab0 doctest updated for mapValues, flatMapValues in rdd.py
Updated doctests for mapValues and flatMapValues in rdd.py

Author: jyotiska <jyotiska123@gmail.com>

Closes #621 from jyotiska/python_spark and squashes the following commits:

716f7cd [jyotiska] doctest updated for mapValues, flatMapValues in rdd.py
2014-02-22 15:10:31 -08:00
jyotiska 3ff077d489 Fixed minor typo in worker.py
Fixed minor typo in worker.py

Author: jyotiska <jyotiska123@gmail.com>

Closes #630 from jyotiska/pyspark_code and squashes the following commits:

ee44201 [jyotiska] typo fixed in worker.py
2014-02-22 10:09:50 -08:00
Ahir Reddy 59b1379594 SPARK-1114: Allow PySpark to use existing JVM and Gateway
Patch to allow PySpark to use existing JVM and Gateway. Changes to PySpark implementation of SparkConf to take existing SparkConf JVM handle. Change to PySpark SparkContext to allow subclass specific context initialization.

Author: Ahir Reddy <ahirreddy@gmail.com>

Closes #622 from ahirreddy/pyspark-existing-jvm and squashes the following commits:

a86f457 [Ahir Reddy] Patch to allow PySpark to use existing JVM and Gateway. Changes to PySpark implementation of SparkConf to take existing SparkConf JVM handle. Change to PySpark SparkContext to allow subclass specific context initialization.
2014-02-20 21:20:39 -08:00
jyotiska 2ef37c9366 Merge pull request #562 from jyotiska/master. Closes #562.
Added example Python code for sort

I added an example Python code for sort. Right now, PySpark has limited examples for new people willing to use the project. This example code sorts integers stored in a file. I was able to sort 5 million, 10 million and 25 million integers with this code.

Author: jyotiska <jyotiska123@gmail.com>

== Merge branch commits ==

commit 8ad8faf6c8e02ae1cd68565d98524edf165f54df
Author: jyotiska <jyotiska123@gmail.com>
Date:   Sun Feb 9 11:00:41 2014 +0530

    Added comments in code on collect() method

commit 6f98f1e313f4472a7c2207d36c4f0fbcebc95a8c
Author: jyotiska <jyotiska123@gmail.com>
Date:   Sat Feb 8 13:12:37 2014 +0530

    Updated python example code sort.py

commit 945e39a5d68daa7e5bab0d96cbd35d7c4b04eafb
Author: jyotiska <jyotiska123@gmail.com>
Date:   Sat Feb 8 12:59:09 2014 +0530

    Added example python code for sort
2014-02-08 23:36:48 -08:00
Mark Hamstra c2341c92bb Merge pull request #542 from markhamstra/versionBump. Closes #542.
Version number to 1.0.0-SNAPSHOT

Since 0.9.0-incubating is done and out the door, we shouldn't be building 0.9.0-incubating-SNAPSHOT anymore.

@pwendell

Author: Mark Hamstra <markhamstra@gmail.com>

== Merge branch commits ==

commit 1b00a8a7c1a7f251b4bb3774b84b9e64758eaa71
Author: Mark Hamstra <markhamstra@gmail.com>
Date:   Wed Feb 5 09:30:32 2014 -0800

    Version number to 1.0.0-SNAPSHOT
2014-02-08 16:00:43 -08:00
Prashant Sharma 084839ba35 Merge pull request #498 from ScrapCodes/python-api. Closes #498.
Python api additions

Author: Prashant Sharma <prashant.s@imaginea.com>

== Merge branch commits ==

commit 8b51591f1a7a79a62c13ee66ff8d83040f7eccd8
Author: Prashant Sharma <prashant.s@imaginea.com>
Date:   Fri Jan 24 11:50:29 2014 +0530

    Josh's and Patricks review comments.

commit d37f9677838e43bef6c18ef61fbf08055ba6d1ca
Author: Prashant Sharma <prashant.s@imaginea.com>
Date:   Thu Jan 23 17:27:17 2014 +0530

    fixed doc tests

commit 27cb54bf5c99b1ea38a73858c291d0a1c43d8b7c
Author: Prashant Sharma <prashant.s@imaginea.com>
Date:   Thu Jan 23 16:48:43 2014 +0530

    Added keys and values methods for PairFunctions in python

commit 4ce76b396fbaefef2386d7a36d611572bdef9b5d
Author: Prashant Sharma <prashant.s@imaginea.com>
Date:   Thu Jan 23 13:51:26 2014 +0530

    Added foreachPartition

commit 05f05341a187cba829ac0e6c2bdf30be49948c89
Author: Prashant Sharma <prashant.s@imaginea.com>
Date:   Thu Jan 23 13:02:59 2014 +0530

    Added coalesce fucntion to python API

commit 6568d2c2fa14845dc56322c0f39ba2e13b3b26dd
Author: Prashant Sharma <prashant.s@imaginea.com>
Date:   Thu Jan 23 12:52:44 2014 +0530

    added repartition function to python API.
2014-02-06 14:58:35 -08:00
Josh Rosen 1381fc72f7 Switch from MUTF8 to UTF8 in PySpark serializers.
This fixes SPARK-1043, a bug introduced in 0.9.0
where PySpark couldn't serialize strings > 64kB.

This fix was written by @tyro89 and @bouk in #512.
This commit squashes and rebases their pull request
in order to fix some merge conflicts.
2014-01-28 20:20:08 -08:00
Reynold Xin c40619d487 Merge pull request #504 from JoshRosen/SPARK-1025
Fix PySpark hang when input files are deleted (SPARK-1025)

This pull request addresses [SPARK-1025](https://spark-project.atlassian.net/browse/SPARK-1025), an issue where PySpark could hang if its input files were deleted.
2014-01-25 22:41:30 -08:00
Josh Rosen 4cebb79c9f Deprecate mapPartitionsWithSplit in PySpark.
Also, replace the last reference to it in the docs.

This fixes SPARK-1026.
2014-01-23 20:01:36 -08:00
Josh Rosen f83068497b Fix for SPARK-1025: PySpark hang on missing files. 2014-01-23 18:24:51 -08:00
Josh Rosen 61569906cc Fix SPARK-978: ClassCastException in PySpark cartesian. 2014-01-23 15:09:19 -08:00
Josh Rosen 0035dbbc81 Fix SPARK-1034: Py4JException on PySpark Cartesian Result 2014-01-23 13:05:59 -08:00
Patrick Wendell 4c16f79ce4 Merge pull request #426 from mateiz/py-ml-tests
Re-enable Python MLlib tests (require Python 2.7 and NumPy 1.7+)

We disabled these earlier because Jenkins didn't have these versions.
2014-01-18 16:21:43 -08:00
Patrick Wendell bf5699543b Merge pull request #462 from mateiz/conf-file-fix
Remove Typesafe Config usage and conf files to fix nested property names

With Typesafe Config we had the subtle problem of no longer allowing
nested property names, which are used for a few of our properties:
http://apache-spark-developers-list.1001551.n3.nabble.com/Config-properties-broken-in-master-td208.html

This PR is for branch 0.9 but should be added into master too.
(cherry picked from commit 34e911ce9a)

Signed-off-by: Patrick Wendell <pwendell@gmail.com>
2014-01-18 16:20:00 -08:00
Matei Zaharia 5b3a3e28d7 Complain if Python and NumPy versions are too old for MLlib 2014-01-14 12:27:58 -08:00
Matei Zaharia 938e4a0e16 Re-enable Python MLlib tests (require Python 2.7 and NumPy 1.7+) 2014-01-14 12:14:48 -08:00
Matei Zaharia cc93c2abb1 Disable MLlib tests for now while Jenkins is still on Python 2.6 2014-01-13 20:46:46 -08:00
Matei Zaharia 5741078c46 Log Python exceptions to stderr as well
This helps in case the exception happened while serializing a record to
be sent to Java, leaving the stream to Java in an inconsistent state
where PythonRDD won't be able to read the error.
2014-01-12 00:10:41 -08:00
Matei Zaharia 4c28a2bad8 Update some Python MLlib parameters to use camelCase, and tweak docs
We've used camel case in other Spark methods so it felt reasonable to
keep using it here and make the code match Scala/Java as much as
possible. Note that parameter names matter in Python because it allows
passing optional parameters by name.
2014-01-11 22:30:48 -08:00
Matei Zaharia 9a0dfdf868 Add Naive Bayes to Python MLlib, and some API fixes
- Added a Python wrapper for Naive Bayes
- Updated the Scala Naive Bayes to match the style of our other
  algorithms better and in particular make it easier to call from Java
  (added builder pattern, removed default value in train method)
- Updated Python MLlib functions to not require a SparkContext; we can
  get that from the RDD the user gives
- Added a toString method in LabeledPoint
- Made the Python MLlib tests run as part of run-tests as well (before
  they could only be run individually through each file)
2014-01-11 22:30:48 -08:00
Hossein Falaki 3a8beb46cb Merge branch 'master' into MatrixFactorizationModel-fix 2014-01-07 15:22:42 -08:00
Hossein Falaki 754f5300a1 Added predictAll python function to MatrixFactorizationModel 2014-01-06 12:19:43 -08:00