Commit graph

229 commits

Author SHA1 Message Date
Haoyuan Li b50ddfde03 SPARK-1305: Support persisting RDD's directly to Tachyon
Move the PR#468 of apache-incubator-spark to the apache-spark
"Adding an option to persist Spark RDD blocks into Tachyon."

Author: Haoyuan Li <haoyuan@cs.berkeley.edu>
Author: RongGu <gurongwalker@gmail.com>

Closes #158 from RongGu/master and squashes the following commits:

72b7768 [Haoyuan Li] merge master
9f7fa1b [Haoyuan Li] fix code style
ae7834b [Haoyuan Li] minor cleanup
a8b3ec6 [Haoyuan Li] merge master branch
e0f4891 [Haoyuan Li] better check offheap.
55b5918 [RongGu] address matei's comment on the replication of offHeap storagelevel
7cd4600 [RongGu] remove some logic code for tachyonstore's replication
51149e7 [RongGu] address aaron's comment on returning value of the remove() function in tachyonstore
8adfcfa [RongGu] address arron's comment on inTachyonSize
120e48a [RongGu] changed the root-level dir name in Tachyon
5cc041c [Haoyuan Li] address aaron's comments
9b97935 [Haoyuan Li] address aaron's comments
d9a6438 [Haoyuan Li] fix for pspark
77d2703 [Haoyuan Li] change python api.git status
3dcace4 [Haoyuan Li] address matei's comments
91fa09d [Haoyuan Li] address patrick's comments
589eafe [Haoyuan Li] use TRY_CACHE instead of MUST_CACHE
64348b2 [Haoyuan Li] update conf docs.
ed73e19 [Haoyuan Li] Merge branch 'master' of github.com:RongGu/spark-1
619a9a8 [RongGu] set number of directories in TachyonStore back to 64; added a TODO tag for duplicated code from the DiskStore
be79d77 [RongGu] find a way to clean up some unnecessay metods and classed to make the code simpler
49cc724 [Haoyuan Li] update docs with off_headp option
4572f9f [RongGu] reserving the old apply function API of StorageLevel
04301d3 [RongGu] rename StorageLevel.TACHYON to Storage.OFF_HEAP
c9aeabf [RongGu] rename the StorgeLevel.TACHYON as StorageLevel.OFF_HEAP
76805aa [RongGu] unifies the config properties name prefix; add the configs into docs/configuration.md
e700d9c [RongGu] add the SparkTachyonHdfsLR example and some comments
fd84156 [RongGu] use randomUUID to generate sparkapp directory name on tachyon;minor code style fix
939e467 [Haoyuan Li] 0.4.1-thrift from maven central
86a2eab [Haoyuan Li] tachyon 0.4.1-thrift is in the staging repo. but jenkins failed to download it. temporarily revert it back to 0.4.1
16c5798 [RongGu] make the dependency on tachyon as tachyon-0.4.1-thrift
eacb2e8 [RongGu] Merge branch 'master' of https://github.com/RongGu/spark-1
bbeb4de [RongGu] fix the JsonProtocolSuite test failure problem
6adb58f [RongGu] Merge branch 'master' of https://github.com/RongGu/spark-1
d827250 [RongGu] fix JsonProtocolSuie test failure
716e93b [Haoyuan Li] revert the version
ca14469 [Haoyuan Li] bump tachyon version to 0.4.1-thrift
2825a13 [RongGu] up-merging to the current master branch of the apache spark
6a22c1a [Haoyuan Li] fix scalastyle
8968b67 [Haoyuan Li] exclude more libraries from tachyon dependency to be the same as referencing tachyon-client.
77be7e8 [RongGu] address mateiz's comment about the temp folder name problem. The implementation followed mateiz's advice.
1dcadf9 [Haoyuan Li] typo
bf278fa [Haoyuan Li] fix python tests
e82909c [Haoyuan Li] minor cleanup
776a56c [Haoyuan Li] address patrick's and ali's comments from the previous PR
8859371 [Haoyuan Li] various minor fixes and clean up
e3ddbba [Haoyuan Li] add doc to use Tachyon cache mode.
fcaeab2 [Haoyuan Li] address Aaron's comment
e554b1e [Haoyuan Li] add python code
47304b3 [Haoyuan Li] make tachyonStore in BlockMananger lazy val; add more comments StorageLevels.
dc8ef24 [Haoyuan Li] add old storelevel constructor
e01a271 [Haoyuan Li] update tachyon 0.4.1
8011a96 [RongGu] fix a brought-in mistake in StorageLevel
70ca182 [RongGu] a bit change in comment
556978b [RongGu] fix the scalastyle errors
791189b [RongGu] "Adding an option to persist Spark RDD blocks into Tachyon." move the PR#468 of apache-incubator-spark to the apache-spark
2014-04-04 20:38:20 -07:00
Matei Zaharia 60e18ce7dd SPARK-1414. Python API for SparkContext.wholeTextFiles
Also clarified comment on each file having to fit in memory

Author: Matei Zaharia <matei@databricks.com>

Closes #327 from mateiz/py-whole-files and squashes the following commits:

9ad64a5 [Matei Zaharia] SPARK-1414. Python API for SparkContext.wholeTextFiles
2014-04-04 17:29:29 -07:00
Prashant Sharma c1ea3afb51 Spark 1162 Implemented takeOrdered in pyspark.
Since python does not have a library for max heap and usual tricks like inverting values etc.. does not work for all cases.

We have our own implementation of max heap.

Author: Prashant Sharma <prashant.s@imaginea.com>

Closes #97 from ScrapCodes/SPARK-1162/pyspark-top-takeOrdered2 and squashes the following commits:

35f86ba [Prashant Sharma] code review
2b1124d [Prashant Sharma] fixed tests
e8a08e2 [Prashant Sharma] Code review comments.
49e6ba7 [Prashant Sharma] SPARK-1162 added takeOrdered to pyspark
2014-04-03 15:42:17 -07:00
Xiangrui Meng 9c65fa76f9 [SPARK-1212, Part II] Support sparse data in MLlib
In PR https://github.com/apache/spark/pull/117, we added dense/sparse vector data model and updated KMeans to support sparse input. This PR is to replace all other `Array[Double]` usage by `Vector` in generalized linear models (GLMs) and Naive Bayes. Major changes:

1. `LabeledPoint` becomes `LabeledPoint(Double, Vector)`.
2. Methods that accept `RDD[Array[Double]]` now accept `RDD[Vector]`. We cannot support both in an elegant way because of type erasure.
3. Mark 'createModel' and 'predictPoint' protected because they are not for end users.
4. Add libSVMFile to MLContext.
5. NaiveBayes can accept arbitrary labels (introducing a breaking change to Python's `NaiveBayesModel`).
6. Gradient computation no longer creates temp vectors.
7. Column normalization and centering are removed from Lasso and Ridge because the operation will densify the data. Simple feature transformation can be done before training.

TODO:
1. ~~Use axpy when possible.~~
2. ~~Optimize Naive Bayes.~~

Author: Xiangrui Meng <meng@databricks.com>

Closes #245 from mengxr/vector and squashes the following commits:

eb6e793 [Xiangrui Meng] move libSVMFile to MLUtils and rename to loadLibSVMData
c26c4fc [Xiangrui Meng] update DecisionTree to use RDD[Vector]
11999c7 [Xiangrui Meng] Merge branch 'master' into vector
f7da54b [Xiangrui Meng] add minSplits to libSVMFile
da25e24 [Xiangrui Meng] revert the change to default addIntercept because it might change the behavior of existing code without warning
493f26f [Xiangrui Meng] Merge branch 'master' into vector
7c1bc01 [Xiangrui Meng] add a TODO to NB
b9b7ef7 [Xiangrui Meng] change default value of addIntercept to false
b01df54 [Xiangrui Meng] allow to change or clear threshold in LR and SVM
4addc50 [Xiangrui Meng] merge master
4ca5b1b [Xiangrui Meng] remove normalization from Lasso and update tests
f04fe8a [Xiangrui Meng] remove normalization from RidgeRegression and update tests
d088552 [Xiangrui Meng] use static constructor for MLContext
6f59eed [Xiangrui Meng] update libSVMFile to determine number of features automatically
3432e84 [Xiangrui Meng] update NaiveBayes to support sparse data
0f8759b [Xiangrui Meng] minor updates to NB
b11659c [Xiangrui Meng] style update
78c4671 [Xiangrui Meng] add libSVMFile to MLContext
f0fe616 [Xiangrui Meng] add a test for sparse linear regression
44733e1 [Xiangrui Meng] use in-place gradient computation
e981396 [Xiangrui Meng] use axpy in Updater
db808a1 [Xiangrui Meng] update JavaLR example
befa592 [Xiangrui Meng] passed scala/java tests
75c83a4 [Xiangrui Meng] passed test compile
1859701 [Xiangrui Meng] passed compile
834ada2 [Xiangrui Meng] optimized MLUtils.computeStats update some ml algorithms to use Vector (cont.)
135ab72 [Xiangrui Meng] merge glm
0e57aa4 [Xiangrui Meng] update Lasso and RidgeRegression to parse the weights correctly from GLM mark createModel protected mark predictPoint protected
d7f629f [Xiangrui Meng] fix a bug in GLM when intercept is not used
3f346ba [Xiangrui Meng] update some ml algorithms to use Vector
2014-04-02 14:01:12 -07:00
Prashant Sharma a0853a39e1 SPARK-1322, top in pyspark should sort result in descending order.
Author: Prashant Sharma <prashant.s@imaginea.com>

Closes #235 from ScrapCodes/SPARK-1322/top-rev-sort and squashes the following commits:

f316266 [Prashant Sharma] Minor change in comment.
58e58c6 [Prashant Sharma] SPARK-1322, top in pyspark should sort result in descending order.
2014-03-26 09:16:37 -07:00
Jyotiska NK 67fa71cba2 Added doctest for map function in rdd.py
Doctest added for map in rdd.py

Author: Jyotiska NK <jyotiska123@gmail.com>

Closes #177 from jyotiska/pyspark_rdd_map_doctest and squashes the following commits:

a38527f [Jyotiska NK] Added doctest for map function in rdd.py
2014-03-19 14:04:45 -07:00
Dan McClary e3681f26fa Spark 1246 add min max to stat counter
Here's the addition of min and max to statscounter.py and min and max methods to rdd.py.

Author: Dan McClary <dan.mcclary@gmail.com>

Closes #144 from dwmclary/SPARK-1246-add-min-max-to-stat-counter and squashes the following commits:

fd3fd4b [Dan McClary] fixed  error, updated test
82cde0e [Dan McClary] flipped incorrectly assigned inf values in StatCounter
5d96799 [Dan McClary] added max and min to StatCounter repr for pyspark
21dd366 [Dan McClary] added max and min to StatCounter output, updated doc
1a97558 [Dan McClary] added max and min to StatCounter output, updated doc
a5c13b0 [Dan McClary] Added min and max to Scala and Java RDD, added min and max to StatCounter
ed67136 [Dan McClary] broke min/max out into separate transaction, added to rdd.py
1e7056d [Dan McClary] added underscore to getBucket
37a7dea [Dan McClary] cleaned up boundaries for histogram -- uses real min/max when buckets are derived
29981f2 [Dan McClary] fixed indentation on doctest comment
eaf89d9 [Dan McClary] added correct doctest for histogram
4916016 [Dan McClary] added histogram method, added max and min to statscounter
2014-03-18 00:45:47 -07:00
CodingCat dc9654638f SPARK-1240: handle the case of empty RDD when takeSample
https://spark-project.atlassian.net/browse/SPARK-1240

It seems that the current implementation does not handle the empty RDD case when run takeSample

In this patch, before calling sample() inside takeSample API, I add a checker for this case and returns an empty Array when it's a empty RDD; also in sample(), I add a checker for the invalid fraction value

In the test case, I also add several lines for this case

Author: CodingCat <zhunansjtu@gmail.com>

Closes #135 from CodingCat/SPARK-1240 and squashes the following commits:

fef57d4 [CodingCat] fix the same problem in PySpark
36db06b [CodingCat] create new test cases for takeSample from an empty red
810948d [CodingCat] further fix
a40e8fb [CodingCat] replace if with require
ad483fd [CodingCat] handle the case with empty RDD when take sample
2014-03-16 22:14:59 -07:00
Prashant Sharma b8afe30520 SPARK-1162 Added top in python.
Author: Prashant Sharma <prashant.s@imaginea.com>

Closes #93 from ScrapCodes/SPARK-1162/pyspark-top-takeOrdered and squashes the following commits:

ece1fa4 [Prashant Sharma] Added top in python.
2014-03-12 15:57:44 -07:00
prabinb af7f2f1090 Spark-1163, Added missing Python RDD functions
Author: prabinb <prabin.banka@imaginea.com>

Closes #92 from prabinb/python-api-rdd and squashes the following commits:

51129ca [prabinb] Added missing Python RDD functions Added __repr__ function to StorageLevel class. Added doctest for RDD.getStorageLevel().
2014-03-11 23:57:05 -07:00
Prashant Sharma a59419c27e SPARK-1168, Added foldByKey to pyspark.
Author: Prashant Sharma <prashant.s@imaginea.com>

Closes #115 from ScrapCodes/SPARK-1168/pyspark-foldByKey and squashes the following commits:

db6f67e [Prashant Sharma] SPARK-1168, Added foldByKey to pyspark.
2014-03-10 13:37:11 -07:00
jyotiska f5518989b6 [SPARK-972] Added detailed callsite info for ValueError in context.py (resubmitted)
Author: jyotiska <jyotiska123@gmail.com>

Closes #34 from jyotiska/pyspark_code and squashes the following commits:

c9439be [jyotiska] replaced dict with namedtuple
a6bf4cd [jyotiska] added callsite info for context.py
2014-03-10 13:34:49 -07:00
Prabin Banka e1e09e0ef6 SPARK-977 Added Python RDD.zip function
was raised earlier as a part of  apache/incubator-spark#486

Author: Prabin Banka <prabin.banka@imaginea.com>

Closes #76 from prabinb/python-api-zip and squashes the following commits:

b1a31a0 [Prabin Banka] Added Python RDD.zip function
2014-03-10 13:27:00 -07:00
Aaron Davidson 52834d761b SPARK-929: Fully deprecate usage of SPARK_MEM
(Continued from old repo, prior discussion at https://github.com/apache/incubator-spark/pull/615)

This patch cements our deprecation of the SPARK_MEM environment variable by replacing it with three more specialized variables:
SPARK_DAEMON_MEMORY, SPARK_EXECUTOR_MEMORY, and SPARK_DRIVER_MEMORY

The creation of the latter two variables means that we can safely set driver/job memory without accidentally setting the executor memory. Neither is public.

SPARK_EXECUTOR_MEMORY is only used by the Mesos scheduler (and set within SparkContext). The proper way of configuring executor memory is through the "spark.executor.memory" property.

SPARK_DRIVER_MEMORY is the new way of specifying the amount of memory run by jobs launched by spark-class, without possibly affecting executor memory.

Other memory considerations:
- The repl's memory can be set through the "--drivermem" command-line option, which really just sets SPARK_DRIVER_MEMORY.
- run-example doesn't use spark-class, so the only way to modify examples' memory is actually an unusual use of SPARK_JAVA_OPTS (which is normally overriden in all cases by spark-class).

This patch also fixes a lurking bug where spark-shell misused spark-class (the first argument is supposed to be the main class name, not java options), as well as a bug in the Windows spark-class2.cmd. I have not yet tested this patch on either Windows or Mesos, however.

Author: Aaron Davidson <aaron@databricks.com>

Closes #99 from aarondav/sparkmem and squashes the following commits:

9df4c68 [Aaron Davidson] SPARK-929: Fully deprecate usage of SPARK_MEM
2014-03-09 11:08:39 -07:00
Prashant Sharma 6e730edcde Spark 1165 rdd.intersection in python and java
Author: Prashant Sharma <prashant.s@imaginea.com>
Author: Prashant Sharma <scrapcodes@gmail.com>

Closes #80 from ScrapCodes/SPARK-1165/RDD.intersection and squashes the following commits:

9b015e9 [Prashant Sharma] Added a note, shuffle is required for intersection.
1fea813 [Prashant Sharma] correct the lines wrapping
d0c71f3 [Prashant Sharma] SPARK-1165 RDD.intersection in java
d6effee [Prashant Sharma] SPARK-1165 Implemented RDD.intersection in python.
2014-03-07 18:48:07 -08:00
Prabin Banka 3d3acef047 SPARK-1187, Added missing Python APIs
The following Python APIs are added,
RDD.id()
SparkContext.setJobGroup()
SparkContext.setLocalProperty()
SparkContext.getLocalProperty()
SparkContext.sparkUser()

was raised earlier as a part of  apache/incubator-spark#486

Author: Prabin Banka <prabin.banka@imaginea.com>

Closes #75 from prabinb/python-api-backup and squashes the following commits:

cc3c6cd [Prabin Banka] Added missing Python APIs
2014-03-06 12:45:27 -08:00
Prashant Sharma 02836657cf SPARK-1109 wrong API docs for pyspark map function
Author: Prashant Sharma <prashant.s@imaginea.com>

Closes #73 from ScrapCodes/SPARK-1109/wrong-API-docs and squashes the following commits:

1a55b58 [Prashant Sharma] SPARK-1109 wrong API docs for pyspark map function
2014-03-04 15:32:43 -08:00
Bouke van der Bijl 12738c1aec SPARK-1115: Catch depickling errors
This surroungs the complete worker code in a try/except block so we catch any error that arrives. An example would be the depickling failing for some reason

@JoshRosen

Author: Bouke van der Bijl <boukevanderbijl@gmail.com>

Closes #644 from bouk/catch-depickling-errors and squashes the following commits:

f0f67cc [Bouke van der Bijl] Lol indentation
0e4d504 [Bouke van der Bijl] Surround the complete python worker with the try block
2014-02-26 14:51:21 -08:00
jyotiska 722199fab0 doctest updated for mapValues, flatMapValues in rdd.py
Updated doctests for mapValues and flatMapValues in rdd.py

Author: jyotiska <jyotiska123@gmail.com>

Closes #621 from jyotiska/python_spark and squashes the following commits:

716f7cd [jyotiska] doctest updated for mapValues, flatMapValues in rdd.py
2014-02-22 15:10:31 -08:00
jyotiska 3ff077d489 Fixed minor typo in worker.py
Fixed minor typo in worker.py

Author: jyotiska <jyotiska123@gmail.com>

Closes #630 from jyotiska/pyspark_code and squashes the following commits:

ee44201 [jyotiska] typo fixed in worker.py
2014-02-22 10:09:50 -08:00
Ahir Reddy 59b1379594 SPARK-1114: Allow PySpark to use existing JVM and Gateway
Patch to allow PySpark to use existing JVM and Gateway. Changes to PySpark implementation of SparkConf to take existing SparkConf JVM handle. Change to PySpark SparkContext to allow subclass specific context initialization.

Author: Ahir Reddy <ahirreddy@gmail.com>

Closes #622 from ahirreddy/pyspark-existing-jvm and squashes the following commits:

a86f457 [Ahir Reddy] Patch to allow PySpark to use existing JVM and Gateway. Changes to PySpark implementation of SparkConf to take existing SparkConf JVM handle. Change to PySpark SparkContext to allow subclass specific context initialization.
2014-02-20 21:20:39 -08:00
Mark Hamstra c2341c92bb Merge pull request #542 from markhamstra/versionBump. Closes #542.
Version number to 1.0.0-SNAPSHOT

Since 0.9.0-incubating is done and out the door, we shouldn't be building 0.9.0-incubating-SNAPSHOT anymore.

@pwendell

Author: Mark Hamstra <markhamstra@gmail.com>

== Merge branch commits ==

commit 1b00a8a7c1a7f251b4bb3774b84b9e64758eaa71
Author: Mark Hamstra <markhamstra@gmail.com>
Date:   Wed Feb 5 09:30:32 2014 -0800

    Version number to 1.0.0-SNAPSHOT
2014-02-08 16:00:43 -08:00
Prashant Sharma 084839ba35 Merge pull request #498 from ScrapCodes/python-api. Closes #498.
Python api additions

Author: Prashant Sharma <prashant.s@imaginea.com>

== Merge branch commits ==

commit 8b51591f1a7a79a62c13ee66ff8d83040f7eccd8
Author: Prashant Sharma <prashant.s@imaginea.com>
Date:   Fri Jan 24 11:50:29 2014 +0530

    Josh's and Patricks review comments.

commit d37f9677838e43bef6c18ef61fbf08055ba6d1ca
Author: Prashant Sharma <prashant.s@imaginea.com>
Date:   Thu Jan 23 17:27:17 2014 +0530

    fixed doc tests

commit 27cb54bf5c99b1ea38a73858c291d0a1c43d8b7c
Author: Prashant Sharma <prashant.s@imaginea.com>
Date:   Thu Jan 23 16:48:43 2014 +0530

    Added keys and values methods for PairFunctions in python

commit 4ce76b396fbaefef2386d7a36d611572bdef9b5d
Author: Prashant Sharma <prashant.s@imaginea.com>
Date:   Thu Jan 23 13:51:26 2014 +0530

    Added foreachPartition

commit 05f05341a187cba829ac0e6c2bdf30be49948c89
Author: Prashant Sharma <prashant.s@imaginea.com>
Date:   Thu Jan 23 13:02:59 2014 +0530

    Added coalesce fucntion to python API

commit 6568d2c2fa14845dc56322c0f39ba2e13b3b26dd
Author: Prashant Sharma <prashant.s@imaginea.com>
Date:   Thu Jan 23 12:52:44 2014 +0530

    added repartition function to python API.
2014-02-06 14:58:35 -08:00
Josh Rosen 1381fc72f7 Switch from MUTF8 to UTF8 in PySpark serializers.
This fixes SPARK-1043, a bug introduced in 0.9.0
where PySpark couldn't serialize strings > 64kB.

This fix was written by @tyro89 and @bouk in #512.
This commit squashes and rebases their pull request
in order to fix some merge conflicts.
2014-01-28 20:20:08 -08:00
Reynold Xin c40619d487 Merge pull request #504 from JoshRosen/SPARK-1025
Fix PySpark hang when input files are deleted (SPARK-1025)

This pull request addresses [SPARK-1025](https://spark-project.atlassian.net/browse/SPARK-1025), an issue where PySpark could hang if its input files were deleted.
2014-01-25 22:41:30 -08:00
Josh Rosen 4cebb79c9f Deprecate mapPartitionsWithSplit in PySpark.
Also, replace the last reference to it in the docs.

This fixes SPARK-1026.
2014-01-23 20:01:36 -08:00
Josh Rosen f83068497b Fix for SPARK-1025: PySpark hang on missing files. 2014-01-23 18:24:51 -08:00
Josh Rosen 61569906cc Fix SPARK-978: ClassCastException in PySpark cartesian. 2014-01-23 15:09:19 -08:00
Josh Rosen 0035dbbc81 Fix SPARK-1034: Py4JException on PySpark Cartesian Result 2014-01-23 13:05:59 -08:00
Patrick Wendell 4c16f79ce4 Merge pull request #426 from mateiz/py-ml-tests
Re-enable Python MLlib tests (require Python 2.7 and NumPy 1.7+)

We disabled these earlier because Jenkins didn't have these versions.
2014-01-18 16:21:43 -08:00
Patrick Wendell bf5699543b Merge pull request #462 from mateiz/conf-file-fix
Remove Typesafe Config usage and conf files to fix nested property names

With Typesafe Config we had the subtle problem of no longer allowing
nested property names, which are used for a few of our properties:
http://apache-spark-developers-list.1001551.n3.nabble.com/Config-properties-broken-in-master-td208.html

This PR is for branch 0.9 but should be added into master too.
(cherry picked from commit 34e911ce9a)

Signed-off-by: Patrick Wendell <pwendell@gmail.com>
2014-01-18 16:20:00 -08:00
Matei Zaharia 5b3a3e28d7 Complain if Python and NumPy versions are too old for MLlib 2014-01-14 12:27:58 -08:00
Matei Zaharia 5741078c46 Log Python exceptions to stderr as well
This helps in case the exception happened while serializing a record to
be sent to Java, leaving the stream to Java in an inconsistent state
where PythonRDD won't be able to read the error.
2014-01-12 00:10:41 -08:00
Matei Zaharia 4c28a2bad8 Update some Python MLlib parameters to use camelCase, and tweak docs
We've used camel case in other Spark methods so it felt reasonable to
keep using it here and make the code match Scala/Java as much as
possible. Note that parameter names matter in Python because it allows
passing optional parameters by name.
2014-01-11 22:30:48 -08:00
Matei Zaharia 9a0dfdf868 Add Naive Bayes to Python MLlib, and some API fixes
- Added a Python wrapper for Naive Bayes
- Updated the Scala Naive Bayes to match the style of our other
  algorithms better and in particular make it easier to call from Java
  (added builder pattern, removed default value in train method)
- Updated Python MLlib functions to not require a SparkContext; we can
  get that from the RDD the user gives
- Added a toString method in LabeledPoint
- Made the Python MLlib tests run as part of run-tests as well (before
  they could only be run individually through each file)
2014-01-11 22:30:48 -08:00
Hossein Falaki 3a8beb46cb Merge branch 'master' into MatrixFactorizationModel-fix 2014-01-07 15:22:42 -08:00
Hossein Falaki 754f5300a1 Added predictAll python function to MatrixFactorizationModel 2014-01-06 12:19:43 -08:00
Hossein Falaki 04132ea9b2 Added Rating deserializer 2014-01-06 12:19:08 -08:00
Hossein Falaki 8d0c2f7399 Added python binding for bulk recommendation 2014-01-04 16:23:17 -08:00
Patrick Wendell 604fad9c39 Merge remote-tracking branch 'apache-github/master' into remove-binaries
Conflicts:
	core/src/test/scala/org/apache/spark/DriverSuite.scala
	docs/python-programming-guide.md
2014-01-03 21:29:33 -08:00
Patrick Wendell 9e6f3bdcda Changes on top of Prashant's patch.
Closes #316
2014-01-03 18:30:17 -08:00
Patrick Wendell 4ae101ff38 Merge pull request #317 from ScrapCodes/spark-915-segregate-scripts
Spark-915 segregate scripts
2014-01-03 11:24:35 -08:00
Prashant Sharma 74ba97fcf7 sbin/spark-class* -> bin/spark-class* 2014-01-03 15:08:01 +05:30
Matei Zaharia ca67909cd4 Merge pull request #311 from tmyklebu/master
SPARK-991: Report information gleaned from a Python stacktrace in the UI

Scala:

- Added setCallSite/clearCallSite to SparkContext and JavaSparkContext.
  These functions mutate a LocalProperty called "externalCallSite."
- Add a wrapper, getCallSite, that checks for an externalCallSite and, if
  none is found, calls the usual Utils.formatSparkCallSite.
- Change everything that calls Utils.formatSparkCallSite to call
  getCallSite instead. Except getCallSite.
- Add wrappers to setCallSite/clearCallSite wrappers to JavaSparkContext.

Python:

- Add a gruesome hack to rdd.py that inspects the traceback and guesses
  what you want to see in the UI.
- Add a RAII wrapper around said gruesome hack that calls
  setCallSite/clearCallSite as appropriate.
- Wire said RAII wrapper up around three calls into the Scala code.
  I'm not sure that I hit all the spots with the RAII wrapper. I'm also
  not sure that my gruesome hack does exactly what we want.

One could also approach this change by refactoring
runJob/submitJob/runApproximateJob to take a call site, then threading
that parameter through everything that needs to know it.

One might object to the pointless-looking wrappers in JavaSparkContext.
Unfortunately, I can't directly access the SparkContext from
Python---or, if I can, I don't know how---so I need to wrap everything
that matters in JavaSparkContext.

Conflicts:
	core/src/main/scala/org/apache/spark/api/java/JavaSparkContext.scala
2014-01-02 15:54:54 -05:00
Prashant Sharma a3f90a2ecf pyspark -> bin/pyspark 2014-01-02 18:50:12 +05:30
Prashant Sharma 980afd280a Merge branch 'scripts-reorg' of github.com:shane-huang/incubator-spark into spark-915-segregate-scripts
Conflicts:
	bin/spark-shell
	core/pom.xml
	core/src/main/scala/org/apache/spark/SparkContext.scala
	core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/CoarseMesosSchedulerBackend.scala
	core/src/main/scala/org/apache/spark/ui/UIWorkloadGenerator.scala
	core/src/test/scala/org/apache/spark/DriverSuite.scala
	python/run-tests
	sbin/compute-classpath.sh
	sbin/spark-class
	sbin/stop-slaves.sh
2014-01-02 17:55:21 +05:30
Matei Zaharia 7e8d2e8a5c Fix Python code after change of getOrElse 2014-01-01 23:21:34 -05:00
Matei Zaharia e2c68642c6 Miscellaneous fixes from code review.
Also replaced SparkConf.getOrElse with just a "get" that takes a default
value, and added getInt, getLong, etc to make code that uses this
simpler later on.
2014-01-01 22:03:39 -05:00
Matei Zaharia ba9338f104 Merge remote-tracking branch 'apache/master' into conf2
Conflicts:
	core/src/main/scala/org/apache/spark/rdd/CheckpointRDD.scala
	streaming/src/main/scala/org/apache/spark/streaming/Checkpoint.scala
	streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobGenerator.scala
2013-12-31 18:23:14 -05:00
Patrick Wendell 55b7e2fdff Merge pull request #289 from tdas/filestream-fix
Bug fixes for file input stream and checkpointing

- Fixed bugs in the file input stream that led the stream to fail due to transient HDFS errors (listing files when a background thread it deleting fails caused errors, etc.)
- Updated Spark's CheckpointRDD and Streaming's CheckpointWriter to use SparkContext.hadoopConfiguration, to allow checkpoints to be written to any HDFS compatible store requiring special configuration.
- Changed the API of SparkContext.setCheckpointDir() - eliminated the unnecessary 'useExisting' parameter. Now SparkContext will always create a unique subdirectory within the user specified checkpoint directory. This is to ensure that previous checkpoint files are not accidentally overwritten.
- Fixed bug where setting checkpoint directory as a relative local path caused the checkpointing to fail.
2013-12-31 10:12:51 -08:00
Matei Zaharia 0fa5809768 Updated docs for SparkConf and handled review comments 2013-12-30 22:17:28 -05:00
Matei Zaharia 994f080f8a Properly show Spark properties on web UI, and change app name property 2013-12-29 22:19:33 -05:00
Matei Zaharia eaa8a68ff0 Fix some Python docs and make sure to unset SPARK_TESTING in Python
tests so we don't get the test spark.conf on the classpath.
2013-12-29 20:15:07 -05:00
Matei Zaharia b4ceed40d6 Merge remote-tracking branch 'origin/master' into conf2
Conflicts:
	core/src/main/scala/org/apache/spark/SparkContext.scala
	core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
	core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala
	core/src/main/scala/org/apache/spark/scheduler/cluster/ClusterTaskSetManager.scala
	core/src/main/scala/org/apache/spark/scheduler/local/LocalScheduler.scala
	core/src/main/scala/org/apache/spark/util/MetadataCleaner.scala
	core/src/test/scala/org/apache/spark/scheduler/TaskResultGetterSuite.scala
	core/src/test/scala/org/apache/spark/scheduler/TaskSetManagerSuite.scala
	new-yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala
	streaming/src/main/scala/org/apache/spark/streaming/Checkpoint.scala
	streaming/src/main/scala/org/apache/spark/streaming/api/java/JavaStreamingContext.scala
	streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobGenerator.scala
	streaming/src/test/scala/org/apache/spark/streaming/BasicOperationsSuite.scala
	streaming/src/test/scala/org/apache/spark/streaming/CheckpointSuite.scala
	streaming/src/test/scala/org/apache/spark/streaming/InputStreamsSuite.scala
	streaming/src/test/scala/org/apache/spark/streaming/TestSuiteBase.scala
	streaming/src/test/scala/org/apache/spark/streaming/WindowOperationsSuite.scala
2013-12-29 15:08:08 -05:00
Matei Zaharia 58c6fa2041 Add Python docs about SparkConf 2013-12-29 14:46:59 -05:00
Matei Zaharia 615fb649d6 Fix some other Python tests due to initializing JVM in a different way
The test in context.py created two different instances of the
SparkContext class by copying "globals", so that some tests can have a
global "sc" object and others can try initializing their own contexts.
This led to two JVM gateways being created since SparkConf also looked
at pyspark.context.SparkContext to get the JVM.
2013-12-29 14:32:05 -05:00
Matei Zaharia cd00225db9 Add SparkConf support in Python 2013-12-29 14:03:39 -05:00
Matei Zaharia 1c11f54a9b Fix Python use of getLocalDir 2013-12-29 00:11:36 -05:00
Tor Myklebust fec01664a7 Make Python function/line appear in the UI. 2013-12-28 23:34:16 -05:00
Matei Zaharia c344ed04c7 Merge pull request #283 from tmyklebu/master
Python bindings for mllib

This pull request contains Python bindings for the regression, clustering, classification, and recommendation tools in mllib.

For each 'train' frontend exposed, there is a Scala stub in PythonMLLibAPI.scala and a Python stub in mllib.py.  The Python stub serialises the input RDD and any vector/matrix arguments into a mutually-understood format and calls the Scala stub.  The Scala stub deserialises the RDD and the vector/matrix arguments, calls the appropriate 'train' function, serialises the resulting model, and returns the serialised model.

ALSModel is slightly different since a MatrixFactorizationModel has RDDs inside.  The Scala stub returns a handle to a Scala MatrixFactorizationModel; prediction is done by calling the Scala predict method.

I have tested these bindings on an x86_64 machine running Linux.  There is a risk that these bindings may fail on some choose-your-own-endian platform if Python's endian differs from java.nio.ByteBuffer's idea of the native byte order.
2013-12-26 01:31:06 -05:00
Tor Myklebust 9cbcf81453 Remove commented code in __init__.py. 2013-12-25 14:12:42 -05:00
Tor Myklebust 5e71354cb7 Fix copypasta in __init__.py. Don't import anything directly into pyspark.mllib. 2013-12-25 14:10:55 -05:00
Tor Myklebust 02208a175c Initial weights in Scala are ones; do that too. Also fix some errors. 2013-12-25 00:53:48 -05:00
Tor Myklebust 05163057a1 Split the mllib bindings into a whole bunch of modules and rename some things. 2013-12-25 00:08:05 -05:00
Andrew Ash 3665c722b5 Typo: avaiable -> available 2013-12-24 17:25:04 -08:00
Tathagata Das d4dfab503a Fixed Python API for sc.setCheckpointDir. Also other fixes based on Reynold's comments on PR 289. 2013-12-24 14:01:13 -08:00
Tor Myklebust 86e38c4942 Remove useless line from test stub. 2013-12-24 16:49:31 -05:00
Tor Myklebust 4efec6eb94 Python change for move of PythonMLLibAPI. 2013-12-24 16:49:03 -05:00
Tor Myklebust cbb2811189 Release JVM reference to the ALSModel when done. 2013-12-22 15:03:58 -05:00
Tor Myklebust 076fc16221 Python stubs for ALSModel. 2013-12-21 14:54:01 -05:00
Tor Myklebust 0b494c2167 Un-semicolon mllib.py. 2013-12-20 02:05:55 -05:00
Tor Myklebust 0a5cacb961 Change some docstrings and add some others. 2013-12-20 02:05:15 -05:00
Tor Myklebust b835ddf3df Licence notice. 2013-12-20 01:55:03 -05:00
Tor Myklebust d89cc1e28a Whitespace. 2013-12-20 01:50:42 -05:00
Tor Myklebust 319520b9bb Remove gigantic endian-specific test and exception tests. 2013-12-20 01:48:44 -05:00
Tor Myklebust 2940201ad8 Tests for the Python side of the mllib bindings. 2013-12-20 01:33:32 -05:00
Tor Myklebust 73e17064c6 Python stubs for classification and clustering. 2013-12-20 00:12:48 -05:00
Tor Myklebust 2328bdd00f Python side of python bindings for linear, Lasso, and ridge regression 2013-12-19 22:45:16 -05:00
Reynold Xin 7990c56375 Merge pull request #276 from shivaram/collectPartition
Add collectPartition to JavaRDD interface.

This interface is useful for implementing `take` from other language frontends where the data is serialized. Also remove `takePartition` from PythonRDD and use `collectPartition` in rdd.py.

Thanks @concretevitamin for the original change and tests.
2013-12-19 13:35:09 -08:00
Shivaram Venkataraman d3234f9726 Make collectPartitions take an array of partitions
Change the implementation to use runJob instead of PartitionPruningRDD.
Also update the unit tests and the python take implementation
to use the new interface.
2013-12-19 11:40:34 -08:00
Nick Pentreath a76f53416c Add toString to Java RDD, and __repr__ to Python RDD 2013-12-19 14:38:20 +02:00
Tor Myklebust bf20591a00 Incorporate most of Josh's style suggestions. I don't want to deal with the type and length checking errors until we've got at least one working stub that we're all happy with. 2013-12-19 03:40:57 -05:00
Tor Myklebust bf491bb3c0 The rest of the Python side of those bindings. 2013-12-19 01:29:51 -05:00
Tor Myklebust 95915f8b3b First cut at python mllib bindings. Only LinearRegression is supported. 2013-12-19 01:29:09 -05:00
Shivaram Venkataraman af0cd6bd27 Add collectPartition to JavaRDD interface.
Also remove takePartition from PythonRDD and use collectPartition in rdd.py.
2013-12-18 11:40:07 -08:00
Prashant Sharma 603af51bb5 Merge branch 'master' into akka-bug-fix
Conflicts:
	core/pom.xml
	core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala
	pom.xml
	project/SparkBuild.scala
	streaming/pom.xml
	yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocationHandler.scala
2013-12-11 10:21:53 +05:30
Josh Rosen 3787f514d9 Fix UnicodeEncodeError in PySpark saveAsTextFile().
Fixes SPARK-970.
2013-11-28 23:44:56 -08:00
Prashant Sharma 17987778da Merge branch 'master' into wip-scala-2.10
Conflicts:
	core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala
	core/src/main/scala/org/apache/spark/rdd/MapPartitionsRDD.scala
	core/src/main/scala/org/apache/spark/rdd/MapPartitionsWithContextRDD.scala
	core/src/main/scala/org/apache/spark/rdd/RDD.scala
	python/pyspark/rdd.py
2013-11-27 14:44:12 +05:30
Josh Rosen 1b74a27da0 Removed unused basestring case from dump_stream. 2013-11-26 14:35:12 -08:00
Raymond Liu 0f2e3c6e31 Merge branch 'master' into scala-2.10 2013-11-13 16:55:11 +08:00
Josh Rosen 13122ceb8c FramedSerializer: _dumps => dumps, _loads => loads. 2013-11-10 17:53:25 -08:00
Josh Rosen ffa5bedf46 Send PySpark commands as bytes insetad of strings. 2013-11-10 16:46:00 -08:00
Josh Rosen cbb7f04aef Add custom serializer support to PySpark.
For now, this only adds MarshalSerializer, but it lays the groundwork
for other supporting custom serializers.  Many of these mechanisms
can also be used to support deserialization of different data formats
sent by Java, such as data encoded by MsgPack.

This also fixes a bug in SparkContext.union().
2013-11-10 16:45:38 -08:00
Josh Rosen 7d68a81a8e Remove Pickle-wrapping of Java objects in PySpark.
If we support custom serializers, the Python
worker will know what type of input to expect,
so we won't need to wrap Tuple2 and Strings into
pickled tuples and strings.
2013-11-03 11:03:02 -08:00
Josh Rosen a48d88d206 Replace magic lengths with constants in PySpark.
Write the length of the accumulators section up-front rather
than terminating it with a negative length.  I find this
easier to read.
2013-11-03 10:54:24 -08:00
Ewen Cheslack-Postava 317a9eb1ce Pass self to SparkContext._ensure_initialized.
The constructor for SparkContext should pass in self so that we track
the current context and produce errors if another one is created. Add
a doctest to make sure creating multiple contexts triggers the
exception.
2013-10-22 11:26:49 -07:00
Ewen Cheslack-Postava 56d230e614 Add classmethod to SparkContext to set system properties.
Add a new classmethod to SparkContext to set system properties like is
possible in Scala/Java. Unlike the Java/Scala implementations, there's
no access to System until the JVM bridge is created. Since
SparkContext handles that, move the initialization of the JVM
connection to a separate classmethod that can safely be called
repeatedly as long as the same instance (or no instance) is provided.
2013-10-22 00:22:37 -07:00
Ewen Cheslack-Postava 7eaa56de7f Add an add() method to pyspark accumulators.
Add a regular method for adding a term to accumulators in
pyspark. Currently if you have a non-global accumulator, adding to it
is awkward. The += operator can't be used for non-global accumulators
captured via closure because it's involves an assignment. The only way
to do it is using __iadd__ directly.

Adding this method lets you write code like this:

def main():
    sc = SparkContext()
    accum = sc.accumulator(0)

    rdd = sc.parallelize([1,2,3])
    def f(x):
        accum.add(x)
    rdd.foreach(f)
    print accum.value

where using accum += x instead would have caused UnboundLocalError
exceptions in workers. Currently it would have to be written as
accum.__iadd__(x).
2013-10-19 19:55:39 -07:00
Prashant Sharma 026ab75661 Merge branch 'master' of github.com:apache/incubator-spark into scala-2.10 2013-10-10 09:42:55 +05:30
Matei Zaharia 478b2b7edc Fix PySpark docs and an overly long line of code after fdbae41e 2013-10-09 12:08:04 -07:00