ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Matei Zaharia	938e4a0e16	Re-enable Python MLlib tests (require Python 2.7 and NumPy 1.7+)	2014-01-14 12:14:48 -08:00
Matei Zaharia	cc93c2abb1	Disable MLlib tests for now while Jenkins is still on Python 2.6	2014-01-13 20:46:46 -08:00
Matei Zaharia	5741078c46	Log Python exceptions to stderr as well This helps in case the exception happened while serializing a record to be sent to Java, leaving the stream to Java in an inconsistent state where PythonRDD won't be able to read the error.	2014-01-12 00:10:41 -08:00
Matei Zaharia	4c28a2bad8	Update some Python MLlib parameters to use camelCase, and tweak docs We've used camel case in other Spark methods so it felt reasonable to keep using it here and make the code match Scala/Java as much as possible. Note that parameter names matter in Python because it allows passing optional parameters by name.	2014-01-11 22:30:48 -08:00
Matei Zaharia	9a0dfdf868	Add Naive Bayes to Python MLlib, and some API fixes - Added a Python wrapper for Naive Bayes - Updated the Scala Naive Bayes to match the style of our other algorithms better and in particular make it easier to call from Java (added builder pattern, removed default value in train method) - Updated Python MLlib functions to not require a SparkContext; we can get that from the RDD the user gives - Added a toString method in LabeledPoint - Made the Python MLlib tests run as part of run-tests as well (before they could only be run individually through each file)	2014-01-11 22:30:48 -08:00
Hossein Falaki	3a8beb46cb	Merge branch 'master' into MatrixFactorizationModel-fix	2014-01-07 15:22:42 -08:00
Hossein Falaki	754f5300a1	Added predictAll python function to MatrixFactorizationModel	2014-01-06 12:19:43 -08:00
Hossein Falaki	04132ea9b2	Added Rating deserializer	2014-01-06 12:19:08 -08:00
Hossein Falaki	8d0c2f7399	Added python binding for bulk recommendation	2014-01-04 16:23:17 -08:00
Patrick Wendell	604fad9c39	Merge remote-tracking branch 'apache-github/master' into remove-binaries Conflicts: core/src/test/scala/org/apache/spark/DriverSuite.scala docs/python-programming-guide.md	2014-01-03 21:29:33 -08:00
Patrick Wendell	9e6f3bdcda	Changes on top of Prashant's patch. Closes #316	2014-01-03 18:30:17 -08:00
Patrick Wendell	4ae101ff38	Merge pull request #317 from ScrapCodes/spark-915-segregate-scripts Spark-915 segregate scripts	2014-01-03 11:24:35 -08:00
Prashant Sharma	74ba97fcf7	sbin/spark-class* -> bin/spark-class*	2014-01-03 15:08:01 +05:30
Prashant Sharma	94f2fffa23	fixed review comments	2014-01-03 14:43:37 +05:30
Matei Zaharia	ca67909cd4	Merge pull request #311 from tmyklebu/master SPARK-991: Report information gleaned from a Python stacktrace in the UI Scala: - Added setCallSite/clearCallSite to SparkContext and JavaSparkContext. These functions mutate a LocalProperty called "externalCallSite." - Add a wrapper, getCallSite, that checks for an externalCallSite and, if none is found, calls the usual Utils.formatSparkCallSite. - Change everything that calls Utils.formatSparkCallSite to call getCallSite instead. Except getCallSite. - Add wrappers to setCallSite/clearCallSite wrappers to JavaSparkContext. Python: - Add a gruesome hack to rdd.py that inspects the traceback and guesses what you want to see in the UI. - Add a RAII wrapper around said gruesome hack that calls setCallSite/clearCallSite as appropriate. - Wire said RAII wrapper up around three calls into the Scala code. I'm not sure that I hit all the spots with the RAII wrapper. I'm also not sure that my gruesome hack does exactly what we want. One could also approach this change by refactoring runJob/submitJob/runApproximateJob to take a call site, then threading that parameter through everything that needs to know it. One might object to the pointless-looking wrappers in JavaSparkContext. Unfortunately, I can't directly access the SparkContext from Python---or, if I can, I don't know how---so I need to wrap everything that matters in JavaSparkContext. Conflicts: core/src/main/scala/org/apache/spark/api/java/JavaSparkContext.scala	2014-01-02 15:54:54 -05:00
Prashant Sharma	a3f90a2ecf	pyspark -> bin/pyspark	2014-01-02 18:50:12 +05:30
Prashant Sharma	980afd280a	Merge branch 'scripts-reorg' of github.com:shane-huang/incubator-spark into spark-915-segregate-scripts Conflicts: bin/spark-shell core/pom.xml core/src/main/scala/org/apache/spark/SparkContext.scala core/src/main/scala/org/apache/spark/scheduler/cluster/mesos/CoarseMesosSchedulerBackend.scala core/src/main/scala/org/apache/spark/ui/UIWorkloadGenerator.scala core/src/test/scala/org/apache/spark/DriverSuite.scala python/run-tests sbin/compute-classpath.sh sbin/spark-class sbin/stop-slaves.sh	2014-01-02 17:55:21 +05:30
Matei Zaharia	7e8d2e8a5c	Fix Python code after change of getOrElse	2014-01-01 23:21:34 -05:00
Matei Zaharia	e2c68642c6	Miscellaneous fixes from code review. Also replaced SparkConf.getOrElse with just a "get" that takes a default value, and added getInt, getLong, etc to make code that uses this simpler later on.	2014-01-01 22:03:39 -05:00
Matei Zaharia	ba9338f104	Merge remote-tracking branch 'apache/master' into conf2 Conflicts: core/src/main/scala/org/apache/spark/rdd/CheckpointRDD.scala streaming/src/main/scala/org/apache/spark/streaming/Checkpoint.scala streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobGenerator.scala	2013-12-31 18:23:14 -05:00
Patrick Wendell	55b7e2fdff	Merge pull request #289 from tdas/filestream-fix Bug fixes for file input stream and checkpointing - Fixed bugs in the file input stream that led the stream to fail due to transient HDFS errors (listing files when a background thread is deleting fails caused errors, etc.) - Updated Spark's CheckpointRDD and Streaming's CheckpointWriter to use SparkContext.hadoopConfiguration, to allow checkpoints to be written to any HDFS compatible store requiring special configuration. - Changed the API of SparkContext.setCheckpointDir() - eliminated the unnecessary 'useExisting' parameter. Now SparkContext will always create a unique subdirectory within the user specified checkpoint directory. This is to ensure that previous checkpoint files are not accidentally overwritten. - Fixed bug where setting checkpoint directory as a relative local path caused the checkpointing to fail.	2013-12-31 10:12:51 -08:00
Matei Zaharia	0fa5809768	Updated docs for SparkConf and handled review comments	2013-12-30 22:17:28 -05:00
Matei Zaharia	994f080f8a	Properly show Spark properties on web UI, and change app name property	2013-12-29 22:19:33 -05:00
Matei Zaharia	eaa8a68ff0	Fix some Python docs and make sure to unset SPARK_TESTING in Python tests so we don't get the test spark.conf on the classpath.	2013-12-29 20:15:07 -05:00
Matei Zaharia	b4ceed40d6	Merge remote-tracking branch 'origin/master' into conf2 Conflicts: core/src/main/scala/org/apache/spark/SparkContext.scala core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala core/src/main/scala/org/apache/spark/scheduler/TaskSchedulerImpl.scala core/src/main/scala/org/apache/spark/scheduler/cluster/ClusterTaskSetManager.scala core/src/main/scala/org/apache/spark/scheduler/local/LocalScheduler.scala core/src/main/scala/org/apache/spark/util/MetadataCleaner.scala core/src/test/scala/org/apache/spark/scheduler/TaskResultGetterSuite.scala core/src/test/scala/org/apache/spark/scheduler/TaskSetManagerSuite.scala new-yarn/src/main/scala/org/apache/spark/deploy/yarn/Client.scala streaming/src/main/scala/org/apache/spark/streaming/Checkpoint.scala streaming/src/main/scala/org/apache/spark/streaming/api/java/JavaStreamingContext.scala streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobGenerator.scala streaming/src/test/scala/org/apache/spark/streaming/BasicOperationsSuite.scala streaming/src/test/scala/org/apache/spark/streaming/CheckpointSuite.scala streaming/src/test/scala/org/apache/spark/streaming/InputStreamsSuite.scala streaming/src/test/scala/org/apache/spark/streaming/TestSuiteBase.scala streaming/src/test/scala/org/apache/spark/streaming/WindowOperationsSuite.scala	2013-12-29 15:08:08 -05:00
Matei Zaharia	58c6fa2041	Add Python docs about SparkConf	2013-12-29 14:46:59 -05:00
Matei Zaharia	615fb649d6	Fix some other Python tests due to initializing JVM in a different way The test in context.py created two different instances of the SparkContext class by copying "globals", so that some tests can have a global "sc" object and others can try initializing their own contexts. This led to two JVM gateways being created since SparkConf also looked at pyspark.context.SparkContext to get the JVM.	2013-12-29 14:32:05 -05:00
Matei Zaharia	cd00225db9	Add SparkConf support in Python	2013-12-29 14:03:39 -05:00
Matei Zaharia	1c11f54a9b	Fix Python use of getLocalDir	2013-12-29 00:11:36 -05:00
Tor Myklebust	fec01664a7	Make Python function/line appear in the UI.	2013-12-28 23:34:16 -05:00
Matei Zaharia	c344ed04c7	Merge pull request #283 from tmyklebu/master Python bindings for mllib This pull request contains Python bindings for the regression, clustering, classification, and recommendation tools in mllib. For each 'train' frontend exposed, there is a Scala stub in PythonMLLibAPI.scala and a Python stub in mllib.py. The Python stub serialises the input RDD and any vector/matrix arguments into a mutually-understood format and calls the Scala stub. The Scala stub deserialises the RDD and the vector/matrix arguments, calls the appropriate 'train' function, serialises the resulting model, and returns the serialised model. ALSModel is slightly different since a MatrixFactorizationModel has RDDs inside. The Scala stub returns a handle to a Scala MatrixFactorizationModel; prediction is done by calling the Scala predict method. I have tested these bindings on an x86_64 machine running Linux. There is a risk that these bindings may fail on some choose-your-own-endian platform if Python's endian differs from java.nio.ByteBuffer's idea of the native byte order.	2013-12-26 01:31:06 -05:00
Tor Myklebust	9cbcf81453	Remove commented code in __init__.py.	2013-12-25 14:12:42 -05:00
Tor Myklebust	5e71354cb7	Fix copypasta in __init__.py. Don't import anything directly into pyspark.mllib.	2013-12-25 14:10:55 -05:00
Tor Myklebust	02208a175c	Initial weights in Scala are ones; do that too. Also fix some errors.	2013-12-25 00:53:48 -05:00
Tor Myklebust	05163057a1	Split the mllib bindings into a whole bunch of modules and rename some things.	2013-12-25 00:08:05 -05:00
Andrew Ash	3665c722b5	Typo: avaiable -> available	2013-12-24 17:25:04 -08:00
Tathagata Das	d4dfab503a	Fixed Python API for sc.setCheckpointDir. Also other fixes based on Reynold's comments on PR 289.	2013-12-24 14:01:13 -08:00
Tor Myklebust	86e38c4942	Remove useless line from test stub.	2013-12-24 16:49:31 -05:00
Tor Myklebust	4efec6eb94	Python change for move of PythonMLLibAPI.	2013-12-24 16:49:03 -05:00
Tor Myklebust	cbb2811189	Release JVM reference to the ALSModel when done.	2013-12-22 15:03:58 -05:00
Tor Myklebust	076fc16221	Python stubs for ALSModel.	2013-12-21 14:54:01 -05:00
Tor Myklebust	0b494c2167	Un-semicolon mllib.py.	2013-12-20 02:05:55 -05:00
Tor Myklebust	0a5cacb961	Change some docstrings and add some others.	2013-12-20 02:05:15 -05:00
Tor Myklebust	b835ddf3df	Licence notice.	2013-12-20 01:55:03 -05:00
Tor Myklebust	d89cc1e28a	Whitespace.	2013-12-20 01:50:42 -05:00
Tor Myklebust	319520b9bb	Remove gigantic endian-specific test and exception tests.	2013-12-20 01:48:44 -05:00
Tor Myklebust	2940201ad8	Tests for the Python side of the mllib bindings.	2013-12-20 01:33:32 -05:00
Tor Myklebust	73e17064c6	Python stubs for classification and clustering.	2013-12-20 00:12:48 -05:00
Tor Myklebust	2328bdd00f	Python side of python bindings for linear, Lasso, and ridge regression	2013-12-19 22:45:16 -05:00
Reynold Xin	7990c56375	Merge pull request #276 from shivaram/collectPartition Add collectPartition to JavaRDD interface. This interface is useful for implementing `take` from other language frontends where the data is serialized. Also remove `takePartition` from PythonRDD and use `collectPartition` in rdd.py. Thanks @concretevitamin for the original change and tests.	2013-12-19 13:35:09 -08:00
Shivaram Venkataraman	d3234f9726	Make collectPartitions take an array of partitions Change the implementation to use runJob instead of PartitionPruningRDD. Also update the unit tests and the python take implementation to use the new interface.	2013-12-19 11:40:34 -08:00
Nick Pentreath	a76f53416c	Add toString to Java RDD, and __repr__ to Python RDD	2013-12-19 14:38:20 +02:00
Tor Myklebust	bf20591a00	Incorporate most of Josh's style suggestions. I don't want to deal with the type and length checking errors until we've got at least one working stub that we're all happy with.	2013-12-19 03:40:57 -05:00
Tor Myklebust	bf491bb3c0	The rest of the Python side of those bindings.	2013-12-19 01:29:51 -05:00
Tor Myklebust	95915f8b3b	First cut at python mllib bindings. Only LinearRegression is supported.	2013-12-19 01:29:09 -05:00
Shivaram Venkataraman	af0cd6bd27	Add collectPartition to JavaRDD interface. Also remove takePartition from PythonRDD and use collectPartition in rdd.py.	2013-12-18 11:40:07 -08:00
Prashant Sharma	603af51bb5	Merge branch 'master' into akka-bug-fix Conflicts: core/pom.xml core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala pom.xml project/SparkBuild.scala streaming/pom.xml yarn/src/main/scala/org/apache/spark/deploy/yarn/YarnAllocationHandler.scala	2013-12-11 10:21:53 +05:30
Patrick Wendell	5b74609d97	License headers	2013-12-09 16:41:01 -08:00
Josh Rosen	3787f514d9	Fix UnicodeEncodeError in PySpark saveAsTextFile(). Fixes SPARK-970.	2013-11-28 23:44:56 -08:00
Prashant Sharma	17987778da	Merge branch 'master' into wip-scala-2.10 Conflicts: core/src/main/scala/org/apache/spark/api/python/PythonRDD.scala core/src/main/scala/org/apache/spark/rdd/MapPartitionsRDD.scala core/src/main/scala/org/apache/spark/rdd/MapPartitionsWithContextRDD.scala core/src/main/scala/org/apache/spark/rdd/RDD.scala python/pyspark/rdd.py	2013-11-27 14:44:12 +05:30
Josh Rosen	1b74a27da0	Removed unused basestring case from dump_stream.	2013-11-26 14:35:12 -08:00
Raymond Liu	0f2e3c6e31	Merge branch 'master' into scala-2.10	2013-11-13 16:55:11 +08:00
Josh Rosen	13122ceb8c	FramedSerializer: _dumps => dumps, _loads => loads.	2013-11-10 17:53:25 -08:00
Josh Rosen	ffa5bedf46	Send PySpark commands as bytes instead of strings.	2013-11-10 16:46:00 -08:00
Josh Rosen	cbb7f04aef	Add custom serializer support to PySpark. For now, this only adds MarshalSerializer, but it lays the groundwork for other supporting custom serializers. Many of these mechanisms can also be used to support deserialization of different data formats sent by Java, such as data encoded by MsgPack. This also fixes a bug in SparkContext.union().	2013-11-10 16:45:38 -08:00
Josh Rosen	7d68a81a8e	Remove Pickle-wrapping of Java objects in PySpark. If we support custom serializers, the Python worker will know what type of input to expect, so we won't need to wrap Tuple2 and Strings into pickled tuples and strings.	2013-11-03 11:03:02 -08:00
Josh Rosen	a48d88d206	Replace magic lengths with constants in PySpark. Write the length of the accumulators section up-front rather than terminating it with a negative length. I find this easier to read.	2013-11-03 10:54:24 -08:00
Ewen Cheslack-Postava	317a9eb1ce	Pass self to SparkContext._ensure_initialized. The constructor for SparkContext should pass in self so that we track the current context and produce errors if another one is created. Add a doctest to make sure creating multiple contexts triggers the exception.	2013-10-22 11:26:49 -07:00
Ewen Cheslack-Postava	56d230e614	Add classmethod to SparkContext to set system properties. Add a new classmethod to SparkContext to set system properties like is possible in Scala/Java. Unlike the Java/Scala implementations, there's no access to System until the JVM bridge is created. Since SparkContext handles that, move the initialization of the JVM connection to a separate classmethod that can safely be called repeatedly as long as the same instance (or no instance) is provided.	2013-10-22 00:22:37 -07:00
Ewen Cheslack-Postava	7eaa56de7f	Add an add() method to pyspark accumulators. Add a regular method for adding a term to accumulators in pyspark. Currently if you have a non-global accumulator, adding to it is awkward. The += operator can't be used for non-global accumulators captured via closure because it involves an assignment. The only way to do it is using __iadd__ directly. Adding this method lets you write code like this: def main(): sc = SparkContext() accum = sc.accumulator(0) rdd = sc.parallelize([1,2,3]) def f(x): accum.add(x) rdd.foreach(f) print accum.value where using accum += x instead would have caused UnboundLocalError exceptions in workers. Currently it would have to be written as accum.__iadd__(x).	2013-10-19 19:55:39 -07:00
Prashant Sharma	026ab75661	Merge branch 'master' of github.com:apache/incubator-spark into scala-2.10	2013-10-10 09:42:55 +05:30
Matei Zaharia	478b2b7edc	Fix PySpark docs and an overly long line of code after `fdbae41e`	2013-10-09 12:08:04 -07:00
Prashant Sharma	7be75682b9	Merge branch 'master' into wip-merge-master Conflicts: bagel/pom.xml core/pom.xml core/src/test/scala/org/apache/spark/ui/UISuite.scala examples/pom.xml mllib/pom.xml pom.xml project/SparkBuild.scala repl/pom.xml streaming/pom.xml tools/pom.xml In scala 2.10, a shorter representation is used for naming artifacts so changed to shorter scala version for artifacts and made it a property in pom.	2013-10-08 11:29:40 +05:30
Andre Schumacher	fdbae41e88	SPARK-705: implement sortByKey() in PySpark	2013-10-07 12:16:33 -07:00
Andre Schumacher	c84946fe21	Fixing SPARK-602: PythonPartitioner Currently PythonPartitioner determines partition ID by hashing a byte-array representation of PySpark's key. This PR lets PythonPartitioner use the actual partition ID, which is required e.g. for sorting via PySpark.	2013-10-04 11:56:47 -07:00
Prashant Sharma	5829692885	Merge branch 'master' into scala-2.10 Conflicts: core/src/main/scala/org/apache/spark/ui/jobs/JobProgressUI.scala docs/_config.yml project/SparkBuild.scala repl/src/main/scala/org/apache/spark/repl/SparkILoop.scala	2013-10-01 11:57:24 +05:30
shane-huang	84849baf88	Merge branch 'reorgscripts' into scripts-reorg	2013-09-27 09:28:33 +08:00
shane-huang	e8b1ee04fc	fix paths and change spark to use APP_MEM as application driver memory instead of SPARK_MEM, user should add application jars to SPARK_CLASSPATH Signed-off-by: shane-huang <shengsheng.huang@intel.com>	2013-09-26 17:08:47 +08:00
Patrick Wendell	6079721fa1	Update build version in master	2013-09-24 11:41:51 -07:00
shane-huang	1d53792a0a	add scripts in bin Signed-off-by: shane-huang <shengsheng.huang@intel.com>	2013-09-23 16:13:46 +08:00
shane-huang	dfbdc9ddb7	added spark-class and spark-executor to sbin Signed-off-by: shane-huang <shengsheng.huang@intel.com>	2013-09-23 11:28:58 +08:00
Prashant Sharma	383e151fd7	Merge branch 'master' of git://github.com/mesos/spark into scala-2.10 Conflicts: core/src/main/scala/org/apache/spark/SparkContext.scala project/SparkBuild.scala	2013-09-15 10:55:12 +05:30
Aaron Davidson	a3868544be	Whoopsy daisy	2013-09-08 00:30:47 -07:00
Aaron Davidson	c1cc8c4da2	Export StorageLevel and refactor	2013-09-07 14:41:31 -07:00
Aaron Davidson	8001687af5	Remove reflection, hard-code StorageLevels The sc.StorageLevel -> StorageLevel pathway is a bit janky, but otherwise the shell would have to call a private method of SparkContext. Having StorageLevel available in sc also doesn't seem like the end of the world. There may be a better solution, though. As for creating the StorageLevel object itself, this seems to be the best way in Python 2 for creating singleton, enum-like objects: http://stackoverflow.com/questions/36932/how-can-i-represent-an-enum-in-python	2013-09-07 09:34:07 -07:00
Aaron Davidson	b8a0b6ea5e	Memoize StorageLevels read from JVM	2013-09-06 15:36:04 -07:00
Prashant Sharma	4106ae9fbf	Merged with master	2013-09-06 17:53:01 +05:30
Aaron Davidson	a63d4c7dc2	SPARK-660: Add StorageLevel support in Python It uses reflection... I am not proud of that fact, but it at least ensures compatibility (sans refactoring of the StorageLevel stuff).	2013-09-05 23:36:27 -07:00
Matei Zaharia	12b2f1f9c9	Add missing license headers found with RAT	2013-09-02 12:23:03 -07:00
Matei Zaharia	2ba695292a	Exclude some private modules in epydoc	2013-09-02 12:22:52 -07:00
Matei Zaharia	141f54279e	Further fixes to get PySpark to work on Windows	2013-09-02 01:19:29 +00:00
Matei Zaharia	6550e5e60c	Allow PySpark to launch worker.py directly on Windows	2013-09-01 18:06:15 -07:00
Matei Zaharia	0a8cc30921	Move some classes to more appropriate packages: * RDD, RDDFunctions -> org.apache.spark.rdd Utils, ClosureCleaner, SizeEstimator -> org.apache.spark.util * JavaSerializer, KryoSerializer -> org.apache.spark.serializer	2013-09-01 14:13:16 -07:00
Matei Zaharia	bbaa9d7d6e	Add banner to PySpark and make wordcount output nicer	2013-09-01 14:13:16 -07:00
Matei Zaharia	46eecd110a	Initial work to rename package to org.apache.spark	2013-09-01 14:13:13 -07:00
Matei Zaharia	6edef9c833	Merge pull request #861 from AndreSchumacher/pyspark_sampling_function Pyspark sampling function	2013-08-31 13:39:24 -07:00
Matei Zaharia	fd89835965	Merge pull request #870 from JoshRosen/spark-885 Don't send SIGINT / ctrl-c to Py4J gateway subprocess	2013-08-31 13:18:12 -07:00
Matei Zaharia	618f0ecb43	Merge pull request #869 from AndreSchumacher/subtract PySpark: implementing subtractByKey(), subtract() and keyBy()	2013-08-30 18:17:13 -07:00
Andre Schumacher	96571c2524	PySpark: replacing class manifest by class tag for Scala 2.10.2 inside rdd.py	2013-08-30 15:00:42 -07:00
Matei Zaharia	ab0e625d9e	Fix PySpark for assembly run and include it in dist	2013-08-29 21:19:06 -07:00


222 commits
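
Commit 7eaa56de7f above explains that `accum += x` fails inside a closure while `accum.add(x)` works. The sketch below reproduces that behaviour without Spark. `Accum` is a hypothetical stand-in written for this illustration, not the PySpark accumulator class, and the code targets Python 3 rather than the Python 2 syntax quoted in the commit message.

```python
class Accum:
    """Hypothetical stand-in for an accumulator (not the PySpark class)."""

    def __init__(self, value):
        self.value = value

    def __iadd__(self, term):
        # In-place add; `accum += x` calls this and then rebinds `accum`.
        self.value += term
        return self

    def add(self, term):
        # Plain method call: the caller never assigns to its own name.
        self.value += term


def run_with_add(items):
    accum = Accum(0)

    def f(x):
        accum.add(x)  # only reads the closed-over name

    for item in items:
        f(item)
    return accum.value


def run_with_iadd(items):
    accum = Accum(0)

    def f(x):
        # The augmented assignment makes `accum` a local of f,
        # so reading it before assignment raises UnboundLocalError.
        accum += x

    for item in items:
        f(item)
    return accum.value


if __name__ == "__main__":
    print(run_with_add([1, 2, 3]))
    try:
        run_with_iadd([1, 2, 3])
    except UnboundLocalError as exc:
        print("UnboundLocalError:", exc)
```

The failure comes from Python's scoping rules rather than from Spark: any assignment to a name inside a function, including `+=`, makes that name local to the function unless it is declared `nonlocal` or `global`.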