ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Ahir Reddy	59b1379594	SPARK-1114: Allow PySpark to use existing JVM and Gateway. Patch to allow PySpark to use existing JVM and Gateway. Changes to PySpark implementation of SparkConf to take existing SparkConf JVM handle. Change to PySpark SparkContext to allow subclass specific context initialization. Author: Ahir Reddy <ahirreddy@gmail.com> Closes #622 from ahirreddy/pyspark-existing-jvm and squashes the following commits: a86f457 [Ahir Reddy] Patch to allow PySpark to use existing JVM and Gateway. Changes to PySpark implementation of SparkConf to take existing SparkConf JVM handle. Change to PySpark SparkContext to allow subclass specific context initialization.	2014-02-20 21:20:39 -08:00
Josh Rosen	1381fc72f7	Switch from MUTF8 to UTF8 in PySpark serializers. This fixes SPARK-1043, a bug introduced in 0.9.0 where PySpark couldn't serialize strings > 64kB. This fix was written by @tyro89 and @bouk in #512. This commit squashes and rebases their pull request in order to fix some merge conflicts.	2014-01-28 20:20:08 -08:00
Matei Zaharia	7e8d2e8a5c	Fix Python code after change of getOrElse	2014-01-01 23:21:34 -05:00
Matei Zaharia	ba9338f104	Merge remote-tracking branch 'apache/master' into conf2. Conflicts: core/src/main/scala/org/apache/spark/rdd/CheckpointRDD.scala, streaming/src/main/scala/org/apache/spark/streaming/Checkpoint.scala, streaming/src/main/scala/org/apache/spark/streaming/scheduler/JobGenerator.scala	2013-12-31 18:23:14 -05:00
Matei Zaharia	0fa5809768	Updated docs for SparkConf and handled review comments	2013-12-30 22:17:28 -05:00
Matei Zaharia	994f080f8a	Properly show Spark properties on web UI, and change app name property	2013-12-29 22:19:33 -05:00
Matei Zaharia	eaa8a68ff0	Fix some Python docs and make sure to unset SPARK_TESTING in Python tests so we don't get the test spark.conf on the classpath.	2013-12-29 20:15:07 -05:00
Matei Zaharia	58c6fa2041	Add Python docs about SparkConf	2013-12-29 14:46:59 -05:00
Matei Zaharia	615fb649d6	Fix some other Python tests due to initializing JVM in a different way. The test in context.py created two different instances of the SparkContext class by copying "globals", so that some tests can have a global "sc" object and others can try initializing their own contexts. This led to two JVM gateways being created since SparkConf also looked at pyspark.context.SparkContext to get the JVM.	2013-12-29 14:32:05 -05:00
Matei Zaharia	cd00225db9	Add SparkConf support in Python	2013-12-29 14:03:39 -05:00
Matei Zaharia	1c11f54a9b	Fix Python use of getLocalDir	2013-12-29 00:11:36 -05:00
Tathagata Das	d4dfab503a	Fixed Python API for sc.setCheckpointDir. Also other fixes based on Reynold's comments on PR 289.	2013-12-24 14:01:13 -08:00
Shivaram Venkataraman	af0cd6bd27	Add collectPartition to JavaRDD interface. Also remove takePartition from PythonRDD and use collectPartition in rdd.py.	2013-12-18 11:40:07 -08:00
Josh Rosen	13122ceb8c	FramedSerializer: _dumps => dumps, _loads => loads.	2013-11-10 17:53:25 -08:00
Josh Rosen	cbb7f04aef	Add custom serializer support to PySpark. For now, this only adds MarshalSerializer, but it lays the groundwork for other supporting custom serializers. Many of these mechanisms can also be used to support deserialization of different data formats sent by Java, such as data encoded by MsgPack. This also fixes a bug in SparkContext.union().	2013-11-10 16:45:38 -08:00
Josh Rosen	7d68a81a8e	Remove Pickle-wrapping of Java objects in PySpark. If we support custom serializers, the Python worker will know what type of input to expect, so we won't need to wrap Tuple2 and Strings into pickled tuples and strings.	2013-11-03 11:03:02 -08:00
Ewen Cheslack-Postava	317a9eb1ce	Pass self to SparkContext._ensure_initialized. The constructor for SparkContext should pass in self so that we track the current context and produce errors if another one is created. Add a doctest to make sure creating multiple contexts triggers the exception.	2013-10-22 11:26:49 -07:00
Ewen Cheslack-Postava	56d230e614	Add classmethod to SparkContext to set system properties. Add a new classmethod to SparkContext to set system properties like is possible in Scala/Java. Unlike the Java/Scala implementations, there's no access to System until the JVM bridge is created. Since SparkContext handles that, move the initialization of the JVM connection to a separate classmethod that can safely be called repeatedly as long as the same instance (or no instance) is provided.	2013-10-22 00:22:37 -07:00
Aaron Davidson	a3868544be	Whoopsy daisy	2013-09-08 00:30:47 -07:00
Aaron Davidson	c1cc8c4da2	Export StorageLevel and refactor	2013-09-07 14:41:31 -07:00
Aaron Davidson	8001687af5	Remove reflection, hard-code StorageLevels. The sc.StorageLevel -> StorageLevel pathway is a bit janky, but otherwise the shell would have to call a private method of SparkContext. Having StorageLevel available in sc also doesn't seem like the end of the world. There may be a better solution, though. As for creating the StorageLevel object itself, this seems to be the best way in Python 2 for creating singleton, enum-like objects: http://stackoverflow.com/questions/36932/how-can-i-represent-an-enum-in-python	2013-09-07 09:34:07 -07:00
Aaron Davidson	b8a0b6ea5e	Memoize StorageLevels read from JVM	2013-09-06 15:36:04 -07:00
Aaron Davidson	a63d4c7dc2	SPARK-660: Add StorageLevel support in Python. It uses reflection... I am not proud of that fact, but it at least ensures compatibility (sans refactoring of the StorageLevel stuff).	2013-09-05 23:36:27 -07:00
Matei Zaharia	0a8cc30921	Move some classes to more appropriate packages: RDD, RDDFunctions -> org.apache.spark.rdd; Utils, ClosureCleaner, SizeEstimator -> org.apache.spark.util; JavaSerializer, KryoSerializer -> org.apache.spark.serializer	2013-09-01 14:13:16 -07:00
Matei Zaharia	46eecd110a	Initial work to rename package to org.apache.spark	2013-09-01 14:13:13 -07:00
Andre Schumacher	c7e348faec	Implementing SPARK-878 for PySpark: adding zip and egg files to context and passing it down to workers which add these to their sys.path	2013-08-16 11:58:20 -07:00
Matei Zaharia	feba7ee540	SPARK-815. Python parallelize() should split lists before batching. One unfortunate consequence of this fix is that we materialize any collections that are given to us as generators, but this seems necessary to get reasonable behavior on small collections. We could add a batchSize parameter later to bypass auto-computation of batch size if this becomes a problem (e.g. if users really want to parallelize big generators nicely).	2013-07-29 02:51:43 -04:00
Matei Zaharia	af3c9d5042	Add Apache license headers and LICENSE and NOTICE files	2013-07-16 17:21:33 -07:00
Josh Rosen	2415c18f48	Fix reporting of PySpark doctest failures.	2013-02-03 06:44:11 +00:00
Josh Rosen	e211f405bc	Use spark.local.dir for PySpark temp files (SPARK-580).	2013-02-01 11:50:27 -08:00
Josh Rosen	9cc6ff9c4e	Do not launch JavaGateways on workers (SPARK-674). The problem was that the gateway was being initialized whenever the pyspark.context module was loaded. The fix uses lazy initialization that occurs only when SparkContext instances are actually constructed. I also made the gateway and jvm variables private. This change results in ~3-4x performance improvement when running the PySpark unit tests.	2013-02-01 11:13:10 -08:00
Matei Zaharia	a2f4891d1d	Merge pull request #396 from JoshRosen/spark-653. Make PySpark AccumulatorParam an abstract base class	2013-01-24 13:05:03 -08:00
Josh Rosen	ae2ed2947d	Allow PySpark's SparkFiles to be used from driver. Fix minor documentation formatting issues.	2013-01-23 10:58:50 -08:00
Josh Rosen	35168d9c89	Fix sys.path bug in PySpark SparkContext.addPyFile	2013-01-22 17:54:11 -08:00
Josh Rosen	c75ae3622e	Make AccumulatorParam an abstract base class.	2013-01-21 22:32:57 -08:00
Josh Rosen	ef711902c1	Don't download files to master's working directory. This should avoid exceptions caused by existing files with different contents. I also removed some unused code.	2013-01-21 17:34:17 -08:00
Josh Rosen	5b6ea9e9a0	Update checkpointing API docs in Python/Java.	2013-01-20 15:31:41 -08:00
Josh Rosen	d0ba80dc72	Add checkpointFile() and more tests to PySpark.	2013-01-20 13:59:45 -08:00
Josh Rosen	7ed1bf4b48	Add RDD checkpointing to Python API.	2013-01-20 13:19:19 -08:00
Matei Zaharia	8e7f098a2c	Added accumulators to PySpark	2013-01-20 01:57:44 -08:00
Josh Rosen	49c74ba2af	Change PYSPARK_PYTHON_EXEC to PYSPARK_PYTHON.	2013-01-10 08:10:59 -08:00
Josh Rosen	33beba3965	Change PySpark RDD.take() to not call iterator().	2013-01-03 14:52:21 -08:00
Josh Rosen	b58340dbd9	Rename top-level 'pyspark' directory to 'python'	2013-01-01 15:05:00 -08:00

43 commits