ODIn/spark-instrumented-optimizer

Author	SHA1	Message	Date
Matei Zaharia	d3525babee	Merge pull request #813 from AndreSchumacher/add_files_pyspark Implementing SPARK-865: Add the equivalent of ADD_JARS to PySpark	2013-08-12 21:02:39 -07:00
Andre Schumacher	8fd5c7bc00	Implementing SPARK-865: Add the equivalent of ADD_JARS to PySpark. Now ADD_FILES uses a comma as file name separator.	2013-08-12 20:22:52 -07:00
Josh Rosen	b95732632b	Do not inherit master's PYTHONPATH on workers. This fixes SPARK-832, an issue where PySpark would not work when the master and workers used different SPARK_HOME paths. This change may potentially break code that relied on the master's PYTHONPATH being used on workers. To have custom PYTHONPATH additions used on the workers, users should set a custom PYTHONPATH in spark-env.sh rather than setting it in the shell.	2013-07-29 22:08:57 -07:00
Matei Zaharia	feba7ee540	SPARK-815. Python parallelize() should split lists before batching. One unfortunate consequence of this fix is that we materialize any collections that are given to us as generators, but this seems necessary to get reasonable behavior on small collections. We could add a batchSize parameter later to bypass auto-computation of batch size if this becomes a problem (e.g. if users really want to parallelize big generators nicely)	2013-07-29 02:51:43 -04:00
Matei Zaharia	d75c308695	Use None instead of empty string as it's slightly smaller/faster	2013-07-29 02:51:43 -04:00
Matei Zaharia	b5ec355622	Optimize Python foreach() to not return as many objects	2013-07-29 02:51:43 -04:00
Matei Zaharia	b9d6783f36	Optimize Python take() to not compute entire first partition	2013-07-29 02:51:43 -04:00
Matei Zaharia	af3c9d5042	Add Apache license headers and LICENSE and NOTICE files	2013-07-16 17:21:33 -07:00
root	ec31e68d5d	Fixed PySpark perf regression by not using socket.makefile(), and improved debuggability by letting "print" statements show up in the executor's stderr. Conflicts: core/src/main/scala/spark/api/python/PythonRDD.scala	2013-07-01 06:26:31 +00:00
Jey Kottalam	c75bed0eeb	Fix reporting of PySpark exceptions	2013-06-21 12:14:16 -04:00
Jey Kottalam	7c5ff733ee	PySpark daemon: fix deadlock, improve error handling	2013-06-21 12:14:16 -04:00
Jey Kottalam	62c4781400	Add tests and fixes for Python daemon shutdown	2013-06-21 12:14:16 -04:00
Jey Kottalam	c79a6078c3	Prefork Python worker processes	2013-06-21 12:14:16 -04:00
Jey Kottalam	40afe0d2a5	Add Python timing instrumentation	2013-06-21 12:14:16 -04:00
Jey Kottalam	9a731f5a6d	Fix Python saveAsTextFile doctest to not expect order to be preserved	2013-04-02 11:59:20 -07:00
Josh Rosen	2c966c98fb	Change numSplits to numPartitions in PySpark.	2013-02-24 13:25:09 -08:00
Mark Hamstra	b7a1fb5c5d	Add commutative requirement for 'reduce' to Python docstring.	2013-02-09 12:14:11 -08:00
Josh Rosen	e61729113d	Remove unnecessary doctest __main__ methods.	2013-02-03 21:29:40 -08:00
Josh Rosen	8fbd5380b7	Fetch fewer objects in PySpark's take() method.	2013-02-03 06:44:49 +00:00
Josh Rosen	2415c18f48	Fix reporting of PySpark doctest failures.	2013-02-03 06:44:11 +00:00
Josh Rosen	e211f405bc	Use spark.local.dir for PySpark temp files (SPARK-580).	2013-02-01 11:50:27 -08:00
Josh Rosen	9cc6ff9c4e	Do not launch JavaGateways on workers (SPARK-674). The problem was that the gateway was being initialized whenever the pyspark.context module was loaded. The fix uses lazy initialization that occurs only when SparkContext instances are actually constructed. I also made the gateway and jvm variables private. This change results in ~3-4x performance improvement when running the PySpark unit tests.	2013-02-01 11:13:10 -08:00
Josh Rosen	57b64d0d19	Fix stdout redirection in PySpark.	2013-02-01 00:25:19 -08:00
Patrick Wendell	3446d5c8d6	SPARK-673: Capture and re-throw Python exceptions. This patch alters the Python <-> executor protocol to pass on exception data when they occur in user Python code.	2013-01-31 18:06:11 -08:00
Matei Zaharia	55327a283e	Merge pull request #430 from pwendell/pyspark-guide Minor improvements to PySpark docs	2013-01-30 15:35:29 -08:00
Patrick Wendell	3f945e3b83	Make module help available in python shell. Also, adds a line in doc explaining how to use.	2013-01-30 15:04:06 -08:00
Stephen Haberman	7dfb82a992	Replace old 'master' term with 'driver'.	2013-01-25 11:03:00 -06:00
Matei Zaharia	a2f4891d1d	Merge pull request #396 from JoshRosen/spark-653 Make PySpark AccumulatorParam an abstract base class	2013-01-24 13:05:03 -08:00
Josh Rosen	b47d054cfc	Remove use of abc.ABCMeta due to cloudpickle issue. cloudpickle runs into issues while pickling subclasses of AccumulatorParam, which may be related to this Python issue: http://bugs.python.org/issue7689 This seems hard to fix and the ABCMeta wasn't necessary, so I removed it.	2013-01-23 11:47:27 -08:00
Josh Rosen	ae2ed2947d	Allow PySpark's SparkFiles to be used from driver. Fix minor documentation formatting issues.	2013-01-23 10:58:50 -08:00
Josh Rosen	35168d9c89	Fix sys.path bug in PySpark SparkContext.addPyFile	2013-01-22 17:54:11 -08:00
Josh Rosen	c75ae3622e	Make AccumulatorParam an abstract base class.	2013-01-21 22:32:57 -08:00
Josh Rosen	ef711902c1	Don't download files to master's working directory. This should avoid exceptions caused by existing files with different contents. I also removed some unused code.	2013-01-21 17:34:17 -08:00
Matei Zaharia	c7b5e5f1ec	Merge pull request #389 from JoshRosen/python_rdd_checkpointing Add checkpointing to the Python API	2013-01-20 17:10:44 -08:00
Josh Rosen	9f211dd3f0	Fix PythonPartitioner equality; see SPARK-654. PythonPartitioner did not take the Python-side partitioning function into account when checking for equality, which might cause problems in the future.	2013-01-20 15:41:42 -08:00
Josh Rosen	00d70cd660	Clean up setup code in PySpark checkpointing tests	2013-01-20 15:38:11 -08:00
Josh Rosen	5b6ea9e9a0	Update checkpointing API docs in Python/Java.	2013-01-20 15:31:41 -08:00
Josh Rosen	d0ba80dc72	Add checkpointFile() and more tests to PySpark.	2013-01-20 13:59:45 -08:00
Josh Rosen	7ed1bf4b48	Add RDD checkpointing to Python API.	2013-01-20 13:19:19 -08:00
Josh Rosen	17035db159	Add __repr__ to Accumulator; fix bug in sc.accumulator	2013-01-20 11:58:57 -08:00
Matei Zaharia	a23ed25f3c	Add a class comment to Accumulator	2013-01-20 02:10:25 -08:00
Matei Zaharia	8e7f098a2c	Added accumulators to PySpark	2013-01-20 01:57:44 -08:00
Josh Rosen	49c74ba2af	Change PYSPARK_PYTHON_EXEC to PYSPARK_PYTHON.	2013-01-10 08:10:59 -08:00
Josh Rosen	b57dd0f160	Add mapPartitionsWithSplit() to PySpark.	2013-01-08 16:05:02 -08:00
Josh Rosen	33beba3965	Change PySpark RDD.take() to not call iterator().	2013-01-03 14:52:21 -08:00
Josh Rosen	ce9f1bbe20	Add `pyspark` script to replace the other scripts. Expand the PySpark programming guide.	2013-01-01 21:25:49 -08:00
Josh Rosen	b58340dbd9	Rename top-level 'pyspark' directory to 'python'	2013-01-01 15:05:00 -08:00

47 commits
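
Commit b7a1fb5c5d above documents a commutative requirement for `reduce`. The sketch below is plain Python, not PySpark's implementation, and `partitioned_reduce` is a hypothetical helper introduced only for illustration. It assumes that partial results from different partitions may be merged in any order, and shows that an operator like subtraction then gives order-dependent answers while addition does not.

```python
from functools import reduce


def partitioned_reduce(op, partitions, order):
    """Reduce each partition locally, then merge the partials in `order`.

    Hypothetical helper: `order` stands in for the unpredictable order in
    which partial results could arrive from different partitions.
    """
    partials = [reduce(op, part) for part in partitions]
    return reduce(op, [partials[i] for i in order])


def add(a, b):
    return a + b


def sub(a, b):
    return a - b


partitions = [[1, 2], [3, 4, 5], [6]]

# Addition is commutative and associative: merge order does not matter.
add_in_order = partitioned_reduce(add, partitions, [0, 1, 2])
add_shuffled = partitioned_reduce(add, partitions, [2, 0, 1])

# Subtraction is neither: the result depends on the merge order.
sub_in_order = partitioned_reduce(sub, partitions, [0, 1, 2])
sub_shuffled = partitioned_reduce(sub, partitions, [2, 0, 1])

print(add_in_order == add_shuffled)  # True
print(sub_in_order == sub_shuffled)  # False
```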