Commit graph

3115 commits

Prashant Sharma e508f599f8 [SPARK-2108] Mark SparkContext methods that return block information as developer API's
Author: Prashant Sharma <prashant.s@imaginea.com>

Closes #1047 from ScrapCodes/SPARK-2108/mark-as-dev-api and squashes the following commits:

073ee34 [Prashant Sharma] [SPARK-2108] Mark SparkContext methods that return block information as developer API's
2014-06-11 10:49:34 -07:00
Tathagata Das 4823bf470e [SPARK-1940] Enabling rolling of executor logs, and automatic cleanup of old executor logs
Currently, in the default log4j configuration, all the executor logs get sent to the file `[executor-working-dir]/stderr`. This does not allow log files to be rolled, so old logs cannot be removed.

Using log4j's RollingFileAppender allows log4j logs to be rolled, but all the logs get sent to a different set of files than `stdout` and `stderr`. So the logs are no longer visible in the Spark web UI, as the web UI only reads the files `stdout` and `stderr`. Furthermore, it still does not allow `stdout` and `stderr` to be cleared periodically in case a large amount of output gets written to them (e.g. by an explicit `println` inside a map function).

This PR solves this by implementing a simple `RollingFileAppender` within Spark (disabled by default). When enabled (using configuration parameter `spark.executor.rollingLogs.enabled`), the logs can get rolled over either by time interval (set with `spark.executor.rollingLogs.interval`, set to daily by default), or by size of logs (set with  `spark.executor.rollingLogs.size`). Finally, old logs can be automatically deleted by specifying how many of the latest log files to keep (set with `spark.executor.rollingLogs.keepLastN`).  The web UI has also been modified to show the logs across the rolled-over files.
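For illustration, a minimal sketch of applying these settings through a `SparkConf`, using the configuration keys exactly as written in this description (the commit list notes the parameter names were changed before merging, so treat these keys as placeholders):

```scala
import org.apache.spark.SparkConf

// Hypothetical keys taken verbatim from this description; the merged names may differ.
val conf = new SparkConf()
  .set("spark.executor.rollingLogs.enabled", "true")    // rolling is disabled by default
  .set("spark.executor.rollingLogs.interval", "daily")  // or roll by size via ...rollingLogs.size
  .set("spark.executor.rollingLogs.keepLastN", "5")     // keep only the 5 newest rolled files
```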

You can test this locally (without waiting a whole day) by setting the configuration `spark.executor.rollingLogs.enabled=true` and `spark.executor.rollingLogs.interval=minutely`. Continuously generate logs by running Spark jobs, and the generated log files will look like this (`stderr` and `stdout` are the most recent log files, the ones currently being written to).

```
stderr
stderr--2014-05-27--14-37
stderr--2014-05-27--14-47
stderr--2014-05-27--15-05
stdout
stdout--2014-05-27--14-47
```

The web UI should show logs across these files.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #895 from tdas/rolling-logs and squashes the following commits:

fd8f87f [Tathagata Das] Minor change.
d326aee [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into rolling-logs
ad956c1 [Tathagata Das] Scala style fix.
1f0a6ec [Tathagata Das] Some more changes based on Patrick's PR comments.
c8bfe4e [Tathagata Das] Refactored FileAppender into the package spark.util.logging and broke up the file into multiple files. Changed configuration parameter names.
4224409 [Tathagata Das] Style fix.
108a9f8 [Tathagata Das] Added better constraint handling for rolling policies.
f7da977 [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into rolling-logs
9134495 [Tathagata Das] Simplified rolling logs by removing Daily/Hourly/MinutelyRollingFileAppender, and removing the setting rollingLogs.enabled
312d874 [Tathagata Das] Minor fixes based on PR comments.
8a67d83 [Tathagata Das] Fixed comments.
b36cfd6 [Tathagata Das] Implemented RollingPolicy, TimeBasedRollingPolicy and SizeBasedRollingPolicy, and changed RollingFileAppender accordingly.
b7e8272 [Tathagata Das] Style fix,
374c9a9 [Tathagata Das] Added missing license.
24354ea [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into rolling-logs
6cc09c7 [Tathagata Das] Fixed bugs in rolling logs, and added more debug statements.
adf4910 [Tathagata Das] Merge remote-tracking branch 'apache-github/master' into rolling-logs
931f8fb [Tathagata Das] Changed log viewer in Spark web UI to handle rolling log files.
cb4fb6d [Tathagata Das] Added FileAppender and RollingFileAppender to generate rolling executor logs.
2014-06-10 20:22:02 -07:00
Nick Pentreath f971d6cb60 SPARK-1416: PySpark support for SequenceFile and Hadoop InputFormats
So I finally resurrected this PR. It seems the old one against the incubator mirror is no longer available, so I cannot reference it.

This adds initial support for reading Hadoop ```SequenceFile```s, as well as arbitrary Hadoop ```InputFormat```s, in PySpark.

# Overview
The basics are as follows:
1. ```PythonRDD``` object contains the relevant methods, that are in turn invoked by ```SparkContext``` in PySpark
2. The SequenceFile or InputFormat is read on the Scala side and converted from ```Writable``` instances to the relevant Scala classes (in the case of primitives)
3. Pyrolite is used to serialize Java objects. If this fails, the fallback is ```toString```
4. ```PickleSerializer``` on the Python side deserializes.

This works "out the box" for simple ```Writable```s:
* ```Text```
* ```IntWritable```, ```DoubleWritable```, ```FloatWritable```
* ```NullWritable```
* ```BooleanWritable```
* ```BytesWritable```
* ```MapWritable```

It also works for simple, "struct-like" classes. Due to the way Pyrolite works, this requires that the classes satisfy the JavaBeans conventions (i.e. fields with a no-arg constructor and getters/setters). (Perhaps in the future some sugar for case classes and reflection could be added.)
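As a rough illustration of the "struct-like" requirement, a class that Pyrolite should be able to handle would look roughly like this in Scala (the class name and fields below are made up, not from this PR):

```scala
// Hypothetical JavaBeans-style class: no-arg constructor plus getters/setters,
// which is what Pyrolite's reflection-based conversion expects.
class SensorReading {
  private var id: Int = 0
  private var value: Double = 0.0

  def getId: Int = id
  def setId(newId: Int): Unit = { id = newId }
  def getValue: Double = value
  def setValue(newValue: Double): Unit = { value = newValue }
}
```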

I've tested it out with ```ESInputFormat```  as an example and it works very nicely:
```python
conf = {"es.resource" : "index/type" }
rdd = sc.newAPIHadoopRDD("org.elasticsearch.hadoop.mr.EsInputFormat", "org.apache.hadoop.io.NullWritable", "org.elasticsearch.hadoop.mr.LinkedMapWritable", conf=conf)
rdd.first()
```

I suspect for things like HBase/Cassandra it will be a bit trickier to get it to work out of the box.

# Some things still outstanding:
1. ~~Requires ```msgpack-python``` and will fail without it. As originally discussed with Josh, add a ```as_strings``` argument that defaults to ```False```, that can be used if ```msgpack-python``` is not available~~
2. ~~I see from https://github.com/apache/spark/pull/363 that Pyrolite is being used there for SerDe between Scala and Python. @ahirreddy @mateiz what is the plan behind this - is Pyrolite preferred? It seems from a cursory glance that adapting the ```msgpack```-based SerDe here to use Pyrolite wouldn't be too hard~~
3. ~~Support the key and value "wrapper" that would allow a Scala/Java function to be plugged in that would transform whatever the key/value Writable class is into something that can be serialized (e.g. convert some custom Writable to a JavaBean or ```java.util.Map``` that can be easily serialized)~~
4. Support ```saveAsSequenceFile``` and ```saveAsHadoopFile``` etc. This would require SerDe in the reverse direction, that can be handled by Pyrolite. Will work on this as a separate PR

Author: Nick Pentreath <nick.pentreath@gmail.com>

Closes #455 from MLnick/pyspark-inputformats and squashes the following commits:

268df7e [Nick Pentreath] Documentation changes per @pwendell comments
761269b [Nick Pentreath] Address @pwendell comments, simplify default writable conversions and remove registry.
4c972d8 [Nick Pentreath] Add license headers
d150431 [Nick Pentreath] Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
cde6af9 [Nick Pentreath] Parameterize converter trait
5ebacfa [Nick Pentreath] Update docs for PySpark input formats
a985492 [Nick Pentreath] Move Converter examples to own package
365d0be [Nick Pentreath] Make classes private[python]. Add docs and @Experimental annotation to Converter interface.
eeb8205 [Nick Pentreath] Fix path relative to SPARK_HOME in tests
1eaa08b [Nick Pentreath] HBase -> Cassandra app name oversight
3f90c3e [Nick Pentreath] Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
2c18513 [Nick Pentreath] Add examples for reading HBase and Cassandra InputFormats from Python
b65606f [Nick Pentreath] Add converter interface
5757f6e [Nick Pentreath] Default key/value classes for sequenceFile are None
085b55f [Nick Pentreath] Move input format tests to tests.py and clean up docs
43eb728 [Nick Pentreath] PySpark InputFormats docs into programming guide
94beedc [Nick Pentreath] Clean up args in PythonRDD. Set key/value converter defaults to None for PySpark context.py methods
1a4a1d6 [Nick Pentreath] Address @mateiz style comments
01e0813 [Nick Pentreath] Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
15a7d07 [Nick Pentreath] Remove default args for key/value classes. Arg names to camelCase
9fe6bd5 [Nick Pentreath] Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
84fe8e3 [Nick Pentreath] Python programming guide space formatting
d0f52b6 [Nick Pentreath] Python programming guide
7caa73a [Nick Pentreath] Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
93ef995 [Nick Pentreath] Add back context.py changes
9ef1896 [Nick Pentreath] Recover earlier changes lost in previous merge for serializers.py
077ecb2 [Nick Pentreath] Recover earlier changes lost in previous merge for context.py
5af4770 [Nick Pentreath] Merge branch 'master' into pyspark-inputformats
35b8e3a [Nick Pentreath] Another fix for test ordering
bef3afb [Nick Pentreath] Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
e001b94 [Nick Pentreath] Fix test failures due to ordering
78978d9 [Nick Pentreath] Add doc for SequenceFile and InputFormat support to Python programming guide
64eb051 [Nick Pentreath] Scalastyle fix
e7552fa [Nick Pentreath] Merge branch 'master' into pyspark-inputformats
44f2857 [Nick Pentreath] Remove msgpack dependency and switch serialization to Pyrolite, plus some clean up and refactoring
c0ebfb6 [Nick Pentreath] Change sequencefile test data generator to easily be called from PySpark tests
1d7c17c [Nick Pentreath] Amend tests to auto-generate sequencefile data in temp dir
17a656b [Nick Pentreath] remove binary sequencefile for tests
f60959e [Nick Pentreath] Remove msgpack dependency and serializer from PySpark
450e0a2 [Nick Pentreath] Merge branch 'master' into pyspark-inputformats
31a2fff [Nick Pentreath] Scalastyle fixes
fc5099e [Nick Pentreath] Add Apache license headers
4e08983 [Nick Pentreath] Clean up docs for PySpark context methods
b20ec7e [Nick Pentreath] Clean up merge duplicate dependencies
951c117 [Nick Pentreath] Merge branch 'master' into pyspark-inputformats
f6aac55 [Nick Pentreath] Bring back msgpack
9d2256e [Nick Pentreath] Merge branch 'master' into pyspark-inputformats
1bbbfb0 [Nick Pentreath] Clean up SparkBuild from merge
a67dfad [Nick Pentreath] Clean up Msgpack serialization and registering
7237263 [Nick Pentreath] Add back msgpack serializer and hadoop file code lost during merging
25da1ca [Nick Pentreath] Add generator for nulls, bools, bytes and maps
65360d5 [Nick Pentreath] Adding test SequenceFiles
0c612e5 [Nick Pentreath] Merge branch 'master' into pyspark-inputformats
d72bf18 [Nick Pentreath] msgpack
dd57922 [Nick Pentreath] Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
e67212a [Nick Pentreath] Add back msgpack dependency
f2d76a0 [Nick Pentreath] Merge branch 'master' into pyspark-inputformats
41856a5 [Nick Pentreath] Merge branch 'master' into pyspark-inputformats
97ef708 [Nick Pentreath] Remove old writeToStream
2beeedb [Nick Pentreath] Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
795a763 [Nick Pentreath] Change name to WriteInputFormatTestDataGenerator. Cleanup some var names. Use SPARK_HOME in path for writing test sequencefile data.
174f520 [Nick Pentreath] Add back graphx settings
703ee65 [Nick Pentreath] Add back msgpack
619c0fa [Nick Pentreath] Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
1c8efbc [Nick Pentreath] Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
eb40036 [Nick Pentreath] Remove unused comment lines
4d7ef2e [Nick Pentreath] Fix indentation
f1d73e3 [Nick Pentreath] mergeConfs returns a copy rather than mutating one of the input arguments
0f5cd84 [Nick Pentreath] Remove unused pair UTF8 class. Add comments to msgpack deserializer
4294cbb [Nick Pentreath] Add old Hadoop api methods. Clean up and expand comments. Clean up argument names
818a1e6 [Nick Pentreath] Add sequencefile and Hadoop InputFormat support to PythonRDD
4e7c9e3 [Nick Pentreath] Merge remote-tracking branch 'upstream/master' into pyspark-inputformats
c304cc8 [Nick Pentreath] Adding supporting sequencefiles for tests. Cleaning up
4b0a43f [Nick Pentreath] Refactoring utils into own objects. Cleaning up old commented-out code
d86325f [Nick Pentreath] Initial WIP of PySpark support for SequenceFile and arbitrary Hadoop InputFormat
2014-06-09 22:21:03 -07:00
Andrew Ash 35630c86ff SPARK-1944 Document --verbose in spark-shell -h
https://issues.apache.org/jira/browse/SPARK-1944

Author: Andrew Ash <andrew@andrewash.com>

Closes #1020 from ash211/SPARK-1944 and squashes the following commits:

a831c4d [Andrew Ash] SPARK-1944 Document --verbose in spark-shell -h
2014-06-09 10:21:21 -07:00
Andrew Ash 32ee9f0668 Grammar: read -> reads
Author: Andrew Ash <andrew@andrewash.com>

Closes #1016 from ash211/patch-6 and squashes the following commits:

e3865c8 [Andrew Ash] Grammar: read -> reads
2014-06-08 23:20:10 -07:00
Neville Li 15ddbef414 [SPARK-2067] use relative path for Spark logo in UI
Author: Neville Li <neville@spotify.com>

Closes #1006 from nevillelyh/gh/SPARK-2067 and squashes the following commits:

9ee64cf [Neville Li] [SPARK-2067] use relative path for Spark logo in UI
2014-06-08 23:18:27 -07:00
Reynold Xin 219dc00b30 SPARK-1628 follow up: Improve RangePartitioner's documentation.
Adding a paragraph clarifying a weird behavior in RangePartitioner.

See also #549.

Author: Reynold Xin <rxin@apache.org>

Closes #1012 from rxin/partitioner-doc and squashes the following commits:

6f0109e [Reynold Xin] SPARK-1628 follow up: Improve RangePartitioner's documentation.
2014-06-08 18:39:57 -07:00
zsxwing a71c6d1cf0 SPARK-1628: Add missing hashCode methods in Partitioner subclasses
JIRA: https://issues.apache.org/jira/browse/SPARK-1628

Added `hashCode` in HashPartitioner, RangePartitioner, PythonPartitioner and PageRankUtils.CustomPartitioner.

Author: zsxwing <zsxwing@gmail.com>

Closes #549 from zsxwing/SPARK-1628 and squashes the following commits:

2620936 [zsxwing] SPARK-1628: Add missing hashCode methods in Partitioner subclasses
2014-06-08 14:18:52 -07:00
Neville Li 7b877b2705 SPARK-2056 Set RDD name to input path
Author: Neville Li <neville@spotify.com>

Closes #992 from nevillelyh/master and squashes the following commits:

3011739 [Neville Li] [SPARK-2056] Set RDD name to input path
2014-06-07 16:22:26 -07:00
Matei Zaharia b45c13e7d7 SPARK-2043: ExternalAppendOnlyMap doesn't always find matching keys
The current implementation reads one key with the next hash code as it finishes reading the keys with the current hash code, which may cause it to miss some matches of the next key. This can cause operations like join to give the wrong result when reduce tasks spill to disk and there are hash collisions, as values won't be matched together. This PR fixes it by not reading in that next key, using a peeking iterator instead.
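A minimal, self-contained sketch of the peeking-iterator idea in plain Scala (not the actual ExternalAppendOnlyMap code): group entries by hash code without consuming the first entry of the next hash code.

```scala
// Entries sorted by hash code; the second element stands in for the value.
val it: BufferedIterator[(Int, String)] =
  Iterator((1, "a"), (1, "b"), (2, "c")).buffered

val currentHash = it.head._1
val currentGroup = scala.collection.mutable.ArrayBuffer.empty[(Int, String)]
while (it.hasNext && it.head._1 == currentHash) {
  currentGroup += it.next()   // consume only entries with the current hash code
}
// it.head is now (2, "c"): the next hash code's first entry is still unconsumed,
// so its key cannot be missed when the next group is read.
```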

Author: Matei Zaharia <matei@databricks.com>

Closes #986 from mateiz/spark-2043 and squashes the following commits:

0959514 [Matei Zaharia] Added unit test for having many hash collisions
892debb [Matei Zaharia] SPARK-2043: don't read a key with the next hash code in ExternalAppendOnlyMap, instead use a buffered iterator to only read values with the current hash code.
2014-06-05 23:01:48 -07:00
CrazyJvm 3d3f8c8004 Use pluggable clock in DAGScheduler #SPARK-2031
DAGScheduler now supports a pluggable clock, like TaskSetManager does.
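A small sketch of the pluggable-clock pattern being referred to (the names here are illustrative, not necessarily Spark's exact ones):

```scala
trait Clock {
  def getTime(): Long
}

// Production clock: real wall-clock time.
class SystemClock extends Clock {
  override def getTime(): Long = System.currentTimeMillis()
}

// Test clock: lets a test advance time deterministically, e.g. to trigger
// timeout handling in the scheduler without actually waiting.
class ManualClock(private var time: Long = 0L) extends Clock {
  override def getTime(): Long = time
  def advance(ms: Long): Unit = { time += ms }
}
```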

Author: CrazyJvm <crazyjvm@gmail.com>

Closes #976 from CrazyJvm/clock and squashes the following commits:

6779a4c [CrazyJvm] Use pluggable clock in DAGScheduler
2014-06-05 17:44:46 -07:00
CodingCat 89cdbb087c SPARK-1677: allow user to disable output dir existence checking
https://issues.apache.org/jira/browse/SPARK-1677

For compatibility with older versions of Spark it would be nice to have an option `spark.hadoop.validateOutputSpecs` (default true)  for the user to disable the output directory existence checking
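For example, a job that intentionally overwrites an existing output directory could opt out of the check like this (a sketch; the app name is illustrative):

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  .setAppName("overwrite-output-example")
  .set("spark.hadoop.validateOutputSpecs", "false")  // default "true": fail if the output dir exists
```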

Author: CodingCat <zhunansjtu@gmail.com>

Closes #947 from CodingCat/SPARK-1677 and squashes the following commits:

7930f83 [CodingCat] miao
c0c0e03 [CodingCat] bug fix and doc update
5318562 [CodingCat] bug fix
13219b5 [CodingCat] allow user to disable output dir existence checking
2014-06-05 11:39:35 -07:00
Colin McCabe 1765c8d0dd SPARK-1518: FileLogger: Fix compile against Hadoop trunk
In Hadoop trunk (currently Hadoop 3.0.0), the deprecated
FSDataOutputStream#sync() method has been removed.  Instead, we should
call FSDataOutputStream#hflush, which does the same thing as the
deprecated method used to do.
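A hedged sketch of the API change (the file path is illustrative): on Hadoop versions where `sync()` has been removed, the flush call becomes `hflush()`.

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val fs = FileSystem.get(new Configuration())
val out = fs.create(new Path("/tmp/spark-filelogger-example.log"))
out.writeBytes("some event\n")
out.hflush()   // previously the deprecated out.sync(); same effect
out.close()
```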

Author: Colin McCabe <cmccabe@cloudera.com>

Closes #898 from cmccabe/SPARK-1518 and squashes the following commits:

752b9d7 [Colin McCabe] FileLogger: Fix compile against Hadoop trunk
2014-06-04 15:56:29 -07:00
Sean Owen d341b17c2a SPARK-1973. Add randomSplit to JavaRDD (with tests, and tidy Java tests)
I'd like to use randomSplit through the Java API, and would like to add a convenience wrapper for this method to JavaRDD. This is fairly trivial. (In fact, is the intent that JavaRDD not wrap every RDD method, and that sometimes users should just use JavaRDD.wrapRDD()?)

Along the way, I added tests for it, and also touched up the Java API test style and behavior. This is maybe the more useful part of this small change.

Author: Sean Owen <sowen@cloudera.com>
Author: Xiangrui Meng <meng@databricks.com>

This patch had conflicts when merged, resolved by
Committer: Xiangrui Meng <meng@databricks.com>

Closes #919 from srowen/SPARK-1973 and squashes the following commits:

148cb7b [Sean Owen] Some final Java test polish, while we are at it
1fc3f3e [Xiangrui Meng] more cleaning on Java 8 tests
9ebc57f [Sean Owen] Use accumulator instead of temp files to test foreach
5efb0be [Sean Owen] Add Java randomSplit, and unit tests (including for sample)
5dcc158 [Sean Owen] Simplified Java 8 test with new language features, and fixed the name of MLB's greatest team
91a1769 [Sean Owen] Touch up minor style issues in existing Java API suite test
2014-06-04 11:27:08 -07:00
Kan Zhang c402a4a685 [SPARK-1817] RDD.zip() should verify partition sizes for each partition
RDD.zip() will throw an exception if it finds partition sizes are not the same.
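To illustrate the contract (assuming an existing `SparkContext` named `sc`): both RDDs must have the same number of partitions and the same number of elements in each partition.

```scala
val nums  = sc.parallelize(1 to 6, numSlices = 3)
val names = sc.parallelize(Seq("a", "b", "c", "d", "e", "f"), numSlices = 3)

nums.zip(names).collect()   // fine: partition sizes match (2, 2, 2)

// With this change, mismatched partition sizes fail with an exception instead
// of silently producing a wrong result:
// sc.parallelize(1 to 5, 3).zip(names).collect()
```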

Author: Kan Zhang <kzhang@apache.org>

Closes #944 from kanzhang/SPARK-1817 and squashes the following commits:

c073848 [Kan Zhang] [SPARK-1817] Cosmetic updates
524c670 [Kan Zhang] [SPARK-1817] RDD.zip() should verify partition sizes for each partition
2014-06-03 22:47:18 -07:00
Sean Owen 4ca0625669 SPARK-1806 (addendum) Use non-deprecated methods in Mesos 0.18
The update to Mesos 0.18 caused some deprecation warnings in the build. The change to the non-deprecated version is straightforward as it emulates what the Mesos driver does with the deprecated method anyway (c5aa1dd221/src/sched/sched.cpp (L1354))

Author: Sean Owen <sowen@cloudera.com>

Closes #920 from srowen/SPARK-1806 and squashes the following commits:

8d76b6a [Sean Owen] Use non-deprecated methods in Mesos 0.18
2014-06-03 22:37:20 -07:00
Reynold Xin 1faef149f7 SPARK-1941: Update streamlib to 2.7.0 and use HyperLogLogPlus instead of HyperLogLog.
I also corrected some errors made in the previous approximate HLL count API, including that relativeSD wasn't really a measure of error (and we used it to test error bounds in test results).

Author: Reynold Xin <rxin@apache.org>

Closes #897 from rxin/hll and squashes the following commits:

4d83f41 [Reynold Xin] New error bound and non-randomness.
f154ea0 [Reynold Xin] Added a comment on the value bound for testing.
e367527 [Reynold Xin] One more round of code review.
41e649a [Reynold Xin] Update final mima list.
9e320c8 [Reynold Xin] Incorporate code review feedback.
e110d70 [Reynold Xin] Merge branch 'master' into hll
354deb8 [Reynold Xin] Added comment on the Mima exclude rules.
acaa524 [Reynold Xin] Added the right exclude rules in MimaExcludes.
6555bfe [Reynold Xin] Added a default method and re-arranged MimaExcludes.
1db1522 [Reynold Xin] Excluded util.SerializableHyperLogLog from MIMA check.
9221b27 [Reynold Xin] Merge branch 'master' into hll
88cfe77 [Reynold Xin] Updated documentation and restored the old incorrect API to maintain API compatibility.
1294be6 [Reynold Xin] Updated HLL+.
e7786cb [Reynold Xin] Merge branch 'master' into hll
c0ef0c2 [Reynold Xin] SPARK-1941: Update streamlib to 2.7.0 and use HyperLogLogPlus instead of HyperLogLog.
2014-06-03 18:37:40 -07:00
Ankur Dave b1feb60209 [SPARK-1991] Support custom storage levels for vertices and edges
This PR adds support for specifying custom storage levels for the vertices and edges of a graph. This enables GraphX to handle graphs larger than memory size by specifying MEMORY_AND_DISK and then repartitioning the graph to use many small partitions, each of which does fit in memory. Spark will then automatically load partitions from disk as needed.

The user specifies the desired vertex and edge storage levels when building the graph by passing them to the graph constructor. These are then stored in the `targetStorageLevel` attribute of the VertexRDD and EdgeRDD respectively. Whenever GraphX needs to cache a VertexRDD or EdgeRDD (because it plans to use it more than once, for example), it uses the specified target storage level. Also, when the user calls `Graph#cache()`, the vertices and edges are persisted using their target storage levels.

In order to facilitate propagating the target storage levels across VertexRDD and EdgeRDD operations, we remove raw calls to the constructors and instead introduce the `withPartitionsRDD` and `withTargetStorageLevel` methods.

I tested this change by running PageRank and triangle count on a severely memory-constrained cluster (1 executor with 300 MB of memory, and a 1 GB graph). Before this PR, these algorithms used to fail with OutOfMemoryErrors. With this PR, and using the DISK_ONLY storage level, they succeed.
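A hedged usage sketch (assuming an existing `SparkContext` `sc`; the `edgeStorageLevel`/`vertexStorageLevel` parameter names are my reading of the API this PR describes, not a quotation of it):

```scala
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.storage.StorageLevel

val edges = sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1)))

val graph = Graph.fromEdges(
  edges,
  defaultValue = 0,
  edgeStorageLevel = StorageLevel.MEMORY_AND_DISK,    // spill edge partitions to disk if needed
  vertexStorageLevel = StorageLevel.MEMORY_AND_DISK)  // likewise for vertices

graph.cache()   // persists vertices and edges at their target storage levels
```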

Author: Ankur Dave <ankurdave@gmail.com>

Closes #946 from ankurdave/SPARK-1991 and squashes the following commits:

ce17d95 [Ankur Dave] Move pickStorageLevel to StorageLevel.fromString
ccaf06f [Ankur Dave] Shadow members in withXYZ() methods rather than using underscores
c34abc0 [Ankur Dave] Exclude all of GraphX from compatibility checks vs. 1.0.0
c5ca068 [Ankur Dave] Revert "Exclude all of GraphX from binary compatibility checks"
34bcefb [Ankur Dave] Exclude all of GraphX from binary compatibility checks
6fdd137 [Ankur Dave] [SPARK-1991] Support custom storage levels for vertices and edges
2014-06-03 14:54:26 -07:00
Wenchen Fan(Cloud) 45e9bc85db [SPARK-1912] fix compress memory issue during reduce
When we need to read a compressed block, we first create a compression stream instance (LZF or Snappy) and use it to wrap that block.
Say a reducer task needs to read 1000 local shuffle blocks: it first prepares to read those 1000 blocks, which means creating 1000 compression stream instances to wrap them. But initializing a compression instance allocates some memory, and having many compression instances alive at the same time is a problem.
Since the reducer actually reads the shuffle blocks one by one, we can initialize the compression instances lazily.
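A minimal sketch of the lazy-initialization idea in plain Scala (not the actual BlockManager code): whatever the wrapped iterator allocates is only created on first use.

```scala
// Wraps a by-name iterator factory so that the expensive resource behind it
// (here, the compression stream) is created only when the block is actually read.
class LazyIterator[T](create: => Iterator[T]) extends Iterator[T] {
  private lazy val underlying = create   // allocation happens here, on first access
  override def hasNext: Boolean = underlying.hasNext
  override def next(): T = underlying.next()
}
```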

Author: Wenchen Fan(Cloud) <cloud0fan@gmail.com>

Closes #860 from cloud-fan/fix-compress and squashes the following commits:

0924a6b [Wenchen Fan(Cloud)] rename 'doWork' into 'getIterator'
07f32c2 [Wenchen Fan(Cloud)] move the LazyProxyIterator to dataDeserialize
d80c426 [Wenchen Fan(Cloud)] remove empty lines in short class
2c8adb2 [Wenchen Fan(Cloud)] add inline comment
8ebff77 [Wenchen Fan(Cloud)] fix compress memory issue during reduce
2014-06-03 13:18:20 -07:00
Aaron Davidson 9909efc10a SPARK-1839: PySpark RDD#take() shouldn't always read from driver
This patch simply ports over the Scala implementation of RDD#take(), which reads the first partition at the driver, then decides how many more partitions it needs to read and will possibly start a real job if it's more than 1. (Note that SparkContext#runJob(allowLocal=true) only runs the job locally if there's 1 partition selected and no parent stages.)

Author: Aaron Davidson <aaron@databricks.com>

Closes #922 from aarondav/take and squashes the following commits:

fa06df9 [Aaron Davidson] SPARK-1839: PySpark RDD#take() shouldn't always read from driver
2014-05-31 13:04:57 -07:00
Aaron Davidson 7d52777eff Super minor: Close inputStream in SparkSubmitArguments
`Properties#load()` doesn't close the InputStream, but it'd be closed after being GC'd anyway...

Also changed file.getName to file, because getName only shows the filename. This will show the full (possibly relative) path, which is less confusing if it's not found.
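The fix amounts to the usual try/finally pattern (a sketch; the file path is illustrative):

```scala
import java.io.FileInputStream
import java.util.Properties

val props = new Properties()
val in = new FileInputStream("conf/spark-defaults.conf")
try {
  props.load(in)   // Properties#load does not close the stream itself
} finally {
  in.close()       // close explicitly instead of waiting for GC
}
```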

Author: Aaron Davidson <aaron@databricks.com>

Closes #914 from aarondav/tiny and squashes the following commits:

db9d072 [Aaron Davidson] Super minor: Close inputStream in SparkSubmitArguments
2014-05-31 12:36:58 -07:00
Chen Chao 9ecc40d3ae correct tiny comment error
Author: Chen Chao <crazyjvm@gmail.com>

Closes #928 from CrazyJvm/patch-8 and squashes the following commits:

144328b [Chen Chao] correct tiny comment error
2014-05-31 00:06:49 -07:00
Zhen Peng ff562b2396 [SPARK-1901] worker should make sure executor has exited before updating executor's info
https://issues.apache.org/jira/browse/SPARK-1901

Author: Zhen Peng <zhenpeng01@baidu.com>

Closes #854 from zhpengg/bugfix-worker-kills-executor and squashes the following commits:

21d380b [Zhen Peng] add some error messages
506cea6 [Zhen Peng] add some docs for killProcess()
a0b9860 [Zhen Peng] [SPARK-1901] worker should make sure executor has exited before updating executor's info
2014-05-30 10:12:51 -07:00
witgo 4dbb27b0cf [SPARK-1712]: TaskDescription instance is too big causes Spark to hang
Author: witgo <witgo@qq.com>

Closes #694 from witgo/SPARK-1712_new and squashes the following commits:

0f52483 [witgo] review commit
83ce29b [witgo] Merge branch 'master' of https://github.com/apache/spark into SPARK-1712_new
52e6752 [witgo] reset test SparkContext
63636b6 [witgo] review commit
44a59ee [witgo] review commit
3b6d48c [witgo] Merge branch 'master' of https://github.com/apache/spark into SPARK-1712_new
926bd6a [witgo] Merge branch 'master' of https://github.com/apache/spark into SPARK-1712_new
9a5cfad [witgo] Merge branch 'master' of https://github.com/apache/spark into SPARK-1712_new
03cc562 [witgo] Merge branch 'master' of https://github.com/apache/spark into SPARK-1712_new
b0930b0 [witgo] review commit
b1174bd [witgo] merge master
f76679b [witgo] merge master
689495d [witgo] fix scala style bug
1d35c3c [witgo] Merge branch 'master' of https://github.com/apache/spark into SPARK-1712_new
062c182 [witgo] fix small bug for code style
0a428cf [witgo] add unit tests
158b2dc [witgo] review commit
4afe71d [witgo] review commit
9e4ffa7 [witgo] review commit
1d35c7d [witgo] fix hang
7965580 [witgo] fix Statement order
0e29eac [witgo] Merge branch 'master' of https://github.com/apache/spark into SPARK-1712_new
3ea1ca1 [witgo] remove duplicate serialize
743a7ad [witgo] Merge branch 'master' of https://github.com/apache/spark into SPARK-1712_new
86e2048 [witgo] Merge branch 'master' of https://github.com/apache/spark into SPARK-1712_new
2a89adc [witgo] SPARK-1712: TaskDescription instance is too big causes Spark to hang
2014-05-28 15:57:05 -07:00
lianhuiwang 95e4c9c6fb bugfix worker DriverStateChanged state should match DriverState.FAILED
bugfix worker DriverStateChanged state should match DriverState.FAILED

Author: lianhuiwang <lianhuiwang09@gmail.com>

Closes #864 from lianhuiwang/master and squashes the following commits:

480ce94 [lianhuiwang] address aarondav comments
f2b5970 [lianhuiwang] bugfix worker DriverStateChanged state should match DriverState.FAILED
2014-05-27 11:53:38 -07:00
zsxwing 549830b0db SPARK-1932: Fix race conditions in onReceiveCallback and cachedPeers
`var cachedPeers: Seq[BlockManagerId] = null` is used in `def replicate(blockId: BlockId, data: ByteBuffer, level: StorageLevel)` without proper protection.

There are two places that call `replicate(blockId, bytesAfterPut, level)`:
* 17f3075bc4/core/src/main/scala/org/apache/spark/storage/BlockManager.scala (L644) runs in `connectionManager.futureExecContext`
* 17f3075bc4/core/src/main/scala/org/apache/spark/storage/BlockManager.scala (L752) `doPut` runs in `connectionManager.handleMessageExecutor`. `org.apache.spark.storage.BlockManagerWorker` calls `blockManager.putBytes` in `connectionManager.handleMessageExecutor`.

As they run in different `Executor`s, this is a race condition that may cause the memory pointed to by `cachedPeers` to be incorrect even if `cachedPeers != null`.

The race condition of `onReceiveCallback` is that it's set in `BlockManagerWorker` but read in a different thread in `ConnectionManager.handleMessageExecutor`.

Author: zsxwing <zsxwing@gmail.com>

Closes #887 from zsxwing/SPARK-1932 and squashes the following commits:

524f69c [zsxwing] SPARK-1932: Fix race conditions in onReceiveCallback and cachedPeers
2014-05-26 23:17:39 -07:00
Reynold Xin 90e281b55a SPARK-1933: Throw a more meaningful exception when a directory is passed to addJar/addFile.
https://issues.apache.org/jira/browse/SPARK-1933

Author: Reynold Xin <rxin@apache.org>

Closes #888 from rxin/addfile and squashes the following commits:

8c402a3 [Reynold Xin] Updated comment.
ff6c162 [Reynold Xin] SPARK-1933: Throw a more meaningful exception when a directory is passed to addJar/addFile.
2014-05-26 22:05:23 -07:00
Reynold Xin ef690e1f69 Fixed the error message for OutOfMemoryError in DAGScheduler. 2014-05-26 21:31:27 -07:00
Zhen Peng 8d271c90fa SPARK-1929 DAGScheduler suspended by local task OOM
DAGScheduler does not handle local task OOM properly, and will wait for the job result forever.

Author: Zhen Peng <zhenpeng01@baidu.com>

Closes #883 from zhpengg/bugfix-dag-scheduler-oom and squashes the following commits:

76f7eda [Zhen Peng] remove redundant memory allocations
aa63161 [Zhen Peng] SPARK-1929 DAGScheduler suspended by local task OOM
2014-05-26 21:30:25 -07:00
Patrick Wendell b6d22af040 HOTFIX: Add no-arg SparkContext constructor in Java
Self explanatory.

Author: Patrick Wendell <pwendell@gmail.com>

Closes #878 from pwendell/java-constructor and squashes the following commits:

2cc1605 [Patrick Wendell] HOTFIX: Add no-arg SparkContext constructor in Java
2014-05-25 20:13:32 -07:00
Zhen Peng 4e4831b8fa [SPARK-1886] check executor id existence when executor exit
Author: Zhen Peng <zhenpeng01@baidu.com>

Closes #827 from zhpengg/bugfix-executor-id-not-found and squashes the following commits:

cd8bb65 [Zhen Peng] bugfix: check executor id existence when executor exit
2014-05-24 20:40:19 -07:00
Andrew Or 5081a0a9d4 [SPARK-1900 / 1918] PySpark on YARN is broken
If I run the following on a YARN cluster
```
bin/spark-submit sheep.py --master yarn-client
```
it fails because of a mismatch in paths: `spark-submit` thinks that `sheep.py` resides on HDFS, and balks when it can't find the file there. A natural workaround is to add the `file:` prefix to the file:
```
bin/spark-submit file:/path/to/sheep.py --master yarn-client
```
However, this also fails. This time it is because Python does not understand URI schemes.

This PR fixes this by automatically resolving all paths passed as command line argument to `spark-submit` properly. This has the added benefit of keeping file and jar paths consistent across different cluster modes. For python, we strip the URI scheme before we actually try to run it.
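A rough sketch of the resolution idea (not the exact `Utils` code, and ignoring the Windows-specific handling mentioned in the commit list): bare local paths get a `file:` scheme, while arguments that already carry a scheme are left alone.

```scala
import java.io.File
import java.net.URI

def resolveURI(path: String): URI = {
  val uri = new URI(path)
  if (uri.getScheme != null) uri             // already hdfs:, file:, http:, ...
  else new File(path).getAbsoluteFile.toURI  // bare local path -> file:/abs/path
}

// resolveURI("sheep.py")              -> file:/current/dir/sheep.py
// resolveURI("hdfs:///apps/sheep.py") is returned unchanged
```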

Much of the code is originally written by @mengxr. Tested on YARN cluster. More tests pending.

Author: Andrew Or <andrewor14@gmail.com>

Closes #853 from andrewor14/submit-paths and squashes the following commits:

0bb097a [Andrew Or] Format path correctly before adding it to PYTHONPATH
323b45c [Andrew Or] Include --py-files on PYTHONPATH for pyspark shell
3c36587 [Andrew Or] Improve error messages (minor)
854aa6a [Andrew Or] Guard against NPE if user gives pathological paths
6638a6b [Andrew Or] Fix spark-shell jar paths after #849 went in
3bb0359 [Andrew Or] Update more comments (minor)
2a1f8a0 [Andrew Or] Update comments (minor)
6af2c77 [Andrew Or] Merge branch 'master' of github.com:apache/spark into submit-paths
a68c4d1 [Andrew Or] Handle Windows python file path correctly
427a250 [Andrew Or] Resolve paths properly for Windows
a591a4a [Andrew Or] Update tests for resolving URIs
6c8621c [Andrew Or] Move resolveURIs to Utils
db8255e [Andrew Or] Merge branch 'master' of github.com:apache/spark into submit-paths
f542dce [Andrew Or] Fix outdated tests
691c4ce [Andrew Or] Ignore special primary resource names
5342ac7 [Andrew Or] Add missing space in error message
02f77f3 [Andrew Or] Resolve command line arguments to spark-submit properly
2014-05-24 18:01:49 -07:00
Xiangrui Meng dba314029b [SPARK-1870] Make spark-submit --jars work in yarn-cluster mode.
Send secondary jars to the distributed cache of all containers and add the cached jars to the classpath before executors start. Tested on a YARN cluster (CDH-5.0).

`spark-submit --jars` also works in standalone server mode and `yarn-client`. Thanks to @andrewor14 for testing!

I removed "Doesn't work for drivers in standalone mode with "cluster" deploy mode." from `spark-submit`'s help message, though we haven't tested mesos yet.

CC: @dbtsai @sryza

Author: Xiangrui Meng <meng@databricks.com>

Closes #848 from mengxr/yarn-classpath and squashes the following commits:

23e7df4 [Xiangrui Meng] rename spark.jar to __spark__.jar and app.jar to __app__.jar to avoid conflicts; append $CWD/ and $CWD/* to the classpath; remove unused methods
a40f6ed [Xiangrui Meng] standalone -> cluster
65e04ad [Xiangrui Meng] update spark-submit help message and add a comment for yarn-client
11e5354 [Xiangrui Meng] minor changes
3e7e1c4 [Xiangrui Meng] use sparkConf instead of hadoop conf
dc3c825 [Xiangrui Meng] add secondary jars to classpath in yarn
2014-05-22 01:52:50 -07:00
Tathagata Das 52eb54d024 [Spark 1877] ClassNotFoundException when loading RDD with serialized objects
Updated version of #821

Author: Tathagata Das <tathagata.das1565@gmail.com>
Author: Ghidireac <bogdang@u448a5b0a73d45358d94a.ant.amazon.com>

Closes #835 from tdas/SPARK-1877 and squashes the following commits:

f346f71 [Tathagata Das] Addressed Patrick's comments.
fee0c5d [Ghidireac] SPARK-1877: ClassNotFoundException when loading RDD with serialized objects
2014-05-19 22:36:24 -07:00
Aaron Davidson b0ce22e071 SPARK-1689: Spark application should die when removed by Master
scheduler.error() will mask the error if there are active tasks. Being removed is a cataclysmic event for Spark applications, and should probably be treated as such.

Author: Aaron Davidson <aaron@databricks.com>

Closes #832 from aarondav/i-love-u and squashes the following commits:

9f1200f [Aaron Davidson] SPARK-1689: Spark application should die when removed by Master
2014-05-19 20:55:26 -07:00
Matei Zaharia 5af99d7617 SPARK-1879. Increase MaxPermSize since some of our builds have many classes
See https://issues.apache.org/jira/browse/SPARK-1879 -- builds with Hadoop2 and Hive ran out of PermGen space in spark-shell, when those things added up with the Scala compiler.

Note that users can still override it by setting their own Java options with this change. Their options will come later in the command string than the -XX:MaxPermSize=128m.

Author: Matei Zaharia <matei@databricks.com>

Closes #823 from mateiz/spark-1879 and squashes the following commits:

6bc0ee8 [Matei Zaharia] Increase MaxPermSize to 128m since some of our builds have lots of classes
2014-05-19 18:42:28 -07:00
Matei Zaharia 7b70a70718 [SPARK-1876] Windows fixes to deal with latest distribution layout changes
- Look for JARs in the right place
- Launch examples the same way as on Unix
- Load datanucleus JARs if they exist
- Don't attempt to parse local paths as URIs in SparkSubmit, since paths with C:\ are not valid URIs
- Also fixed POM exclusion rules for datanucleus (it wasn't properly excluding it, whereas SBT was)

Author: Matei Zaharia <matei@databricks.com>

Closes #819 from mateiz/win-fixes and squashes the following commits:

d558f96 [Matei Zaharia] Fix comment
228577b [Matei Zaharia] Review comments
d3b71c7 [Matei Zaharia] Properly exclude datanucleus files in Maven assembly
144af84 [Matei Zaharia] Update Windows scripts to match latest binary package layout
2014-05-19 15:02:35 -07:00
Patrick Wendell 442808a748 Make deprecation warning less severe
Just a small change. I think it's good not to scare people who are using the old options.

Author: Patrick Wendell <pwendell@gmail.com>

Closes #810 from pwendell/warnings and squashes the following commits:

cb8a311 [Patrick Wendell] Make deprecation warning less severe
2014-05-16 22:58:47 -07:00
Andrew Or 4b8ec6fcfd [SPARK-1808] Route bin/pyspark through Spark submit
**Problem.** For `bin/pyspark`, there is currently no way to specify Spark configuration properties other than through `SPARK_JAVA_OPTS` in `conf/spark-env.sh`. However, this mechanism is supposedly deprecated. Instead, it needs to pick up configurations explicitly specified in `conf/spark-defaults.conf`.

**Solution.** Have `bin/pyspark` invoke `bin/spark-submit`, like all of its counterparts in Scala land (i.e. `bin/spark-shell`, `bin/run-example`). This has the additional benefit of making the invocation of all the user facing Spark scripts consistent.

**Details.** `bin/pyspark` inherently handles two cases: (1) running python applications and (2) running the python shell. For (1), Spark submit already handles running python applications. For cases in which `bin/pyspark` is given a python file, we can simply pass the file directly to Spark submit and let it handle the rest.

For case (2), `bin/pyspark` starts a python process as before, which launches the JVM as a sub-process. The existing code already provides a code path to do this. All we needed to change is to use `bin/spark-submit` instead of `spark-class` to launch the JVM. This requires modifications to Spark submit to handle the pyspark shell as a special case.

This has been tested locally (OSX and Windows 7), on a standalone cluster, and on a YARN cluster. Running IPython also works as before, except now it takes in Spark submit arguments too.

Author: Andrew Or <andrewor14@gmail.com>

Closes #799 from andrewor14/pyspark-submit and squashes the following commits:

bf37e36 [Andrew Or] Minor changes
01066fa [Andrew Or] bin/pyspark for Windows
c8cb3bf [Andrew Or] Handle perverse app names (with escaped quotes)
1866f85 [Andrew Or] Windows is not cooperating
456d844 [Andrew Or] Guard against shlex hanging if PYSPARK_SUBMIT_ARGS is not set
7eebda8 [Andrew Or] Merge branch 'master' of github.com:apache/spark into pyspark-submit
b7ba0d8 [Andrew Or] Address a few comments (minor)
06eb138 [Andrew Or] Use shlex instead of writing our own parser
05879fa [Andrew Or] Merge branch 'master' of github.com:apache/spark into pyspark-submit
a823661 [Andrew Or] Fix --die-on-broken-pipe not propagated properly
6fba412 [Andrew Or] Deal with quotes + address various comments
fe4c8a7 [Andrew Or] Update --help for bin/pyspark
afe47bf [Andrew Or] Fix spark shell
f04aaa4 [Andrew Or] Merge branch 'master' of github.com:apache/spark into pyspark-submit
a371d26 [Andrew Or] Route bin/pyspark through Spark submit
2014-05-16 22:34:38 -07:00
Michael Armbrust a80a6a139e SPARK-1864 Look in spark conf instead of system properties when propagating configuration to executors.
Author: Michael Armbrust <michael@databricks.com>

Closes #808 from marmbrus/confClasspath and squashes the following commits:

4c31d57 [Michael Armbrust] Look in spark conf instead of system properties when propagating configuration to executors.
2014-05-16 20:25:10 -07:00
Aaron Davidson bb98ecafce SPARK-1860: Do not cleanup application work/ directories by default
This causes an unrecoverable error for applications that are running for longer
than 7 days that have jars added to the SparkContext, as the jars are cleaned up
even though the application is still running.

Author: Aaron Davidson <aaron@databricks.com>

Closes #800 from aarondav/shitty-defaults and squashes the following commits:

a573fbb [Aaron Davidson] SPARK-1860: Do not cleanup application work/ directories by default
2014-05-15 21:37:58 -07:00
Huajian Mao 94c5139607 Typos in Spark
Author: Huajian Mao <huajianmao@gmail.com>

Closes #798 from huajianmao/patch-1 and squashes the following commits:

208a454 [Huajian Mao] A typo in Task
1b515af [Huajian Mao] A typo in the message
2014-05-15 18:20:16 -07:00
Prashant Sharma 46324279da Package docs
This is a few changes based on the original patch by @scrapcodes.

Author: Prashant Sharma <prashant.s@imaginea.com>
Author: Patrick Wendell <pwendell@gmail.com>

Closes #785 from pwendell/package-docs and squashes the following commits:

c32b731 [Patrick Wendell] Changes based on Prashant's patch
c0463d3 [Prashant Sharma] added eof new line
ce8bf73 [Prashant Sharma] Added eof new line to all files.
4c35f2e [Prashant Sharma] SPARK-1563 Add package-info.java and package.scala files for all packages that appear in docs
2014-05-14 22:24:41 -07:00
Patrick Wendell 21570b4633 Documentation: Encourage use of reduceByKey instead of groupByKey.
Author: Patrick Wendell <pwendell@gmail.com>

Closes #784 from pwendell/group-by-key and squashes the following commits:

9b4505f [Patrick Wendell] Small fix
6347924 [Patrick Wendell] Documentation: Encourage use of reduceByKey instead of groupByKey.
2014-05-14 22:24:04 -07:00
Tathagata Das ad4e60ee7e [SPARK-1840] SparkListenerBus prints out scary error message when terminated normally
Running the SparkPi example gave this error.
```
Pi is roughly 3.14374
14/05/14 18:16:19 ERROR Utils: Uncaught exception in thread SparkListenerBus
scala.runtime.NonLocalReturnControl$mcV$sp
```
This is due to the catch-all in the SparkListenerBus, which logged the control throwable used by the Scala runtime.
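The fix is essentially to special-case Scala's control-flow throwables in the catch-all; a minimal sketch of that pattern (not the exact Spark helper):

```scala
import scala.util.control.ControlThrowable

def logUncaughtExceptions(body: => Unit): Unit = {
  try {
    body
  } catch {
    // NonLocalReturnControl and friends are normal control flow: rethrow, don't log.
    case ct: ControlThrowable => throw ct
    case t: Throwable => System.err.println(s"Uncaught exception: $t")
  }
}
```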

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #783 from tdas/controlexception-fix and squashes the following commits:

a466c8d [Tathagata Das] Ignored control exceptions when logging all exceptions.
2014-05-14 21:13:41 -07:00
andrewor14 9ad096d55a [Typo] propertes -> properties
Author: andrewor14 <andrewor14@gmail.com>

Closes #780 from andrewor14/submit-typo and squashes the following commits:

e70e057 [andrewor14] propertes -> properties
2014-05-14 17:54:53 -07:00
Jacek Laskowski 601e37198b String interpolation + some other small changes
After having been invited to make the change in 6bee01dd04 (commitcomment-6284165) by @witgo.

Author: Jacek Laskowski <jacek@japila.pl>

Closes #748 from jaceklaskowski/sparkenv-string-interpolation and squashes the following commits:

be6ebac [Jacek Laskowski] String interpolation + some other small changes
2014-05-14 15:45:52 -07:00
Patrick Wendell 65533c7ec0 SPARK-1833 - Have an empty SparkContext constructor.
This is nicer than relying on new SparkContext(new SparkConf())
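Usage after this change (the master and app name are then picked up from the configuration, e.g. system properties, rather than from an explicit `SparkConf`):

```scala
import org.apache.spark.SparkContext

val sc = new SparkContext()   // equivalent to new SparkContext(new SparkConf())
```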

Author: Patrick Wendell <pwendell@gmail.com>

Closes #774 from pwendell/spark-context and squashes the following commits:

ef9f12f [Patrick Wendell] SPARK-1833 - Have an empty SparkContext constructor.
2014-05-14 12:53:30 -07:00
Andrew Ash a3315d7f4c SPARK-1829 Sub-second durations shouldn't round to "0 s"
As "99 ms" up to 99 ms
As "0.1 s" from 0.1 s up to 0.9 s

https://issues.apache.org/jira/browse/SPARK-1829

Compare the first image to the second here: http://imgur.com/RaLEsSZ,7VTlgfo#0

Author: Andrew Ash <andrew@andrewash.com>

Closes #768 from ash211/spark-1829 and squashes the following commits:

1c15b8e [Andrew Ash] SPARK-1829 Format sub-second durations more appropriately
2014-05-14 12:01:14 -07:00
Mark Hamstra 17f3075bc4 [SPARK-1620] Handle uncaught exceptions in function run by Akka scheduler
If the intended behavior was that uncaught exceptions thrown in functions being run by the Akka scheduler would end up being handled by the default uncaught exception handler set in Executor, and if that behavior is, in fact, correct, then this is a way to accomplish that.  I'm not certain, though, that we shouldn't be doing something different to handle uncaught exceptions from some of these scheduled functions.

In any event, this PR covers all of the cases I comment on in [SPARK-1620](https://issues.apache.org/jira/browse/SPARK-1620).
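A rough sketch of the wrapper idea (`Utils.tryOrExit` in the commit list; the actual implementation hands off to Executor's uncaught exception handler, so treat this as an approximation):

```scala
def tryOrExit(block: => Unit): Unit = {
  try {
    block
  } catch {
    case t: Throwable =>
      // Route the failure to the process-level handler instead of letting the
      // Akka scheduler silently swallow it.
      Option(Thread.getDefaultUncaughtExceptionHandler) match {
        case Some(handler) => handler.uncaughtException(Thread.currentThread(), t)
        case None          => throw t
      }
  }
}
```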

Author: Mark Hamstra <markhamstra@gmail.com>

Closes #622 from markhamstra/SPARK-1620 and squashes the following commits:

071d193 [Mark Hamstra] refactored post-SPARK-1772
1a6a35e [Mark Hamstra] another style fix
d30eb94 [Mark Hamstra] scalastyle
3573ecd [Mark Hamstra] Use wrapped try/catch in Utils.tryOrExit
8fc0439 [Mark Hamstra] Make functions run by the Akka scheduler use Executor's UncaughtExceptionHandler
2014-05-14 10:07:25 -07:00
Andrew Or 69f750228f [SPARK-1769] Executor loss causes NPE race condition
This PR replaces the Schedulable data structures in Pool.scala with thread-safe ones from java. Note that Scala's `with SynchronizedBuffer` trait is soon to be deprecated in 2.11 because it is ["inherently unreliable"](http://www.scala-lang.org/api/2.11.0/index.html#scala.collection.mutable.SynchronizedBuffer). We should slowly drift away from `SynchronizedBuffer` in other places too.

Note that this PR introduces an API-breaking change; `sc.getAllPools` now returns an Array rather than an ArrayBuffer. This is because we want this method to return an immutable copy; a mutable one could confuse the user if they try to modify the copy, since such modifications have no effect on the original data structure.

Author: Andrew Or <andrewor14@gmail.com>

Closes #762 from andrewor14/pool-npe and squashes the following commits:

383e739 [Andrew Or] JavaConverters -> JavaConversions
3f32981 [Andrew Or] Merge branch 'master' of github.com:apache/spark into pool-npe
769be19 [Andrew Or] Assorted minor changes
2189247 [Andrew Or] Merge branch 'master' of github.com:apache/spark into pool-npe
05ad9e9 [Andrew Or] Fix test - contains is not the same as containsKey
0921ea0 [Andrew Or] var -> val
07d720c [Andrew Or] Synchronize Schedulable data structures
2014-05-14 00:54:33 -07:00
Koert Kuipers b22952fa1f SPARK-1801. expose InterruptibleIterator and TaskKilledException in developer api

Author: Koert Kuipers <koert@tresata.com>

Closes #764 from koertkuipers/feat-rdd-developerapi and squashes the following commits:

8516dd2 [Koert Kuipers] SPARK-1801. expose InterruptibleIterator and TaskKilledException in developer api
2014-05-14 00:12:35 -07:00
Patrick Wendell 7bb9a521f3 Revert "[SPARK-1784] Add a new partitioner to allow specifying # of keys per partition"
This reverts commit 92cebada09.
2014-05-13 23:24:51 -07:00
larvaboy c33b8dcbf6 Implement ApproximateCountDistinct for SparkSql
Add the implementation for ApproximateCountDistinct to SparkSql. We use the HyperLogLog algorithm implemented in stream-lib, and do the count in two phases: 1) counting the number of distinct elements in each partition, and 2) merging the HyperLogLog results from the different partitions.

A simple serializer and test cases are added as well.
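A hedged sketch of the two-phase approach outside of SQL, using stream-lib's `HyperLogLogPlus` (assuming an existing `SparkContext` `sc`; the aggregate added by this PR wires the same idea into Spark SQL's expression tree):

```scala
import com.clearspring.analytics.stream.cardinality.HyperLogLogPlus

val values = sc.parallelize(Seq("a", "b", "a", "c", "b", "d"), numSlices = 3)

// Phase 1: build one HyperLogLog sketch per partition.
val perPartition = values.mapPartitions { iter =>
  val hll = new HyperLogLogPlus(12)   // precision parameter p
  iter.foreach(hll.offer)
  Iterator(hll)
}

// Phase 2: merge the partition sketches and read off the estimate.
val merged = perPartition.reduce { (a, b) => a.addAll(b); a }
println(merged.cardinality())         // approximately 4 distinct values
```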

Author: larvaboy <larvaboy@gmail.com>

Closes #737 from larvaboy/master and squashes the following commits:

bd8ef3f [larvaboy] Add support of user-provided standard deviation to ApproxCountDistinct.
9ba8360 [larvaboy] Fix alignment and null handling issues.
95b4067 [larvaboy] Add a test case for count distinct and approximate count distinct.
f57917d [larvaboy] Add the parser for the approximate count.
a2d5d10 [larvaboy] Add ApproximateCountDistinct aggregates and functions.
7ad273a [larvaboy] Add SparkSql serializer for HyperLogLog.
1d9aacf [larvaboy] Fix a minor typo in the toString method of the Count case class.
653542b [larvaboy] Fix a couple of minor typos.
2014-05-13 21:26:08 -07:00
Syed Hashmi 92cebada09 [SPARK-1784] Add a new partitioner to allow specifying # of keys per partition
This change adds a new partitioner which allows users
to specify # of keys per partition.

Author: Syed Hashmi <shashmi@cloudera.com>

Closes #721 from syedhashmi/master and squashes the following commits:

4ca94cc [Syed Hashmi] [SPARK-1784] Add a new partitioner
2014-05-13 21:24:23 -07:00
Andrew Or 5c0dafc2c8 [SPARK-1816] LiveListenerBus dies if a listener throws an exception
The solution is to wrap a try / catch / log around the posting of each event to each listener.

Author: Andrew Or <andrewor14@gmail.com>

Closes #759 from andrewor14/listener-die and squashes the following commits:

aee5107 [Andrew Or] Merge branch 'master' of github.com:apache/spark into listener-die
370939f [Andrew Or] Remove two layers of indirection
422d278 [Andrew Or] Explicitly throw an exception instead of 1 / 0
0df0e2a [Andrew Or] Try/catch and log exceptions when posting events
2014-05-13 18:32:32 -07:00
William Benton 16ffadcc4a SPARK-571: forbid return statements in cleaned closures
This patch checks top-level closure arguments to `ClosureCleaner.clean` for `return` statements and raises an exception if it finds any.  This is mainly a user-friendliness addition, since programs with return statements in closure arguments will currently fail upon RDD actions with a less-than-intuitive error message.

Author: William Benton <willb@redhat.com>

Closes #717 from willb/spark-571 and squashes the following commits:

c41eb7d [William Benton] Another test case for SPARK-571
30c42f4 [William Benton] Stylistic cleanups
559b16b [William Benton] Stylistic cleanups from review
de13b79 [William Benton] Style fixes
295b6a5 [William Benton] Forbid return statements in closure arguments.
b017c47 [William Benton] Added a test for SPARK-571
2014-05-13 13:45:23 -07:00
Sandy Ryza 2792bd016a SPARK-1815. SparkContext should not be marked DeveloperApi
Author: Sandy Ryza <sandy@cloudera.com>

Closes #753 from sryza/sandy-spark-1815 and squashes the following commits:

957a8ac [Sandy Ryza] SPARK-1815. SparkContext should not be marked DeveloperApi
2014-05-12 20:08:30 -07:00
Andrew Or ba96bb3d59 [SPARK-1780] Non-existent SPARK_DAEMON_OPTS is lurking around
What they really mean is SPARK_DAEMON_***JAVA***_OPTS

Author: Andrew Or <andrewor14@gmail.com>

Closes #751 from andrewor14/spark-daemon-opts and squashes the following commits:

70c41f9 [Andrew Or] SPARK_DAEMON_OPTS -> SPARK_DAEMON_JAVA_OPTS
2014-05-12 19:42:35 -07:00
Andrew Ash a5150d199c Typo: resond -> respond
Author: Andrew Ash <andrew@andrewash.com>

Closes #743 from ash211/patch-4 and squashes the following commits:

c959f3b [Andrew Ash] Typo: resond -> respond
2014-05-12 18:46:28 -07:00
Patrick Wendell 925d8b249b SPARK-1623: Use File objects instead of String's in HTTPBroadcast
This seems strictly better, and I think it's justified on the grounds of
clean-up alone. It might also fix issues with path conversions, but I
haven't yet isolated any instance of that happening.

/cc @srowen @tdas

Author: Patrick Wendell <pwendell@gmail.com>

Closes #749 from pwendell/broadcast-cleanup and squashes the following commits:

d6d54f2 [Patrick Wendell] SPARK-1623: Use File objects instead of string's in HTTPBroadcast
2014-05-12 17:27:28 -07:00
Patrick Wendell 3ce526b168 Rename testExecutorEnvs --> executorEnvs.
This was changed, but in fact, it's used for things other than tests.
So I've changed it back.

Author: Patrick Wendell <pwendell@gmail.com>

Closes #747 from pwendell/executor-env and squashes the following commits:

36a60a5 [Patrick Wendell] Rename testExecutorEnvs --> executorEnvs.
2014-05-12 17:09:13 -07:00
Sean Owen 7120a2979d SPARK-1798. Tests should clean up temp files
Three issues related to temp files that tests generate – these should be touched up for hygiene but are not urgent.

Modules have a log4j.properties which directs the unit-test.log output file to a directory like `[module]/target/unit-test.log`. But this ends up creating `[module]/[module]/target/unit-test.log` instead of the former.

The `work/` directory is not deleted by "mvn clean", in the parent and in modules. Neither is the `checkpoint/` directory created under the various external modules.

Many tests create a temp directory, which is not usually deleted. This can be largely resolved by calling `deleteOnExit()` at creation and trying to call `Utils.deleteRecursively` consistently to clean up, sometimes in an `@After` method.

_If anyone seconds the motion, I can create a more significant change that introduces a new test trait along the lines of `LocalSparkContext`, which provides management of temp directories for subclasses to take advantage of._

Author: Sean Owen <sowen@cloudera.com>

Closes #732 from srowen/SPARK-1798 and squashes the following commits:

5af578e [Sean Owen] Try to consistently delete test temp dirs and files, and set deleteOnExit() for each
b21b356 [Sean Owen] Remove work/ and checkpoint/ dirs with mvn clean
bdd0f41 [Sean Owen] Remove duplicate module dir in log4j.properties output path for tests
2014-05-12 14:16:19 -07:00
Bernardo Gomez Palacio d9c97ba397 SPARK-1806: Upgrade Mesos dependency to 0.18.1
Enabled Mesos (0.18.1) dependency with shaded protobuf

Why is this needed?
It avoids any protobuf version collision between Mesos and any other
dependency in Spark, e.g. Hadoop HDFS 2.2+ or 1.0.4.

Ticket: https://issues.apache.org/jira/browse/SPARK-1806

* Should close https://issues.apache.org/jira/browse/SPARK-1433

Author berngp

Author: Bernardo Gomez Palacio <bernardo.gomezpalacio@gmail.com>

Closes #741 from berngp/feature/SPARK-1806 and squashes the following commits:

5d70646 [Bernardo Gomez Palacio] SPARK-1806: Upgrade Mesos dependency to 0.18.1
2014-05-12 11:10:28 -07:00
Aaron Davidson 3af1f38643 SPARK-1772 Stop catching Throwable, let Executors die
The main issue this patch fixes is [SPARK-1772](https://issues.apache.org/jira/browse/SPARK-1772), in which Executors may not die when fatal exceptions (e.g., OOM) are thrown. This patch causes Executors to delegate to the ExecutorUncaughtExceptionHandler when a fatal exception is thrown.

This patch also continues the fight in the neverending war against `case t: Throwable =>`, by only catching Exceptions in many places, and adding a wrapper for Threads and Runnables to make sure any uncaught exceptions are at least printed to the logs.

It also turns out that it is unlikely that the IndestructibleActorSystem actually works, given testing ([here](https://gist.github.com/aarondav/ca1f0cdcd50727f89c0d)). The uncaughtExceptionHandler is not called from the places that we expected it would be.
[SPARK-1620](https://issues.apache.org/jira/browse/SPARK-1620) deals with part of this issue, but refactoring our Actor Systems to ensure that exceptions are dealt with properly is a much bigger change, outside the scope of this PR.

Author: Aaron Davidson <aaron@databricks.com>

Closes #715 from aarondav/throwable and squashes the following commits:

f9b9bfe [Aaron Davidson] Remove other redundant 'throw e'
e937a0a [Aaron Davidson] Address Prashant and Matei's comments
1867867 [Aaron Davidson] [RFC] SPARK-1772 Stop catching Throwable, let Executors die
2014-05-12 11:08:52 -07:00
Patrick Wendell 7d9cc9214b SPARK-1770: Load balance elements when repartitioning.
This patch adds better balancing when performing a repartition of an
RDD. Previously the elements in the RDD were hash partitioned, meaning
if the RDD was skewed certain partitions would end up being very large.

This commit adds load balancing of elements across the repartitioned
RDD splits. The load balancing is not perfect: a given output partition
can have up to N more elements than the average if there are N input
partitions. However, some randomization is used to minimize the
probability that this happens.

Author: Patrick Wendell <pwendell@gmail.com>

Closes #727 from pwendell/load-balance and squashes the following commits:

f9da752 [Patrick Wendell] Response to Matei's feedback
acfa46a [Patrick Wendell] SPARK-1770: Load balance elements when repartitioning.
2014-05-11 17:11:55 -07:00
witgo 6bee01dd04 remove outdated runtime Information scala home
Author: witgo <witgo@qq.com>

Closes #728 from witgo/scala_home and squashes the following commits:

cdfd8be [witgo] Merge branch 'master' of https://github.com/apache/spark into scala_home
fac094a [witgo] remove outdated runtime Information scala home
2014-05-11 14:34:27 -07:00
Andrew Or 83e0424d87 [SPARK-1774] Respect SparkSubmit --jars on YARN (client)
SparkSubmit ignores `--jars` for YARN client. This is a bug.

This PR also automatically adds the application jar to `spark.jar`. Previously, when running as yarn-client, you had to specify the jar additionally through `--files` (because `--jars` didn't work). Now you don't have to specify it explicitly through either.

Tested on a YARN cluster.

Author: Andrew Or <andrewor14@gmail.com>

Closes #710 from andrewor14/yarn-jars and squashes the following commits:

35d1928 [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-jars
c27bf6c [Andrew Or] For yarn-cluster and python, do not add primaryResource to spark.jar
c92c5bf [Andrew Or] Minor cleanups
269f9f3 [Andrew Or] Fix format
013d840 [Andrew Or] Fix tests
1407474 [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-jars
3bb75e8 [Andrew Or] Allow SparkSubmit --jars to take effect in yarn-client mode
2014-05-10 20:58:02 -07:00
Kan Zhang 6c2691d0a0 [SPARK-1690] Tolerating empty elements when saving Python RDD to text files
Tolerate empty strings in PythonRDD

Author: Kan Zhang <kzhang@apache.org>

Closes #644 from kanzhang/SPARK-1690 and squashes the following commits:

c62ad33 [Kan Zhang] Adding Python doctest
473ec4b [Kan Zhang] [SPARK-1690] Tolerating empty elements when saving Python RDD to text files
2014-05-10 14:01:08 -07:00
Bouke van der Bijl 3776f2f283 Add Python includes to path before depickling broadcast values
This fixes https://issues.apache.org/jira/browse/SPARK-1731 by adding the Python includes to the PYTHONPATH before depickling the broadcast values

@airhorns

Author: Bouke van der Bijl <boukevanderbijl@gmail.com>

Closes #656 from bouk/python-includes-before-broadcast and squashes the following commits:

7b0dfe4 [Bouke van der Bijl] Add Python includes to path before depickling broadcast values
2014-05-10 13:02:13 -07:00
Matei Zaharia 7eefc9d2b3 SPARK-1708. Add a ClassTag on Serializer and things that depend on it
This pull request contains a rebased patch from @heathermiller (https://github.com/heathermiller/spark/pull/1) to add ClassTags on Serializer and types that depend on it (Broadcast and AccumulableCollection). Putting these in the public API signatures now will allow us to use Scala Pickling for serialization down the line without breaking binary compatibility.

One question remaining is whether we also want them on Accumulator -- Accumulator is passed as part of a bigger Task or TaskResult object via the closure serializer so it doesn't seem super useful to add the ClassTag there. Broadcast and AccumulableCollection in contrast were being serialized directly.
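For readers unfamiliar with the pattern, adding a ClassTag to a serializer-style API looks roughly like this (an illustrative trait, not Spark's actual Serializer):

```scala
import java.nio.ByteBuffer
import scala.reflect.ClassTag

// Sketch: a ClassTag context bound gives the implementation access to the
// runtime class of T, which a pickling-based serializer could use later
// without any further signature (and hence binary compatibility) changes.
trait SketchSerializerInstance {
  def serialize[T: ClassTag](t: T): ByteBuffer
  def deserialize[T: ClassTag](bytes: ByteBuffer): T
}
```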

CC @rxin, @pwendell, @heathermiller

Author: Matei Zaharia <matei@databricks.com>

Closes #700 from mateiz/spark-1708 and squashes the following commits:

1a3d8b0 [Matei Zaharia] Use fake ClassTag in Java
3b449ed [Matei Zaharia] test fix
2209a27 [Matei Zaharia] Code style fixes
9d48830 [Matei Zaharia] Add a ClassTag on Serializer and things that depend on it
2014-05-10 12:10:24 -07:00
CodingCat 2f452cbaf3 SPARK-1686: keep schedule() calling in the main thread
https://issues.apache.org/jira/browse/SPARK-1686

moved from original JIRA (by @markhamstra):

In deploy.master.Master, the completeRecovery method is the last thing to be called when a standalone Master is recovering from failure. It is responsible for resetting some state, relaunching drivers, and eventually resuming its scheduling duties.

There are currently four places in Master.scala where completeRecovery is called. Three of them are from within the actor's receive method, and aren't problems. The last starts from within receive when the ElectedLeader message is received, but the actual completeRecovery() call is made from the Akka scheduler. That means that it will execute on a different scheduler thread, and Master itself will end up running (i.e., schedule()) from that Akka scheduler thread.

In this PR, I added a new master message, TriggerSchedule, so that the Akka scheduler thread only sends a message and the actual schedule() call is made locally in the actor's own (main) thread.
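Schematically (illustrative class name, based on the description above), the pattern looks like this:

```scala
import akka.actor.Actor

case object TriggerSchedule

// Sketch of the pattern: the Akka scheduler thread only sends a message;
// schedule() itself runs inside receive, i.e. on the actor's own thread.
class SketchMaster extends Actor {
  def receive: Receive = {
    case TriggerSchedule => schedule()
    // ... other master messages ...
  }

  private def schedule(): Unit = { /* resume scheduling duties */ }

  private def completeRecovery(): Unit = {
    // Called from an Akka scheduler callback (a non-actor thread):
    // hand the work back to the actor instead of calling schedule() directly.
    self ! TriggerSchedule
  }
}
```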

Author: CodingCat <zhunansjtu@gmail.com>

Closes #639 from CodingCat/SPARK-1686 and squashes the following commits:

81bb4ca [CodingCat] rename variable
69e0a2a [CodingCat] style fix
36a2ac0 [CodingCat] address Aaron's comments
ec9b7bb [CodingCat] address the comments
02b37ca [CodingCat] keep schedule() calling in the main thread
2014-05-09 21:50:23 -07:00
Aaron Davidson 59577df14c SPARK-1770: Revert accidental(?) fix
Looks like this change was accidentally committed here: 06b15baab2
but the change does not show up in the PR itself (#704).

Besides not being intended to go in with that PR, the change also broke the JavaAPISuite.repartition test.

Author: Aaron Davidson <aaron@databricks.com>

Closes #716 from aarondav/shufflerand and squashes the following commits:

b1cf70b [Aaron Davidson] SPARK-1770: Revert accidental(?) fix
2014-05-09 14:51:34 -07:00
Tathagata Das 32868f31f8 Converted bang to ask to avoid scary warning when a block is removed
Removing a block through the BlockManager gave scary warning messages in the driver.
```
2014-05-08 20:16:19,172 WARN BlockManagerMasterActor: Got unknown message: true
2014-05-08 20:16:19,172 WARN BlockManagerMasterActor: Got unknown message: true
2014-05-08 20:16:19,172 WARN BlockManagerMasterActor: Got unknown message: true
```

This is because the [BlockManagerSlaveActor](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/storage/BlockManagerSlaveActor.scala#L44) would send back an acknowledgement ("true"). But the BlockManagerMasterActor had sent the RemoveBlock message with a plain send (!), not with ask(), so it would reject the received "true" as an unknown message.
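For reference, a sketch of the two call styles (illustrative names, not the actual BlockManager code):

```scala
import akka.actor.ActorRef
import akka.pattern.ask
import akka.util.Timeout
import scala.concurrent.Future

object RemoveBlockSketch {
  case class RemoveBlock(blockId: String)

  // Fire-and-forget: the slave's "true" acknowledgement arrives as an unexpected
  // message and produces the "Got unknown message" warning shown above.
  def removeWithBang(slave: ActorRef, blockId: String): Unit =
    slave ! RemoveBlock(blockId)

  // ask: the acknowledgement comes back as a Future, so it has somewhere to go.
  def removeWithAsk(slave: ActorRef, blockId: String)
                   (implicit timeout: Timeout): Future[Boolean] =
    (slave ? RemoveBlock(blockId)).mapTo[Boolean]
}
```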
@pwendell

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #708 from tdas/bm-fix and squashes the following commits:

ed4ef15 [Tathagata Das] Converted bang to ask to avoid scary warning when a block is removed.
2014-05-08 22:34:08 -07:00
Patrick Wendell 4c60fd1e8c MINOR: Removing dead code.
Meant to do this when patching up the last merge.
2014-05-08 22:33:06 -07:00
Sandeep 7db47c463f SPARK-1775: Unneeded lock in ShuffleMapTask.deserializeInfo
This was used in the past to have a cache of deserialized ShuffleMapTasks, but that's been removed, so there's no need for a lock. It slows down Spark when task descriptions are large, e.g. due to large lineage graphs or local variables.

Author: Sandeep <sandeep@techaddict.me>

Closes #707 from techaddict/SPARK-1775 and squashes the following commits:

18d8ebf [Sandeep] SPARK-1775: Unneeded lock in ShuffleMapTask.deserializeInfo This was used in the past to have a cache of deserialized ShuffleMapTasks, but that's been removed, so there's no need for a lock. It slows down Spark when task descriptions are large, e.g. due to large lineage graphs or local variables.
2014-05-08 22:30:17 -07:00
Patrick Wendell 06b15baab2 SPARK-1565 (Addendum): Replace run-example with spark-submit.
Gives a nicely formatted message to the user when `run-example` is run to
tell them to use `spark-submit`.

Author: Patrick Wendell <pwendell@gmail.com>

Closes #704 from pwendell/examples and squashes the following commits:

1996ee8 [Patrick Wendell] Feedback form Andrew
3eb7803 [Patrick Wendell] Suggestions from TD
2474668 [Patrick Wendell] SPARK-1565 (Addendum): Replace `run-example` with `spark-submit`.
2014-05-08 22:26:36 -07:00
Andrew Or 8b78412994 [SPARK-1755] Respect SparkSubmit --name on YARN
Right now, SparkSubmit ignores the `--name` flag for both yarn-client and yarn-cluster. This is a bug.

In client mode, SparkSubmit treats `--name` as a [cluster config](https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L170) and does not propagate this to SparkContext.

In cluster mode, SparkSubmit passes this flag to `org.apache.spark.deploy.yarn.Client`, which only uses it for the [YARN ResourceManager](https://github.com/apache/spark/blob/master/yarn/stable/src/main/scala/org/apache/spark/deploy/yarn/Client.scala#L80), but does not propagate this to SparkContext.

This PR ensures that `spark.app.name` is always set if SparkSubmit receives the `--name` flag, which is what the usage promises. This makes it possible for applications to start a SparkContext with an empty conf `val sc = new SparkContext(new SparkConf)`, and inherit the app name from SparkSubmit.
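A minimal sketch of what an application can now rely on (assuming it is launched via `spark-submit --name MyApp ...`):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object NameCheck {
  def main(args: Array[String]): Unit = {
    // Empty conf: the app name is inherited from SparkSubmit's --name flag
    // instead of being dropped.
    val sc = new SparkContext(new SparkConf())
    println("app name: " + sc.getConf.get("spark.app.name"))
    sc.stop()
  }
}
```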

Tested both modes on a YARN cluster.

Author: Andrew Or <andrewor14@gmail.com>

Closes #699 from andrewor14/yarn-app-name and squashes the following commits:

98f6a79 [Andrew Or] Fix tests
dea932f [Andrew Or] Merge branch 'master' of github.com:apache/spark into yarn-app-name
c86d9ca [Andrew Or] Respect SparkSubmit --name on YARN
2014-05-08 20:45:29 -07:00
Andrew Or c3f8b78c21 [SPARK-1745] Move interrupted flag from TaskContext constructor (minor)
It makes little sense to start a TaskContext that is interrupted. Indeed, I searched for all use cases of it and didn't find a single instance in which `interrupted` is true on construction.

This was inspired by reviewing #640, which adds an additional `@volatile var completed` that is similar. These are not the most urgent changes, but I wanted to push them out before I forget.
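For context, the shape of the change is roughly this (illustrative class, not the real TaskContext):

```scala
// Before: the flag was a constructor parameter, even though it is never true
// at construction:
//   class TaskContext(..., var interrupted: Boolean = false)
//
// After: a plain field in the body makes it clear the flag only flips later.
class SketchTaskContext(val stageId: Int, val partitionId: Int) {
  @volatile var interrupted: Boolean = false
  @volatile var completed: Boolean = false   // the similar flag added in #640
}
```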

Author: Andrew Or <andrewor14@gmail.com>

Closes #675 from andrewor14/task-context and squashes the following commits:

9575e02 [Andrew Or] Add space
69455d1 [Andrew Or] Merge branch 'master' of github.com:apache/spark into task-context
c471490 [Andrew Or] Oops, removed one flag too many. Adding it back.
85311f8 [Andrew Or] Move interrupted flag from TaskContext constructor
2014-05-08 12:13:07 -07:00
Prashant Sharma 44dd57fb66 SPARK-1565, update examples to be used with spark-submit script.
Commit for initial feedback. Basically, I am curious whether we should prompt the user to provide args, especially when they are mandatory, and whether we can skip the prompt when they are not.

A few other things did not work either, for example:
`bin/spark-submit examples/target/scala-2.10/spark-examples-1.0.0-SNAPSHOT-hadoop1.0.4.jar --class org.apache.spark.examples.SparkALS --arg 100 500 10 5 2`

Not all of the args get passed properly; maybe I have messed something up and will try to sort it out.

Author: Prashant Sharma <prashant.s@imaginea.com>

Closes #552 from ScrapCodes/SPARK-1565/update-examples and squashes the following commits:

669dd23 [Prashant Sharma] Review comments
2727e70 [Prashant Sharma] SPARK-1565, update examples to be used with spark-submit script.
2014-05-08 10:23:05 -07:00
Andrew Or 5200872243 [SPARK-1688] Propagate PySpark worker stderr to driver
When at least one of the following conditions is true, PySpark cannot be loaded:

1. PYTHONPATH is not set
2. PYTHONPATH does not contain the python directory (or jar, in the case of YARN)
3. The jar does not contain pyspark files (YARN)
4. The jar does not contain py4j files (YARN)

However, we currently throw the same random `java.io.EOFException` for all of the above cases, when trying to read from the python daemon's output. This message is super unhelpful.

This PR includes the python stderr and the PYTHONPATH in the exception propagated to the driver. Now, the exception message looks something like:

```
Error from python worker:
  : No module named pyspark
PYTHONPATH was:
  /path/to/spark/python:/path/to/some/jar
java.io.EOFException
  <stack trace>
```

whereas before it was just

```
java.io.EOFException
  <stack trace>
```

Author: Andrew Or <andrewor14@gmail.com>

Closes #603 from andrewor14/pyspark-exception and squashes the following commits:

10d65d3 [Andrew Or] Throwable -> Exception, worker -> daemon
862d1d7 [Andrew Or] Merge branch 'master' of github.com:apache/spark into pyspark-exception
a5ed798 [Andrew Or] Use block string and interpolation instead of var (minor)
cc09c45 [Andrew Or] Account for the fact that the python daemon may not have terminated yet
444f019 [Andrew Or] Use the new RedirectThread + include system PYTHONPATH
aab00ae [Andrew Or] Merge branch 'master' of github.com:apache/spark into pyspark-exception
0cc2402 [Andrew Or] Merge branch 'master' of github.com:apache/spark into pyspark-exception
783efe2 [Andrew Or] Make python daemon stderr indentation consistent
9524172 [Andrew Or] Avoid potential NPE / error stream contention + Move things around
29f9688 [Andrew Or] Add back original exception type
e92d36b [Andrew Or] Include python worker stderr in the exception propagated to the driver
7c69360 [Andrew Or] Merge branch 'master' of github.com:apache/spark into pyspark-exception
cdbc185 [Andrew Or] Fix python attribute not found exception when PYTHONPATH is not set
dcc0353 [Andrew Or] Check both python and system environment variables for PYTHONPATH
6c09c21 [Andrew Or] Validate PYTHONPATH and PySpark modules before starting python workers
2014-05-07 14:35:22 -07:00
Andrew Ash 7f6f4a1035 Nicer logging for SecurityManager startup
Happy to open a jira ticket if you'd like to track one there.

Author: Andrew Ash <andrew@andrewash.com>

Closes #678 from ash211/SecurityManagerLogging and squashes the following commits:

2aa0b7a [Andrew Ash] Nicer logging for SecurityManager startup
2014-05-07 17:24:12 -04:00
Aaron Davidson 3308722ca0 SPARK-1579: Clean up PythonRDD and avoid swallowing IOExceptions
This patch includes several cleanups to PythonRDD, focused around fixing [SPARK-1579](https://issues.apache.org/jira/browse/SPARK-1579) cleanly. Listed in order of approximate importance:

- The Python daemon waits for Spark to close the socket before exiting,
  in order to avoid causing spurious IOExceptions in Spark's
  `PythonRDD::WriterThread`.
- Removes the Python Monitor Thread, which polled for task cancellations
  in order to kill the Python worker. Instead, we do this in the
  onCompleteCallback, since this is guaranteed to be called during
  cancellation.
- Adds a "completed" variable to TaskContext to avoid the issue noted in
  [SPARK-1019](https://issues.apache.org/jira/browse/SPARK-1019), where onCompleteCallbacks may be execution-order dependent.
  Along with this, I removed the "context.interrupted = true" flag in
  the onCompleteCallback.
- Extracts PythonRDD::WriterThread to its own class.

Since this patch provides an alternative solution to [SPARK-1019](https://issues.apache.org/jira/browse/SPARK-1019), I did test it with

```
sc.textFile("latlon.tsv").take(5)
```

many times without error.

Additionally, in order to test the unswallowed exceptions, I performed

```
sc.textFile("s3n://<big file>").count()
```

and cut my internet during execution. Prior to this patch, we got the "stdin writer exited early" message, which was unhelpful. Now, we get the SocketExceptions propagated through Spark to the user and get proper (though unsuccessful) task retries.

Author: Aaron Davidson <aaron@databricks.com>

Closes #640 from aarondav/pyspark-io and squashes the following commits:

b391ff8 [Aaron Davidson] Detect "clean socket shutdowns" and stop waiting on the socket
c0c49da [Aaron Davidson] SPARK-1579: Clean up PythonRDD and avoid swallowing IOExceptions
2014-05-07 09:48:31 -07:00
Kan Zhang 967635a242 [SPARK-1460] Returning SchemaRDD instead of normal RDD on Set operations...
... that do not change schema

Author: Kan Zhang <kzhang@apache.org>

Closes #448 from kanzhang/SPARK-1460 and squashes the following commits:

111e388 [Kan Zhang] silence MiMa errors in EdgeRDD and VertexRDD
91dc787 [Kan Zhang] Taking into account newly added Ordering param
79ed52a [Kan Zhang] [SPARK-1460] Returning SchemaRDD on Set operations that do not change schema
2014-05-07 09:41:31 -07:00
Patrick Wendell 913a0a9c0a SPARK-1746: Support setting SPARK_JAVA_OPTS on executors for backwards compatibility
Author: Patrick Wendell <pwendell@gmail.com>

Closes #676 from pwendell/worker-opts and squashes the following commits:

54456c4 [Patrick Wendell] SPARK-1746: Support setting SPARK_JAVA_OPTS on executors for backwards compatibility
2014-05-07 00:11:05 -07:00
Matei Zaharia 951a5d9398 [SPARK-1549] Add Python support to spark-submit
This PR updates spark-submit to allow submitting Python scripts (currently only with deploy-mode=client, but that's all that was supported before) and updates the PySpark code to properly find various paths, etc. One significant change is that we assume we can always find the Python files either from the Spark assembly JAR (which will happen with the Maven assembly build in make-distribution.sh) or from SPARK_HOME (which will exist in local mode even if you use sbt assembly, and should be enough for testing). This means we no longer need a weird hack to modify the environment for YARN.

This patch also updates the Python worker manager to run python with -u, which means unbuffered output (send it to our logs right away instead of waiting a while after stuff was written); this should simplify debugging.

In addition, it fixes https://issues.apache.org/jira/browse/SPARK-1709, setting the main class from a JAR's Main-Class attribute if not specified by the user, and fixes a few help strings and style issues in spark-submit.

In the future we may want to make the `pyspark` shell use spark-submit as well, but it seems unnecessary for 1.0.

Author: Matei Zaharia <matei@databricks.com>

Closes #664 from mateiz/py-submit and squashes the following commits:

15e9669 [Matei Zaharia] Fix some uses of path.separator property
051278c [Matei Zaharia] Small style fixes
0afe886 [Matei Zaharia] Add license headers
4650412 [Matei Zaharia] Add pyFiles to PYTHONPATH in executors, remove old YARN stuff, add tests
15f8e1e [Matei Zaharia] Set PYTHONPATH in PythonWorkerFactory in case it wasn't set from outside
47c0655 [Matei Zaharia] More work to make spark-submit work with Python:
d4375bd [Matei Zaharia] Clean up description of spark-submit args a bit and add Python ones
2014-05-06 15:12:35 -07:00
witgo ec09acdd4a SPARK-1734: spark-submit throws an exception: Exception in thread "main"...
... java.lang.ClassNotFoundException: org.apache.spark.broadcast.TorrentBroadcastFactory

Author: witgo <witgo@qq.com>

Closes #665 from witgo/SPARK-1734 and squashes the following commits:

cacf238 [witgo] SPARK-1734: spark-submit throws an exception: Exception in thread "main" java.lang.ClassNotFoundException: org.apache.spark.broadcast.TorrentBroadcastFactory
2014-05-06 14:17:39 -07:00
Mark Hamstra fbfe69de69 [SPARK-1685] Cancel retryTimer on restart of Worker or AppClient
See https://issues.apache.org/jira/browse/SPARK-1685 for a more complete description, but in essence: If the Worker or AppClient actor restarts before successfully registering with Master, multiple retryTimers will be running, which will lead to less than the full number of registration retries being attempted before the new actor is forced to give up.
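A sketch of the fix pattern (illustrative actor, assumed scheduling API usage): keep a handle to the timer and cancel it when the actor stops or restarts, so a restarted Worker/AppClient never has two retry timers running at once.

```scala
import akka.actor.{Actor, Cancellable}
import scala.concurrent.duration._

class SketchRegistrator extends Actor {
  import context.dispatcher
  private var registrationRetryTimer: Option[Cancellable] = None

  override def preStart(): Unit = {
    registrationRetryTimer = Some(
      context.system.scheduler.schedule(1.second, 5.seconds, self, "retry-registration"))
  }

  // postStop also runs on restart (via the default preRestart), so the old
  // timer is cancelled before the restarted actor schedules a new one.
  override def postStop(): Unit = registrationRetryTimer.foreach(_.cancel())

  def receive: Receive = {
    case "retry-registration" => // attempt to (re)register with the Master
  }
}
```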

Author: Mark Hamstra <markhamstra@gmail.com>

Closes #602 from markhamstra/SPARK-1685 and squashes the following commits:

11cc088 [Mark Hamstra] retryTimer -> registrationRetryTimer
69c348c [Mark Hamstra] Cancel retryTimer on restart of Worker or AppClient
2014-05-06 12:53:39 -07:00
ArcherShao 0a5a468114 Update OpenHashSet.scala
Fix an incorrect comment on the addWithoutResize function.

Author: ArcherShao <ArcherShao@users.noreply.github.com>

Closes #667 from ArcherShao/patch-3 and squashes the following commits:

a607358 [ArcherShao] Update OpenHashSet.scala
2014-05-06 10:12:59 -07:00
Andrew Or ea10b31261 Expose SparkListeners and relevant classes as DeveloperApi
Hopefully this can go into 1.0, as a few people on the user list have asked for this.

Author: Andrew Or <andrewor14@gmail.com>

Closes #648 from andrewor14/expose-listeners and squashes the following commits:

e45e1ef [Andrew Or] Add missing colons (minor)
350d643 [Andrew Or] Expose SparkListeners and relevant classes as DeveloperApi
2014-05-05 18:32:14 -07:00
Sandy Ryza 8e724dcbad SPARK-1728. JavaRDDLike.mapPartitionsWithIndex requires ClassTag
Author: Sandy Ryza <sandy@cloudera.com>

Closes #657 from sryza/sandy-spark-1728 and squashes the following commits:

4751443 [Sandy Ryza] SPARK-1728. JavaRDDLike.mapPartitionsWithIndex requires ClassTag
2014-05-05 18:26:34 -07:00
Bouke van der Bijl 3292e2a71b SPARK-1721: Reset the thread classLoader in the Mesos Executor
This is because Mesos calls it with a different environment or something; as a result, the Spark jar is missing and it cannot load classes.

This fixes http://apache-spark-user-list.1001560.n3.nabble.com/java-lang-ClassNotFoundException-spark-on-mesos-td3510.html

I have no idea whether this is the right fix, I can only confirm that it fixes the issue for us.

The `registered` method is called from Mesos (765ff9bc2a/src/java/jni/org_apache_mesos_MesosExecutorDriver.cpp).

I am unsure which commit caused this regression.
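For what it's worth, the helper mentioned in the squashed commit below (getContextOrSparkClassLoader) boils down to something like this sketch (the exact body is an assumption):

```scala
object ClassLoaderSketch {
  // Prefer the thread's context classloader; fall back to the loader that
  // loaded this class (i.e. the one that can definitely see the Spark jar).
  def getContextOrSparkClassLoader: ClassLoader =
    Option(Thread.currentThread.getContextClassLoader)
      .getOrElse(getClass.getClassLoader)

  // e.g. Class.forName(className, true, getContextOrSparkClassLoader)
}
```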

Author: Bouke van der Bijl <boukevanderbijl@gmail.com>

Closes #620 from bouk/mesos-classloader-fix and squashes the following commits:

c13eae0 [Bouke van der Bijl] Use getContextOrSparkClassLoader in SparkEnv and CompressionCodec
2014-05-05 11:19:36 -07:00
Sandeep b48a55ae9f SPARK-1710: spark-submit should print better errors than "InvocationTargetException"
Catch the InvocationTargetException and print its getTargetException instead.
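A sketch of the idea (not the exact SparkSubmit code): unwrap the reflective exception so the user sees the real cause, such as a missing class.

```scala
import java.lang.reflect.{InvocationTargetException, Method}

object InvokeSketch {
  def invokeMain(mainMethod: Method, args: Array[String]): Unit = {
    try {
      mainMethod.invoke(null, args)
    } catch {
      case e: InvocationTargetException =>
        // Surface the underlying error instead of the reflective wrapper.
        throw Option(e.getTargetException).getOrElse(e)
    }
  }
}
```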

Author: Sandeep <sandeep@techaddict.me>

Closes #630 from techaddict/SPARK-1710 and squashes the following commits:

834d79b [Sandeep] changes from srowen  suggestions
109d604 [Sandeep] SPARK-1710: spark-submit should print better errors than "InvocationTargetException"
2014-05-04 20:51:53 -07:00
Sean Owen f5041579ff SPARK-1629. Addendum: Depend on commons lang3 (already used by tachyon) as it's used in ReplSuite, and return to use lang3 utility in Utils.scala
For consideration. This was proposed in related discussion: https://github.com/apache/spark/pull/569

Author: Sean Owen <sowen@cloudera.com>

Closes #635 from srowen/SPARK-1629.2 and squashes the following commits:

a442b98 [Sean Owen] Depend on commons lang3 (already used by tachyon) as it's used in ReplSuite, and return to use lang3 utility in Utils.scala
2014-05-04 17:43:35 -07:00
Aaron Davidson 34719ba32e SPARK-1689 AppClient should indicate app is dead() when removed
Previously, we indicated disconnected(), which keeps the application in a limbo state where it has no executors but thinks it will get them soon.

This is a bug fix that hopefully can be included in 1.0.

Author: Aaron Davidson <aaron@databricks.com>

Closes #605 from aarondav/appremoved and squashes the following commits:

bea02a2 [Aaron Davidson] SPARK-1689 AppClient should indicate app is dead() when removed
2014-05-03 13:27:10 -07:00
Cheng Lian ce72c72aec [Bugfix] Tachyon file cleanup logical error
Should look up `shutdownDeleteTachyonPaths` instead of `shutdownDeletePaths`. This also includes a minor style cleanup: `find {...}.isDefined` becomes `exists {...}`.
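The style change in question, as a tiny REPL-style sketch:

```scala
val paths = Set("/tmp/spark-a", "/tmp/spark-b")

// Equivalent results; `exists` states the intent directly.
val verbose = paths.find(_.startsWith("/tmp/spark")).isDefined
val concise = paths.exists(_.startsWith("/tmp/spark"))
```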

Author: Cheng Lian <lian.cs.zju@gmail.com>

Closes #575 from liancheng/tachyonFix and squashes the following commits:

deb8f31 [Cheng Lian] Fixed logical error in when cleanup Tachyon files and minor style cleanup
2014-05-03 13:23:52 -07:00
Thomas Graves 3d0a02dff3 [WIP] SPARK-1676: Cache Hadoop UGIs by default to prevent FileSystem leak
Move the doAs in Executor higher up so that we only have one UGI and aren't leaking FileSystems.
Fix Spark on YARN to work when the cluster is running as user "yarn" but the clients are launched as the submitting user and want to read/write to HDFS as that user.
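A sketch of the doAs pattern being hoisted (names and placement are assumptions, not the exact change): create one UGI per user up front and run the work inside a single doAs, so repeated doAs calls don't keep creating fresh UGIs and leaking cached FileSystems.

```scala
import java.security.PrivilegedExceptionAction
import org.apache.hadoop.security.UserGroupInformation

object UgiSketch {
  def runAsUser[T](sparkUser: String)(body: => T): T = {
    val ugi = UserGroupInformation.createRemoteUser(sparkUser)
    ugi.doAs(new PrivilegedExceptionAction[T] {
      override def run(): T = body // HDFS access in here is performed as sparkUser
    })
  }
}
```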

Note: this hasn't been fully tested yet. It still needs testing in standalone mode.

Putting this up for people to look at and possibly test. I don't have access to a Mesos cluster.

This is alternative to https://github.com/apache/spark/pull/607

Author: Thomas Graves <tgraves@apache.org>

Closes #621 from tgravescs/SPARK-1676 and squashes the following commits:

244d55a [Thomas Graves] fix line length
44163d4 [Thomas Graves] Rework
9398853 [Thomas Graves] change to have doAs in executor higher up.
2014-05-03 10:59:05 -07:00
Aaron Davidson 0a14421765 SPARK-1700: Close socket file descriptors on task completion
This will ensure that sockets do not build up over the course of a job, and that cancellation successfully cleans up sockets.
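A sketch of the intent (not the actual PythonRDD code): tie the socket's lifetime to the task body so that completion, failure, and cancellation all release the file descriptor.

```scala
import java.net.Socket

object SocketSketch {
  def withWorkerSocket[T](socket: Socket)(body: Socket => T): T = {
    try body(socket)
    finally socket.close() // closed on success, failure, and cancellation alike
  }
}
```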

Tested in standalone mode. More file descriptors are spawned than expected (around 1000 rather than the expected ~8), but they do not pile up between runs, nor climb as high as before (where they went up to around 5k).

Author: Aaron Davidson <aaron@databricks.com>

Closes #623 from aarondav/pyspark2 and squashes the following commits:

0ca13bb [Aaron Davidson] SPARK-1700: Close socket file descriptors on task completion
2014-05-02 23:55:13 -07:00
wangfei 4bf24f7897 delete no use var
Author: wangfei <wangfei_hello@126.com>

Closes #613 from scwf/masterIndex and squashes the following commits:

1463056 [wangfei] delete no use var: masterIndex
2014-05-02 21:34:54 -07:00
Andrew Or 394d8cb1c4 Add tests for FileLogger, EventLoggingListener, and ReplayListenerBus
Modifications to Spark core are limited to exposing functionality to test files + minor style fixes.
(728 / 769 lines are from tests)

Author: Andrew Or <andrewor14@gmail.com>

Closes #591 from andrewor14/event-log-tests and squashes the following commits:

2883837 [Andrew Or] Merge branch 'master' of github.com:apache/spark into event-log-tests
c3afcea [Andrew Or] Compromise
2d5daf8 [Andrew Or] Use temp directory provided by the OS rather than /tmp
2b52151 [Andrew Or] Remove unnecessary file delete + add a comment
62010fd [Andrew Or] More cleanup (renaming variables, updating comments etc)
ad2beff [Andrew Or] Clean up EventLoggingListenerSuite + modify a few comments
862e752 [Andrew Or] Merge branch 'master' of github.com:apache/spark into event-log-tests
e0ba2f8 [Andrew Or] Fix test failures caused by race condition in processing/mutating events
b990453 [Andrew Or] ReplayListenerBus suite - tests do not all pass yet
ab66a84 [Andrew Or] Tests for FileLogger + delete file after tests
187bb25 [Andrew Or] Formatting and renaming variables
769336f [Andrew Or] Merge branch 'master' of github.com:apache/spark into event-log-tests
5d38ffe [Andrew Or] Clean up EventLoggingListenerSuite + add comments
e12f4b1 [Andrew Or] Preliminary tests for EventLoggingListener (need major cleanup)
2014-05-01 21:42:06 -07:00