Conflicts:
bagel/pom.xml
core/pom.xml
core/src/test/scala/org/apache/spark/ui/UISuite.scala
examples/pom.xml
mllib/pom.xml
pom.xml
project/SparkBuild.scala
repl/pom.xml
streaming/pom.xml
tools/pom.xml
In scala 2.10, a shorter representation is used for naming artifacts
so changed to shorter scala version for artifacts and made it a property in pom.
Fix inconsistent and incorrect log messages in shuffle read path
The user-facing messages generated by the CacheManager are currently wrong and somewhat misleading. This patch makes the messages more accurate. It also uses a consistent representation of the partition being fetched (`rdd_xx_yy`) so that it's easier for users to trace what is going on when reading logs.
Resolving package conflicts with hadoop 0.23.9
Hadoop 0.23.9 is having a package conflict with easymock's dependencies.
(cherry picked from commit 023e3fdf00)
Signed-off-by: Reynold Xin <rxin@apache.org>
merge in remaining changes from `branch-0.8`
This merges in the following changes from `branch-0.8`:
- The scala version is included in the published maven artifact names
- A unit tests which had non-deterministic failures is ignored (see SPARK-908)
- A minor documentation change shows the short version instead of the full version
- Moving the kafka jar to be "provided"
- Changing the default spark ec2 version.
- Some spacing changes caused by Maven's release plugin
Note that I've squashed this into a single commit rather than pull in the branch-0.8 history. There are a bunch of release/revert commits there that make the history super ugly.
Allow users to pass broadcasted Configurations and cache InputFormats across Hadoop file reads.
Note: originally from https://github.com/mesos/spark/pull/942
Currently motivated by Shark queries on Hive-partitioned tables, where there's a JobConf broadcast for every Hive-partition (i.e., every subdirectory read). The only thing different about those JobConfs is the input path - the Hadoop Configuration that the JobConfs are constructed from remain the same.
This PR only modifies the old Hadoop API RDDs, but similar additions to the new API might reduce computation latencies a little bit for high-frequency FileInputDStreams (which only uses the new API right now).
As a small bonus, added InputFormats caching, to avoid reflection calls for every RDD#compute().
Few other notes:
Added a general soft-reference hashmap in SparkHadoopUtil because I wanted to avoid adding another class to SparkEnv.
SparkContext default hadoopConfiguration isn't cached. There's no equals() method for Configuration, so there isn't a good way to determine when configuration properties have changed.
SPARK-920/921 - JSON endpoint updates
920 - Removal of duplicate scheme part of Spark URI, it was appearing as spark://spark//host:port in the JSON field.
JSON now delivered as:
url:spark://127.0.0.1:7077
921 - Adding the URL of the Main Application UI will allow custom interfaces (that use the JSON output) to redirect from the standalone UI.
Fixing SPARK-602: PythonPartitioner
Currently PythonPartitioner determines partition ID by hashing a
byte-array representation of PySpark's key. This PR lets
PythonPartitioner use the actual partition ID, which is required e.g.
for sorting via PySpark.
Currently PythonPartitioner determines partition ID by hashing a
byte-array representation of PySpark's key. This PR lets
PythonPartitioner use the actual partition ID, which is required e.g.
for sorting via PySpark.
fixed a wildcard bug in make-distribution.sh; ask sbt to check local
maven repo in project/SparkBuild.scala
(1) fixed a wildcard bug in make-distribution.sh:
with the wildcard * in quotes, this cp command failed. it worked after
moving the wildcard out quotes.
(2) ask sbt to check local maven repo in SparkBuild.scala:
To build Spark (0.9.0-SNAPSHOT) with the HEAD of mesos (0.15.0), I must
do "make maven-install" under mesos/build, which publishes the java .jar
file under ~/.m2. However, when building Spark (after pointing mesos to
version 0.15.0), sbt uses ivy which by default only checks ~/.ivy2. This
change is to tell sbt to also check ~/.m2.
Send Task results through the block manager when larger than Akka frame size (fixes SPARK-669).
This change requires adding an extra failure mode: tasks can complete
successfully, but the result gets lost or flushed from the block manager
before it's been fetched.
This change also moves the deserialization of tasks into a separate thread, so it's no longer part of the DAG scheduler's tight loop. This should improve scheduler throughput, particularly when tasks are sending back large results.
Thanks Josh for writing the original version of this patch!
This is duplicated from the mesos/spark repo: https://github.com/mesos/spark/pull/835
Improved organization of scheduling packages.
This commit does not change any code -- only file organization.
Please let me know if there was some masterminded strategy behind
the existing organization that I failed to understand!
There are two components of this change:
(1) Moving files out of the cluster package, and down
a level to the scheduling package. These files are all used by
the local scheduler in addition to the cluster scheduler(s), so
should not be in the cluster package. As a result of this change,
none of the files in the local package reference files in the
cluster package.
(2) Moving the mesos package to within the cluster package.
The mesos scheduling code is for a cluster, and represents a
specific case of cluster scheduling (the Mesos-related classes
often subclass cluster scheduling classes). Thus, the most logical
place for it seems to be within the cluster package.
The one thing about the scheduling code that seems a little funny to me
is the naming of the SchedulerBackends. The StandaloneSchedulerBackend
is not just for Standalone mode, but instead is used by Mesos coarse grained
mode and Yarn, and the backend that *is* just for Standalone mode is instead called SparkDeploySchedulerBackend. I didn't change this because I wasn't sure if there
was a reason for this naming that I'm just not aware of.
some minor fixes to MemoryStore
This is a repeat of #5, moved to its own branch in my repo.
This makes all updates to on ; it skips on synchronizing the reads where it can get away with it.