Add support for local:// URI scheme for addJars()
This PR adds support for a new URI scheme for SparkContext.addJars(): `local://file/path`.
The *local* scheme indicates that the `/file/path` exists on every worker node. The reason for its existence is for big library JARs, which would be really expensive to serve using the standard HTTP fileserver distribution method, especially for big clusters. Today the only inexpensive method (assuming such a file is on every host, via say NFS, rsync, etc.) of doing this is to add the JAR to the SPARK_CLASSPATH, but we want a method where the user does not need to modify the Spark configuration.
I would add something to the docs, but it's not obvious where to add it.
Oh, and it would be great if this could be merged in time for 0.8.1.
Display both task ID and task attempt ID in UI, and rename taskId to taskAttemptId
Previously only the task attempt ID was shown in the UI; this was confusing because the job can be shown as complete while there are tasks still running. Showing the task ID in addition to the attempt ID makes it clear which tasks are redundant.
This commit also renames taskId to taskAttemptId in TaskInfo and in the local/cluster schedulers. This identifier was used to uniquely identify attempts, not tasks, so the current naming was confusing. The new naming is also more consistent with map reduce.
System.getProperties.toMap will fail-fast when concurrently modified,
and it seems like some other thread started by SparkContext does
a System.setProperty during it's initialization.
Handle this by just looping on ConcurrentModificationException, which
seems the safest, since the non-fail-fast methods (Hastable.entrySet)
have undefined behavior under concurrent modification.
Added new Spark Streaming operations
New operations
- transformWith which allows arbitrary 2-to-1 DStream transform, added to Scala and Java API
- StreamingContext.transform to allow arbitrary n-to-1 DStream
- leftOuterJoin and rightOuterJoin between 2 DStreams, added to Scala and Java API
- missing variations of join and cogroup added to Scala Java API
- missing JavaStreamingContext.union
Updated a number of Java and Scala API docs
Properly display the name of a stage in the UI.
This fixes a bug introduced by the fix for SPARK-940, which
changed the UI to display the RDD name rather than the stage
name. As a result, no name for the stage was shown when
using the Spark shell, which meant that there was no way to
click on the stage to see more details (e.g., the running
tasks). This commit changes the UI back to using the
stage name.
@pwendell -- let me know if this change was intentional
This fixes a bug introduced by the fix for SPARK-940, which
changed the UI to display the RDD name rather than the stage
name. As a result, no name for the stage was shown when
using the Spark shell, which meant that there was no way to
click on the stage to see more details (e.g., the running
tasks). This commit changes the UI back to using the
stage name.
This patch adds an operator called repartition with more straightforward
semantics than the current `coalesce` operator. There are a few use cases
where this operator is useful:
1. If a user wants to increase the number of partitions in the RDD. This
is more common now with streaming. E.g. a user is ingesting data on one
node but they want to add more partitions to ensure parallelism of
subsequent operations across threads or the cluster.
Right now they have to call rdd.coalesce(numSplits, shuffle=true) - that's
super confusing.
2. If a user has input data where the number of partitions is not known. E.g.
> sc.textFile("some file").coalesce(50)....
This is both vague semantically (am I growing or shrinking this RDD) but also,
may not work correctly if the base RDD has fewer than 50 partitions.
The new operator forces shuffles every time, so it will always produce exactly
the number of new partitions. It also throws an exception rather than silently
not-working if a bad input is passed.
I am currently adding streaming tests (requires refactoring some of the test
suite to allow testing at partition granularity), so this is not ready for
merge yet. But feedback is welcome.
Show "GETTING_RESULTS" state in UI.
This commit adds a set of calls using the SparkListener interface
that indicate when a task is remotely fetching results, so that
we can display this (potentially time-consuming) phase of execution
to users through the UI.
This should fix SPARK-902, an issue where some
Java API Function classes could cause
AbstractMethodErrors when user code is compiled
using the Eclipse compiler.
Thanks to @MartinWeindel for diagnosing this
problem.
(This PR subsumes / closes#30)
This patch fixes a bug where the Spark UI didn't display the correct number of total
tasks if the number of tasks in a Stage doesn't equal the number of RDD partitions.
It also cleans up the listener API a bit by embedding this information in the
StageInfo class rather than passing it seperately.
Split MapOutputTracker into Master/Worker classes
Previously, MapOutputTracker contained fields and methods that were only applicable to the master or worker instances. This commit introduces a MasterMapOutputTracker class to prevent the master-specific methods from being accessed on workers.
I also renamed a few methods and made others protected/private.
This commit adds a set of calls using the SparkListener interface
that indicate when a task is remotely fetching results, so that
we can display this (potentially time-consuming) phase of execution
to users through the UI.
Made the following traits/interfaces/classes non-public:
Made the following traits/interfaces/classes non-public:
SparkHadoopWriter
SparkHadoopMapRedUtil
SparkHadoopMapReduceUtil
SparkHadoopUtil
PythonAccumulatorParam
BlockManagerSlaveActor
Provide Instrumentation for Shuffle Write Performance
Shuffle write performance can have a major impact on the performance of jobs. This patch adds a few pieces of instrumentation related to shuffle writes. They are:
1. A listing of the time spent performing blocking writes for each task. This is implemented by keeping track of the aggregate delay seen by many individual writes.
2. An undocumented option `spark.shuffle.sync` which forces shuffle data to sync to disk. This is necessary for measuring shuffle performance in the absence of the OS buffer cache.
3. An internal utility which micro-benchmarks write throughput for simulated shuffle outputs.
I'm going to do some performance testing on this to see whether these small timing calls add overhead. From a feature perspective, however, I consider this complete. Any feedback is appreciated.
Don't setup the uncaught exception handler in local mode.
This avoids unit test failures for Spark streaming.
java.util.concurrent.RejectedExecutionException: Task org.apache.spark.streaming.JobManager$JobHandler@38cf728d rejected from java.util.concurrent.ThreadPoolExecutor@3b69a41e[Terminated, pool size = 0, active threads = 0, queued tasks = 0, completed tasks = 14]
at java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2048)
at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:821)
at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1372)
at org.apache.spark.streaming.JobManager.runJob(JobManager.scala:54)
at org.apache.spark.streaming.Scheduler$$anonfun$generateJobs$2.apply(Scheduler.scala:108)
at org.apache.spark.streaming.Scheduler$$anonfun$generateJobs$2.apply(Scheduler.scala:108)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:60)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.streaming.Scheduler.generateJobs(Scheduler.scala:108)
at org.apache.spark.streaming.Scheduler$$anonfun$1.apply$mcVJ$sp(Scheduler.scala:41)
at org.apache.spark.streaming.util.RecurringTimer.org$apache$spark$streaming$util$RecurringTimer$$loop(RecurringTimer.scala:66)
at org.apache.spark.streaming.util.RecurringTimer$$anon$1.run(RecurringTimer.scala:34)
Code de-duplication in BlockManager
The BlockManager has a few methods that duplicate most of their code. This pull request extracts the duplicated code into private doPut(), doGetLocal(), and doGetRemote() methods that unify the storing/reading of bytes or objects.
I believe that I preserved the logic of the original code, but I'd appreciate some help in reviewing this.
The Spark shuffle phase can produce a large number of files, as one file is created
per mapper per reducer. For large or repeated jobs, this often produces millions of
shuffle files, which sees extremely degredaded performance from the OS file system.
This patch seeks to reduce that burden by combining multipe shuffle files into one.
This PR draws upon the work of Jason Dai in https://github.com/mesos/spark/pull/669.
However, it simplifies the design in order to get the majority of the gain with less
overall intellectual and code burden. The vast majority of code in this pull request
is a refactor to allow the insertion of a clean layer of indirection between logical
block ids and physical files. This, I feel, provides some design clarity in addition
to enabling shuffle file consolidation.
The main goal is to produce one shuffle file per reducer per active mapper thread.
This allows us to isolate the mappers (simplifying the failure modes), while still
allowing us to reduce the number of mappers tremendously for large tasks. In order
to accomplish this, we simply create a new set of shuffle files for every parallel
task, and return the files to a pool which will be given out to the next run task.
The main goal of this refactor was to allow the interposition of a new layer which
maps logical BlockIds to physical locations other than a file with the same name
as the BlockId. In particular, BlockIds will need to be mappable to chunks of files,
as multiple will be stored in the same file.
In order to accomplish this, the following changes have been made:
- Creation of DiskBlockManager, which manages the association of logical BlockIds
to physical disk locations (called FileSegments). By default, Blocks are simply
mapped to physical files of the same name, as before.
- The DiskStore now indirects all requests for a given BlockId through the DiskBlockManager
in order to resolve the actual File location.
- DiskBlockObjectWriter has been merged into BlockObjectWriter.
- The Netty PathResolver has been changed to map BlockIds into FileSegments, as this
codepath is the only one that uses Netty, and that is likely to remain the case.
Overall, I think this refactor produces a clearer division between the logical Block
paradigm and their physical on-disk location. There is now an explicit (and documented)
mapping from one to the other.
Previously, MapOutputTracker contained fields and methods that
were only applicable to the master or worker instances. This
commit introduces a MasterMapOutputTracker class to prevent
the master-specific methods from being accessed on workers.
I also renamed a few methods and made others protected/private.
Job cancellation via job group id.
This PR adds a simple API to group together a set of jobs belonging to a thread and threads spawned from it. It also allows the cancellation of all jobs in this group.
An example:
sc.setJobDescription("this_is_the_group_id", "some job description")
sc.parallelize(1 to 10000, 2).map { i => Thread.sleep(10); i }.count()
In a separate thread:
sc.cancelJobGroup("this_is_the_group_id")
Faster and stable/reliable broadcast
HttpBroadcast is noticeably slow, but the alternatives (TreeBroadcast or BitTorrentBroadcast) are notoriously unreliable. The main problem with them is they try to manage the memory for the pieces of a broadcast themselves. Right now, the BroadcastManager does not know which machines the tasks reading from a broadcast variable is running and when they have finished. Consequently, we try to guess and often guess wrong, which blows up the memory usage and kills/hangs jobs.
This very simple implementation solves the problem by not trying to manage the intermediate pieces; instead, it offloads that duty to the BlockManager which is quite good at juggling blocks. Otherwise, it is very similar to the BitTorrentBroadcast implementation (without fancy optimizations). And it runs much faster than HttpBroadcast we have right now.
I've been using this for another project for last couple of weeks, and just today did some benchmarking against the Http one. The following shows the improvements for increasing broadcast size for cold runs. Each line represent the number of receivers.
![fix-bc-first](https://f.cloud.github.com/assets/232966/1349342/ffa149e4-36e7-11e3-9fa6-c74555829356.png)
After the first broadcast is over, i.e., after JVM is wormed up and for HttpBroadcast the server is already running (I think), the following are the improvements for warm runs.
![fix-bc-succ](https://f.cloud.github.com/assets/232966/1349352/5a948bae-36e8-11e3-98ce-34f19ebd33e0.jpg)
The curves are not as nice as the cold runs, but the improvements are obvious, specially for larger broadcasts and more receivers.
Depending on how it goes, we should deprecate and/or remove old TreeBroadcast and BitTorrentBroadcast implementations, and hopefully, SPARK-889 will not be necessary any more.
Mainly, this occurs if you provide a messed up MASTER url (one that doesn't match one
of our regexes). Previously, we would default to Mesos, fail, and then start the shell
anyway, except that any Spark command would fail.
The interface was used only by the DAG scheduler (so it wasn't necessary
to define the additional interface), and the naming makes it very
confusing when reading the code (because "listener" was used
to describe the DAG scheduler, rather than SparkListeners, which
implement a nearly-identical interface but serve a different
function).
Renamed StandaloneX to CoarseGrainedX.
(as suggested by @rxin here https://github.com/apache/incubator-spark/pull/14)
The previous names were confusing because the components weren't just
used in Standalone mode. The scheduler used for Standalone
mode is called SparkDeploySchedulerBackend, so referring to the base class
as StandaloneSchedulerBackend was misleading.
Refactor BlockId into an actual type
Converts all of our BlockId strings into actual BlockId types. Here are some advantages of doing this now:
+ Type safety
+ Code clarity - it's now obvious what the key of a shuffle or rdd block is, for instance. Additionally, appearing in tuple/map type signatures is a big readability bonus. A Seq[(String, BlockStatus)] is not very clear. Further, we can now use more Scala features, like matching on BlockId types.
+ Explicit usage - we can now formally tell where various BlockIds are being used (without doing string searches); this makes updating current BlockIds a much clearer process, and compiler-supported.
(I'm looking at you, shuffle file consolidation.)
+ It will only get harder to make this change as time goes on.
Downside is, of course, that this is a very invasive change touching a lot of different files, which will inevitably lead to merge conflicts for many.
This is an unfortunately invasive change which converts all of our BlockId
strings into actual BlockId types. Here are some advantages of doing this now:
+ Type safety
+ Code clarity - it's now obvious what the key of a shuffle or rdd block is,
for instance. Additionally, appearing in tuple/map type signatures is a big
readability bonus. A Seq[(String, BlockStatus)] is not very clear.
Further, we can now use more Scala features, like matching on BlockId types.
+ Explicit usage - we can now formally tell where various BlockIds are being used
(without doing string searches); this makes updating current BlockIds a much
clearer process, and compiler-supported.
(I'm looking at you, shuffle file consolidation.)
+ It will only get harder to make this change as time goes on.
Since this touches a lot of files, it'd be best to either get this patch
in quickly or throw it on the ground to avoid too many secondary merge conflicts.
Add an optional closure parameter to HadoopRDD instantiation to use when creating local JobConfs.
Having HadoopRDD accept this optional closure eliminates the need for the HadoopFileRDD added earlier. It makes the HadoopRDD more general, in that the caller can specify any JobConf initialization flow.
Address review comments, move to incubator spark
Also includes a small fix to speculative execution.
<edit> Continued from https://github.com/mesos/spark/pull/914 </edit>
Standalone Scheduler fault tolerance using ZooKeeper
This patch implements full distributed fault tolerance for standalone scheduler Masters.
There is only one master Leader at a time, which is actively serving scheduling
requests. If this Leader crashes, another master will eventually be elected, reconstruct
the state from the first Master, and continue serving scheduling requests.
Leader election is performed using the ZooKeeper leader election pattern. We try to minimize
the use of ZooKeeper and the assumptions about ZooKeeper's behavior, so there is a layer of
retries and session monitoring on top of the ZooKeeper client.
Master failover follows directly from the single-node Master recovery via the file
system (patch d5a96fe), save that the Master state is stored in ZooKeeper instead.
Configuration:
By default, no recovery mechanism is enabled (spark.deploy.recoveryMode = NONE).
By setting spark.deploy.recoveryMode to ZOOKEEPER and setting spark.deploy.zookeeper.url
to an appropriate ZooKeeper URL, ZooKeeper recovery mode is enabled.
By setting spark.deploy.recoveryMode to FILESYSTEM and setting spark.deploy.recoveryDirectory
to an appropriate directory accessible by the Master, we will keep the behavior of from d5a96fe.
Additionally, places where a Master could be specificied by a spark:// url can now take
comma-delimited lists to specify backup masters. Note that this is only used for registration
of NEW Workers and application Clients. Once a Worker or Client has registered with the
Master Leader, it is "in the system" and will never need to register again.
SPARK-900 Use coarser grained naming for metrics
see SPARK-900 Use coarser grained naming for metrics.
Now the new metric name is formatted as {XXX.YYY.ZZZ.COUNTER_UNIT}, XXX.YYY.ZZZ represents the group name, which can group several metrics under the same Ganglia view.
Don't allocate Kryo buffers unless needed
I noticed that the Kryo serializer could be slower than the Java one by 2-3x on small shuffles because it spend a lot of time initializing Kryo Input and Output objects. This is because our default buffer size for them is very large. Since the serializer is often used on streams, I made the initialization lazy for that, and used a smaller buffer (auto-managed by Kryo) for input.
Conflicts:
bagel/pom.xml
core/pom.xml
core/src/test/scala/org/apache/spark/ui/UISuite.scala
examples/pom.xml
mllib/pom.xml
pom.xml
project/SparkBuild.scala
repl/pom.xml
streaming/pom.xml
tools/pom.xml
In scala 2.10, a shorter representation is used for naming artifacts
so changed to shorter scala version for artifacts and made it a property in pom.
Resolving package conflicts with hadoop 0.23.9
Hadoop 0.23.9 is having a package conflict with easymock's dependencies.
(cherry picked from commit 023e3fdf00)
Signed-off-by: Reynold Xin <rxin@apache.org>
Allow users to pass broadcasted Configurations and cache InputFormats across Hadoop file reads.
Note: originally from https://github.com/mesos/spark/pull/942
Currently motivated by Shark queries on Hive-partitioned tables, where there's a JobConf broadcast for every Hive-partition (i.e., every subdirectory read). The only thing different about those JobConfs is the input path - the Hadoop Configuration that the JobConfs are constructed from remain the same.
This PR only modifies the old Hadoop API RDDs, but similar additions to the new API might reduce computation latencies a little bit for high-frequency FileInputDStreams (which only uses the new API right now).
As a small bonus, added InputFormats caching, to avoid reflection calls for every RDD#compute().
Few other notes:
Added a general soft-reference hashmap in SparkHadoopUtil because I wanted to avoid adding another class to SparkEnv.
SparkContext default hadoopConfiguration isn't cached. There's no equals() method for Configuration, so there isn't a good way to determine when configuration properties have changed.
SPARK-920/921 - JSON endpoint updates
920 - Removal of duplicate scheme part of Spark URI, it was appearing as spark://spark//host:port in the JSON field.
JSON now delivered as:
url:spark://127.0.0.1:7077
921 - Adding the URL of the Main Application UI will allow custom interfaces (that use the JSON output) to redirect from the standalone UI.
One major change was the use of messages instead of raw functions as the
parameter of Akka scheduled timers. Since messages are serialized, unlike
raw functions, the behavior is easier to think about and doesn't cause
race conditions when exceptions are thrown.
Another change is to avoid using global pointers that might change without
a lock.
The previous names were confusing because the components weren't just
used in Standalone mode -- in fact, the scheduler used for Standalone
mode is called SparkDeploySchedulerBackend. So, the previous names
were misleading.
Currently PythonPartitioner determines partition ID by hashing a
byte-array representation of PySpark's key. This PR lets
PythonPartitioner use the actual partition ID, which is required e.g.
for sorting via PySpark.
This patch implements full distributed fault tolerance for standalone scheduler Masters.
There is only one master Leader at a time, which is actively serving scheduling
requests. If this Leader crashes, another master will eventually be elected, reconstruct
the state from the first Master, and continue serving scheduling requests.
Leader election is performed using the ZooKeeper leader election pattern. We try to minimize
the use of ZooKeeper and the assumptions about ZooKeeper's behavior, so there is a layer of
retries and session monitoring on top of the ZooKeeper client.
Master failover follows directly from the single-node Master recovery via the file
system (patch 194ba4b8), save that the Master state is stored in ZooKeeper instead.
Configuration:
By default, no recovery mechanism is enabled (spark.deploy.recoveryMode = NONE).
By setting spark.deploy.recoveryMode to ZOOKEEPER and setting spark.deploy.zookeeper.url
to an appropriate ZooKeeper URL, ZooKeeper recovery mode is enabled.
By setting spark.deploy.recoveryMode to FILESYSTEM and setting spark.deploy.recoveryDirectory
to an appropriate directory accessible by the Master, we will keep the behavior of from 194ba4b8.
Additionally, places where a Master could be specificied by a spark:// url can now take
comma-delimited lists to specify backup masters. Note that this is only used for registration
of NEW Workers and application Clients. Once a Worker or Client has registered with the
Master Leader, it is "in the system" and will never need to register again.
Forthcoming:
Documentation, tests (! - only ad hoc testing has been performed so far)
I do not intend for this commit to be merged until tests are added, but this patch should
still be mostly reviewable until then.
Implements a basic form of Standalone Scheduler fault recovery. In particular,
this allows faults to be manually recovered from by means of restarting the
Master process on the same machine. This is the majority of the code necessary
for general fault tolerance, which will first elect a leader and then recover
the Master state.
In order to enable fault recovery, the Master will persist a small amount of state related
to the registration of Workers and Applications to disk. If the Master is started and
sees that this state is still around, it will enter Recovery mode, during which time it
will not schedule any new Executors on Workers (but it does accept the registration of
new Clients and Workers).
At this point, the Master attempts to reconnect to all Workers and Client applications
that were registered at the time of failure. After confirming either the existence
or nonexistence of all such nodes (within a certain timeout), the Master will exit
Recovery mode and resume normal scheduling.
Improved organization of scheduling packages.
This commit does not change any code -- only file organization.
Please let me know if there was some masterminded strategy behind
the existing organization that I failed to understand!
There are two components of this change:
(1) Moving files out of the cluster package, and down
a level to the scheduling package. These files are all used by
the local scheduler in addition to the cluster scheduler(s), so
should not be in the cluster package. As a result of this change,
none of the files in the local package reference files in the
cluster package.
(2) Moving the mesos package to within the cluster package.
The mesos scheduling code is for a cluster, and represents a
specific case of cluster scheduling (the Mesos-related classes
often subclass cluster scheduling classes). Thus, the most logical
place for it seems to be within the cluster package.
The one thing about the scheduling code that seems a little funny to me
is the naming of the SchedulerBackends. The StandaloneSchedulerBackend
is not just for Standalone mode, but instead is used by Mesos coarse grained
mode and Yarn, and the backend that *is* just for Standalone mode is instead called SparkDeploySchedulerBackend. I didn't change this because I wasn't sure if there
was a reason for this naming that I'm just not aware of.
some minor fixes to MemoryStore
This is a repeat of #5, moved to its own branch in my repo.
This makes all updates to on ; it skips on synchronizing the reads where it can get away with it.
This commit does not change any code -- only file organization.
There are two components of this change:
(1) Moving files out of the cluster package, and down
a level to the scheduling package. These files are all used by
the local scheduler in addition to the cluster scheduler(s), so
should not be in the cluster package. As a result of this change,
none of the files in the local package reference files in the
cluster package.
(2) Moving the mesos package to within the cluster package.
The mesos scheduling code is for a cluster, and represents a
specific case of cluster scheduling (the Mesos-related classes
often subclass cluster scheduling classes). Thus, the most logical
place for it is within the cluster package.
This change requires adding an extra failure mode: tasks can complete
successfully, but the result gets lost or flushed from the block manager
before it's been fetched.
Make "currentMemory" @volatile, so that it's reads in ensureFreeSpace() are atomic and up-to-date--i.e., currentMemory can't increase while putLock is held (though it could decrease, which would only help ensureFreeSpace()).
In MapOutputTrackerSuite, the "remote fetch" test sets spark.driver.port
and spark.hostPort, assuming that they will be cleared by
LocalSparkContext. However, the test never sets sc, so it remains null,
causing LocalSparkContext to skip clearing these properties. Subsequent
tests therefore fail with java.net.BindException: "Address already in
use".
This commit makes LocalSparkContext clear the properties even if sc is
null.
The fact that DynamicVariable uses an InheritableThreadLocal
can cause problems where the properties end up being shared
across threads in certain circumstances.
StandaloneSchedulerBackend instead of the smaller IDs used within Spark
(that lack the application name).
This was reported by ClearStory in
https://github.com/clearstorydata/spark/pull/9.
Also fixed some messages that said slave instead of executor.
Include the useful tip that if shuffle=true, coalesce can actually
increase the number of partitions.
This makes coalesce more like a generic `RDD.repartition` operation.
(Ideally this `RDD.repartition` could automatically choose either a coalesce or
a shuffle if numPartitions was either less than or greater than, respectively,
the current number of partitions.)
The sc.StorageLevel -> StorageLevel pathway is a bit janky, but otherwise
the shell would have to call a private method of SparkContext. Having
StorageLevel available in sc also doesn't seem like the end of the world.
There may be a better solution, though.
As for creating the StorageLevel object itself, this seems to be the best
way in Python 2 for creating singleton, enum-like objects:
http://stackoverflow.com/questions/36932/how-can-i-represent-an-enum-in-python
Caching the results of local actions (e.g., rdd.first()) causes the driver to
store entire partitions in its own memory, which may be highly constrained.
This patch simply makes the CacheManager avoid caching the result of all locally-run computations.
- Use "fluid" layout that can expand to wide browser windows, instead of
the old one's limit of 1200 px
- Remove unnecessary <hr> elements
- Switch back to Bootstrap's default theme and tweak progress bar colors
- Make headers more consistent between deploy and app UIs
- Replace some inline CSS with stylesheets
This includes the following changes:
- The "assembly" package now builds in Maven by default, and creates an
assembly containing both hadoop-client and Spark, unlike the old
BigTop distribution assembly that skipped hadoop-client
- There is now a bigtop-dist package to build the old BigTop assembly
- The repl-bin package is no longer built by default since the scripts
don't reply on it; instead it can be enabled with -Prepl-bin
- Py4J is now included in the assembly/lib folder as a local Maven repo,
so that the Maven package can link to it
- run-example now adds the original Spark classpath as well because the
Maven examples assembly lists spark-core and such as provided
- The various Maven projects add a spark-yarn dependency correctly