Minor style cleanup. Mostly on indenting & line width changes.
Focused on the few important files since they are the files that new contributors usually read first.
Don't delegate to users `sbt`.
This changes our `sbt/sbt` script to not delegate to the user's `sbt`
even if it is present. If users already have sbt installed and they
want to use their own sbt, we'd expect them to just call sbt directly
from within Spark. We no longer set any enironment variables or anything
from this script, so they should just launch sbt directly on their own.
There are a number of hard-to-debug issues which can come from the
current appraoch. One is if the user is unaware of an existing sbt
installation and now without explanation their build breaks because
they haven't configured options correctly (such as permgen size)
within their sbt (reported by @patmcdonough). Another is if the user has a much older version
of sbt hanging around, in which case some of the older versions
don't acutally work well when newer verisons of sbt are specified
in the build file (reported by @marmbrus). A third is if the user
has done some other modification to their sbt script, such as
setting it to delegate to sbt/sbt in Spark, and this causes
that to break (also reported by @marmbrus).
So to keep things simple let's just avoid this path and
remove it. Any user who already has sbt and wants to build
spark with it should be able to understand easily how to do it.
This changes our `sbt/sbt` script to not delegate to the user's `sbt`
even if it is present. If users already have sbt installed and they
want to use their own sbt, we'd expect them to just call sbt directly
from within Spark. We no longer set any enironment variables or anything
from this script, so they should just launch sbt directly on their own.
There are a number of hard-to-debug issues which can come from the
current appraoch. One is if the user is unaware of an existing sbt
installation and now without explanation their build breaks because
they haven't configured options correctly (such as permgen size)
within their sbt. Another is if the user has a much older version
of sbt hanging around, in which case some of the older versions
don't acutally work well when newer verisons of sbt are specified
in the build file (reported by @marmbrus). A third is if the user
has done some other modification to their sbt script, such as
setting it to delegate to sbt/sbt in Spark, and this causes
that to break (also reported by @marmbrus).
So to keep things simple let's just avoid this path and
remove it. Any user who already has sbt and wants to build
spark with it should be able to understand easily how to do it.
Fixing config option "retained_stages" => "retainedStages".
This is a very esoteric option and it's out of sync with the style we use.
So it seems fitting to fix it for 0.9.0.
Set boolean param name for call to SparkHadoopMapReduceUtil.newTaskAttemptID
Set boolean param name for call to SparkHadoopMapReduceUtil.newTaskAttemptID to make it clear which param being set.
Add CDH Repository to Maven Build
At some point this was removed from the Maven build... so I'm adding it back. It's needed for the Hadoop2 tests we run on Jenkins and it's also included in the SBT build.
Remove calls to deprecated mapred's OutputCommitter.cleanupJob
Since Hadoop 1.0.4 the mapred OutputCommitter.commitJob should do cleanup job via call to OutputCommitter.cleanupJob,
Remove SparkHadoopWriter.cleanup since it is used only by PairRDDFunctions.
In fact the implementation of mapred OutputCommitter.commitJob looks like this:
public void commitJob(JobContext jobContext) throws IOException {
cleanupJob(jobContext);
}
support distributing extra files to worker for yarn client mode
So that user doesn't need to package all dependency into one assemble jar as spark app jar
the mapred OutputCommitter.commitJob should do cleanup job.
In fact the implementation of mapred OutputCommitter.commitJob looks like this:
public void commitJob(JobContext jobContext) throws IOException {
cleanupJob(jobContext);
}
(The jobContext input argument is type of org.apache.hadoop.mapred.JobContext)
SPARK-1009 Updated MLlib docs to show how to use it in Python
In addition added detailed examples for regression, clustering and recommendation algorithms in a separate Scala section. Fixed a few minor issues with existing documentation.
Refactored the streaming project to separate external libraries like Twitter, Kafka, Flume, etc.
At a high level, these are the following changes.
1. All the external code was put in `SPARK_HOME/external/` as separate SBT projects and Maven modules. Their artifact names are `spark-streaming-twitter`, `spark-streaming-kafka`, etc. Both SparkBuild.scala and pom.xml files have been updated. References to external libraries and repositories have been removed from the settings of root and streaming projects/modules.
2. To avail the external functionality (say, creating a Twitter stream), the developer has to `import org.apache.spark.streaming.twitter._` . For Scala API, the developer has to call `TwitterUtils.createStream(streamingContext, ...)`. For the Java API, the developer has to call `TwitterUtils.createStream(javaStreamingContext, ...)`.
3. Each external project has its own scala and java unit tests. Note the unit tests of each external library use classes of the streaming unit tests (`TestSuiteBase`, `LocalJavaStreamingContext`, etc.). To enable this code sharing among test classes, `dependsOn(streaming % "compile->compile,test->test")` was used in the SparkBuild.scala . In the streaming/pom.xml, an additional `maven-jar-plugin` was necessary to capture this dependency (see comment inside the pom.xml for more information).
4. Jars of the external projects have been added to examples project but not to the assembly project.
5. In some files, imports have been rearrange to conform to the Spark coding guidelines.
Get rid of `Either[ActorRef, ActorSelection]'
In this pull request, instead of returning an `Either[ActorRef, ActorSelection]`, `registerOrLookup` identifies the remote actor blockingly to obtain an `ActorRef`, or throws an exception if the remote actor doesn't exist or the lookup times out (configured by `spark.akka.lookupTimeout`). This function is only called when an `SparkEnv` is constructed (instantiating driver or executor), so the blocking call is considered acceptable. Executor side `ActorSelection`s/`ActorRef`s to driver side `MapOutputTrackerMasterActor` and `BlockManagerMasterActor` are affected by this pull request.
`ActorSelection` is dangerous and should be used with care. It's only absolutely safe to send messages via an `ActorSelection` when the remote actor is stateless, so that actor incarnation is irrelevant. But as pointed by @ScrapCodes in the comments below, executor exits immediately once the connection to the driver lost, `ActorSelection`s are not harmful in this scenario. So this pull request is mostly a code style patch.
Added ‘-i’ command line option to Spark REPL
We had to create a new implementation of both scala.tools.nsc.CompilerCommand and scala.tools.nsc.Settings, because using scala.tools.nsc.GenericRunnerSettings would bring in other options (-howtorun, -save and -execute) which don’t make sense in Spark.
Any new Spark specific command line option could now be added to org.apache.spark.repl.SparkRunnerSettings class.
Since the behavior of loading a script from the command line should be the same as loading it using the “:load” command inside the shell, the script should be loaded when the SparkContext is available, that’s why we had to move the call to ‘loadfiles(settings)’ _after_ the call to postInitialization(). This still doesn’t work if ‘isAsync = true’.
Add way to limit default # of cores used by apps in standalone mode
Also documents the spark.deploy.spreadOut option, and fixes a config option that had a dash in its name.
Don't leave os.arch unset after BlockManagerSuite
Recent SparkConf changes meant that BlockManagerSuite was now leaving the os.arch System.property unset. That's a problem for any subsequent tests that rely upon having a valid os.arch. This is true for CompressionCodecSuite in the usual maven build test order, even though it isn't usually true for the sbt build.
SPARK-1012: DAGScheduler Exception Fix
Added a predict method to MatrixFactorizationModel to enable bulk prediction. This method takes and RDD[(Int, Int)] of users and products and return an RDD with a Rating element per each element in the input RDD.
Also added python bindings to the new bulk prediction methods to address SPARK-1011 issue.
This is ready to be merged now.
Add log4j exclusion rule to maven.
To make this work I had to rename the defaults file. Otherwise
maven's pattern matching rules included it when trying to match
other log4j.properties files.
I also fixed a bug in the existing maven build where two
<transformers> tags were present in assembly/pom.xml
such that one overwrote the other.
To make this work I had to rename the defaults file. Otherwise
maven's pattern matching rules included it when trying to match
other log4j.properties files.
I also fixed a bug in the existing maven build where two
<transformers> tags were present in assembly/pom.xml
such that one overwrote the other.
Mllib 16 bugfix
Bug fix: https://spark-project.atlassian.net/browse/MLLIB-16
Hi, I fixed the bug and added a test suite for `GradientDescent`. There are 2 checks in the test case. First, the final loss must be lower than the initial one. Second, the trend of loss sequence should be decreasing, i.e., at least 80% iterations have lower losses than their prior iterations.
Thanks!
add the comments about SPARK_WORKER_DIR
this env variable seems to be forgotten
in many cases we need to set this variable, e.g. in EC2, we have to move the large application log files from the EBS to the ephemeral storage
Suggested small changes to Java code for slightly more standard style, encapsulation and in some cases performance
Sorry if this is too abrupt or not a welcome set of changes, but thought I'd see if I could contribute a little. I'm a Java developer and just getting seriously into Spark. So I thought I'd suggest a number of small changes to the couple Java parts of the code to make it a little tighter, more standard and even a bit faster.
Feel free to take all, some or none of this. Happy to explain any of it.
Conf improvements
There are two new features.
1. Allow users to set arbitrary akka configurations via spark conf.
2. Allow configuration to be printed in logs for diagnosis.