Mllib 16 bugfix
Bug fix: https://spark-project.atlassian.net/browse/MLLIB-16
Hi, I fixed the bug and added a test suite for `GradientDescent`. There are 2 checks in the test case. First, the final loss must be lower than the initial one. Second, the trend of loss sequence should be decreasing, i.e., at least 80% iterations have lower losses than their prior iterations.
Thanks!
add the comments about SPARK_WORKER_DIR
this env variable seems to be forgotten
in many cases we need to set this variable, e.g. in EC2, we have to move the large application log files from the EBS to the ephemeral storage
Suggested small changes to Java code for slightly more standard style, encapsulation and in some cases performance
Sorry if this is too abrupt or not a welcome set of changes, but thought I'd see if I could contribute a little. I'm a Java developer and just getting seriously into Spark. So I thought I'd suggest a number of small changes to the couple Java parts of the code to make it a little tighter, more standard and even a bit faster.
Feel free to take all, some or none of this. Happy to explain any of it.
Conf improvements
There are two new features.
1. Allow users to set arbitrary akka configurations via spark conf.
2. Allow configuration to be printed in logs for diagnosis.
Add a script to download sbt if not present on the system
As per the discussion on the dev mailing list this script will use the system sbt if present or otherwise attempt to install the sbt launcher. The fall back error message in the event it fails instructs the user to install sbt. While the URLs it fetches from aren't controlled by the spark project directly, they are stable and the current authoritative sources.
Update stop-slaves.sh
The most recently version has changed the directory structure, but this script "sbin/stop-all.sh" doesn't change with it accordingly. This mistake makes "sbin/stop-all.sh" can't stop the slave node.
The most recently version has changed the directory structure, but this script "sbin/stop-all.sh" doesn't change with it accordingly. This mistake makes "sbin/stop-all.sh" can't stop the slave node.
Fix test breaking downstream builds
This wasn't detected in the pull-request-builder because it manually sets SPARK_HOME. I'm going to change that (it should't do this) to make it like the other builds.
```
[error] /pod/home/anovak/build/graphx/core/src/main/scala/org/apache/spark/storage/ShuffleBlockManager.scala:172: not enough arguments for constructor PrimitiveKeyOpenHashMap: (initialCapacity: Int)(implicit evidence$3: ClassManifest[Int], implicit evidence$4: ClassManifest[Int])org.apache.spark.util.collection.PrimitiveKeyOpenHashMap[Int,Int]
[error] private val mapIdToIndex = new PrimitiveKeyOpenHashMap[Int, Int]()
[error] ^
[info] No documentation generated with unsucessful compiler run
[error] one error found
[error] (core/compile:doc) Scaladoc generation failed
[error] Total time: 67 s, completed Jan 6, 2014 2:20:51 PM
```
In theory a no-argument constructor ought not to differ from one with a single argument that has a default value, but in practice there seems to be an issue.
Change protobuf version for yarn alpha back to 2.4.1
The maven build for yarn-alpha uses the wrong protobuf version and hence the generated assembly jar doesn't work with Hadoop 0.23. Removing the setting for the yarn-alpha profile since the default protobuf version is 2.4.1 at the top of the pom file.
Fix handling of empty SPARK_EXAMPLES_JAR
Currently if SPARK_EXAMPLES_JAR is left unset you get a null pointer exception when running the examples (atleast on spark on yarn). The null now gets turned into a string of "null" when its put into the SparkConf so addJar no longer properly ignores it. This fixes that so that it can be left unset.