This change adds Java examples and unit tests for all GLM algorithms
to make sure the MLLib interface works from Java. Changes include
- Introduce LabeledPoint and avoid using Doubles in train arguments
- Rename train to run in class methods
- Make the optimizer a member variable of GLM to make sure the builder
pattern works
- Changes ALS to accept RDD[Rating] instead of (Int, Int, Double) making it
easier to call from Java
- Renames class methods from `train` to `run` to enable static methods to be
called from Java.
- Add unit tests which check if both static / class methods can be called.
- Also add examples which port the main() function in ALS, KMeans to the
examples project.
Couple of minor changes to existing code:
- Add a toJavaRDD method in RDD to convert scala RDD to java RDD easily
- Workaround a bug where using double[] from Java leads to class cast exception in
KMeans init
A ThreadLocal SparkEnv.env is facing various situations leading to
NullPointerExceptions, where SparkEnv.env set in one thread is not
gettable in another thread, but often assumed to be available.
See, e.g., https://groups.google.com/forum/#!topic/spark-developers/GLx8yunSj0A
This hotfixes SparkEnv.env to return either (a) the ThreadLocal
value if non-null, or (b) the previously set value in any thread.
This approach preserves SparkEnv.set() thread safety needed by
RDD.compute() and possibly other places. A refactoring that
parameterizes SparkEnv should be addressed subsequently.
On branch adatao-global-SparkEnv
Changes to be committed:
modified: core/src/main/scala/spark/SparkEnv.scala
This change refactors the structure of GLMs to use mixins which maintain
a similar interface to other ML lib algorithms. This change also creates
an Optimizer trait which allows GLMs to be extended to use other optimization
techniques.
For debugging and analysis purposes, it's nice to have the exact command
used to launch Spark contained within the logs. This adds the necessary
hooks to make that possible.