spark-instrumented-optimizer/project
Joseph K. Bradley dc0c4490a1 [SPARK-4789] [SPARK-4942] [SPARK-5031] [mllib] Standardize ML Prediction APIs
This is part (1a) of the updates from the design doc in [https://docs.google.com/document/d/1BH9el33kBX8JiDdgUJXdLW14CA2qhTCWIG46eXZVoJs]

**UPDATE**: Most of the APIs are being kept private[spark] to allow further discussion.  Here is a list of changes which are public:
* new output columns: rawPrediction, probabilities
  * The “score” column is now called “rawPrediction”
* Classifiers now provide numClasses
* Params.get and .set are now protected instead of private[ml].
* ParamMap now has a size method.
* new classes: LinearRegression, LinearRegressionModel
* LogisticRegression now has an intercept.

### Sketch of APIs (most of which are private[spark] for now)

Abstract classes for learning algorithms (+ corresponding Model abstractions):
* Classifier (+ ClassificationModel)
* ProbabilisticClassifier (+ ProbabilisticClassificationModel)
* Regressor (+ RegressionModel)
* Predictor (+ PredictionModel)
* *For all of these*:
 * There is no strongly typed training-time API.
 * There is a strongly typed test-time (prediction) API which helps developers implement new algorithms.

Concrete classes: learning algorithms
* LinearRegression
* LogisticRegression (updated to use new abstract classes)
 * Also, removed "score" in favor of "probability" output column.  Changed BinaryClassificationEvaluator to match. (SPARK-5031)

Other updates:
* params.scala: Changed Params.set/get to be protected instead of private[ml]
 * This was needed for the example of defining a class from outside of the MLlib namespace.
* VectorUDT: Will later change from private[spark] to public.
 * This is needed for outside users to write their own validateAndTransformSchema() methods using vectors.
 * Also, added equals() method.f
* SPARK-4942 : ML Transformers should allow output cols to be turned on,off
 * Update validateAndTransformSchema
 * Update transform
* (Updated examples, test suites according to other changes)

New examples:
* DeveloperApiExample.scala (example of defining algorithm from outside of the MLlib namespace)
 * Added Java version too

Test Suites:
* LinearRegressionSuite
* LogisticRegressionSuite
* + Java versions of above suites

CC: mengxr  etrain  shivaram

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #3637 from jkbradley/ml-api-part1 and squashes the following commits:

405bfb8 [Joseph K. Bradley] Last edits based on code review.  Small cleanups
fec348a [Joseph K. Bradley] Added JavaDeveloperApiExample.java and fixed other issues: Made developer API private[spark] for now. Added constructors Java can understand to specialized Param types.
8316d5e [Joseph K. Bradley] fixes after rebasing on master
fc62406 [Joseph K. Bradley] fixed test suites after last commit
bcb9549 [Joseph K. Bradley] Fixed issues after rebasing from master (after move from SchemaRDD to DataFrame)
9872424 [Joseph K. Bradley] fixed JavaLinearRegressionSuite.java Java sql api
f542997 [Joseph K. Bradley] Added MIMA excludes for VectorUDT (now public), and added DeveloperApi annotation to it
216d199 [Joseph K. Bradley] fixed after sql datatypes PR got merged
f549e34 [Joseph K. Bradley] Updates based on code review.  Major ones are: * Created weakly typed Predictor.train() method which is called by fit() so that developers do not have to call schema validation or copy parameters. * Made Predictor.featuresDataType have a default value of VectorUDT.   * NOTE: This could be dangerous since the FeaturesType type parameter cannot have a default value.
343e7bd [Joseph K. Bradley] added blanket mima exclude for ml package
82f340b [Joseph K. Bradley] Fixed bug in LogisticRegression (introduced in this PR).  Fixed Java suites
0a16da9 [Joseph K. Bradley] Fixed Linear/Logistic RegressionSuites
c3c8da5 [Joseph K. Bradley] small cleanup
934f97b [Joseph K. Bradley] Fixed bugs from previous commit.
1c61723 [Joseph K. Bradley] * Made ProbabilisticClassificationModel into a subclass of ClassificationModel.  Also introduced ProbabilisticClassifier.  * This was to support output column “probabilityCol” in transform().
4e2f711 [Joseph K. Bradley] rat fix
bc654e1 [Joseph K. Bradley] Added spark.ml LinearRegressionSuite
8d13233 [Joseph K. Bradley] Added methods: * Classifier: batch predictRaw() * Predictor: train() without paramMap ProbabilisticClassificationModel.predictProbabilities() * Java versions of all above batch methods + others
1680905 [Joseph K. Bradley] Added JavaLabeledPointSuite.java for spark.ml, and added constructor to LabeledPoint which defaults weight to 1.0
adbe50a [Joseph K. Bradley] * fixed LinearRegression train() to use embedded paramMap * added Predictor.predict(RDD[Vector]) method * updated Linear/LogisticRegressionSuites
58802e3 [Joseph K. Bradley] added train() to Predictor subclasses which does not take a ParamMap.
57d54ab [Joseph K. Bradley] * Changed semantics of Predictor.train() to merge the given paramMap with the embedded paramMap. * remove threshold_internal from logreg * Added Predictor.copy() * Extended LogisticRegressionSuite
e433872 [Joseph K. Bradley] Updated docs.  Added LabeledPointSuite to spark.ml
54b7b31 [Joseph K. Bradley] Fixed issue with logreg threshold being set correctly
0617d61 [Joseph K. Bradley] Fixed bug from last commit (sorting paramMap by parameter names in toString).  Fixed bug in persisting logreg data.  Added threshold_internal to logreg for faster test-time prediction (avoiding map lookup).
601e792 [Joseph K. Bradley] Modified ParamMap to sort parameters in toString.  Cleaned up classes in class hierarchy, before implementing tests and examples.
d705e87 [Joseph K. Bradley] Added LinearRegression and Regressor back from ml-api branch
52f4fde [Joseph K. Bradley] removing everything except for simple class hierarchy for classification
d35bb5d [Joseph K. Bradley] fixed compilation issues, but have not added tests yet
bfade12 [Joseph K. Bradley] Added lots of classes for new ML API:
2015-02-05 23:43:47 -08:00
..
project Support cross building for Scala 2.11 2014-11-11 21:36:48 -08:00
spark-style/src/main/scala/org/apache/spark/scalastyle [Core] Upgrading ScalaStyle version to 0.5 and removing SparkSpaceAfterCommentStartChecker. 2014-10-16 02:05:44 -04:00
build.properties [SPARK-5415] bump sbt to version to 0.13.7 2015-01-28 02:13:06 -08:00
MimaBuild.scala Bumping version to 1.3.0-SNAPSHOT. 2014-11-18 21:24:18 -08:00
MimaExcludes.scala [SPARK-4789] [SPARK-4942] [SPARK-5031] [mllib] Standardize ML Prediction APIs 2015-02-05 23:43:47 -08:00
plugins.sbt [SPARK-3997][Build]scalastyle should output the error location 2014-10-26 16:24:50 -07:00
SparkBuild.scala [SPARK-5620][DOC] group methods in generated unidoc 2015-02-05 16:26:51 -08:00