Commit graph

491 commits

Author SHA1 Message Date
gasparms f80e2629bb [SPARK-5800] Streaming Docs. Change linked files according the selected language
Currently, Spark Streaming Programming Guide after updateStateByKey  explanation links to file stateful_network_wordcount.py and note "For the complete Scala code ..." for any language tab selected. This is an incoherence.

I've changed the guide and link its pertinent example file. JavaStatefulNetworkWordCount.java example was not created so I added to the commit.

Author: gasparms <gmunoz@stratio.com>

Closes #4589 from gasparms/feature/streaming-guide and squashes the following commits:

7f37f89 [gasparms] More style changes
ec202b0 [gasparms] Follow spark style guide
f527328 [gasparms] Improve example to look like scala example
4d8785c [gasparms] Remove throw exception
e92e6b8 [gasparms] Fix incoherence
92db405 [gasparms] Fix Streaming Programming Guide. Change files according the selected language
2015-02-14 20:10:29 +00:00
Reynold Xin e98dfe627c [SPARK-5752][SQL] Don't implicitly convert RDDs directly to DataFrames
- The old implicit would convert RDDs directly to DataFrames, and that added too many methods.
- toDataFrame -> toDF
- Dsl -> functions
- implicits moved into SQLContext.implicits
- addColumn -> withColumn
- renameColumn -> withColumnRenamed

Python changes:
- toDataFrame -> toDF
- Dsl -> functions package
- addColumn -> withColumn
- renameColumn -> withColumnRenamed
- add toDF functions to RDD on SQLContext init
- add flatMap to DataFrame

Author: Reynold Xin <rxin@databricks.com>
Author: Davies Liu <davies@databricks.com>

Closes #4556 from rxin/SPARK-5752 and squashes the following commits:

5ef9910 [Reynold Xin] More fix
61d3fca [Reynold Xin] Merge branch 'df5' of github.com:davies/spark into SPARK-5752
ff5832c [Reynold Xin] Fix python
749c675 [Reynold Xin] count(*) fixes.
5806df0 [Reynold Xin] Fix build break again.
d941f3d [Reynold Xin] Fixed explode compilation break.
fe1267a [Davies Liu] flatMap
c4afb8e [Reynold Xin] style
d9de47f [Davies Liu] add comment
b783994 [Davies Liu] add comment for toDF
e2154e5 [Davies Liu] schema() -> schema
3a1004f [Davies Liu] Dsl -> functions, toDF()
fb256af [Reynold Xin] - toDataFrame -> toDF - Dsl -> functions - implicits moved into SQLContext.implicits - addColumn -> withColumn - renameColumn -> withColumnRenamed
0dd74eb [Reynold Xin] [SPARK-5752][SQL] Don't implicitly convert RDDs directly to DataFrames
97dd47c [Davies Liu] fix mistake
6168f74 [Davies Liu] fix test
1fc0199 [Davies Liu] fix test
a075cd5 [Davies Liu] clean up, toPandas
663d314 [Davies Liu] add test for agg('*')
9e214d5 [Reynold Xin] count(*) fixes.
1ed7136 [Reynold Xin] Fix build break again.
921b2e3 [Reynold Xin] Fixed explode compilation break.
14698d4 [Davies Liu] flatMap
ba3e12d [Reynold Xin] style
d08c92d [Davies Liu] add comment
5c8b524 [Davies Liu] add comment for toDF
a4e5e66 [Davies Liu] schema() -> schema
d377fc9 [Davies Liu] Dsl -> functions, toDF()
6b3086c [Reynold Xin] - toDataFrame -> toDF - Dsl -> functions - implicits moved into SQLContext.implicits - addColumn -> withColumn - renameColumn -> withColumnRenamed
807e8b1 [Reynold Xin] [SPARK-5752][SQL] Don't implicitly convert RDDs directly to DataFrames
2015-02-13 23:03:22 -08:00
sboeschhuawei e1a1ff8108 [SPARK-5503][MLLIB] Example code for Power Iteration Clustering
Author: sboeschhuawei <stephen.boesch@huawei.com>

Closes #4495 from javadba/picexamples and squashes the following commits:

3c84b14 [sboeschhuawei] PIC Examples updates from Xiangrui's comments round 5
2878675 [sboeschhuawei] Fourth round with xiangrui on PICExample
d7ac350 [sboeschhuawei] Updates to PICExample from Xiangrui's comments round 3
d7f0cba [sboeschhuawei] Updates to PICExample from Xiangrui's comments round 3
cef28f4 [sboeschhuawei] Further updates to PICExample from Xiangrui's comments
f7ff43d [sboeschhuawei] Update to PICExample from Xiangrui's comments
efeec45 [sboeschhuawei] Update to PICExample from Xiangrui's comments
03e8de4 [sboeschhuawei] Added PICExample
c509130 [sboeschhuawei] placeholder for pic examples
5864d4a [sboeschhuawei] placeholder for pic examples
2015-02-13 09:45:57 -08:00
Davies Liu ea60284095 [SPARK-5704] [SQL] [PySpark] createDataFrame from RDD with columns
Deprecate inferSchema() and applySchema(), use createDataFrame() instead, which could take an optional `schema` to create an DataFrame from an RDD. The `schema` could be StructType or list of names of columns.

Author: Davies Liu <davies@databricks.com>

Closes #4498 from davies/create and squashes the following commits:

08469c1 [Davies Liu] remove Scala/Java API for now
c80a7a9 [Davies Liu] fix hive test
d1bd8f2 [Davies Liu] cleanup applySchema
9526e97 [Davies Liu] createDataFrame from RDD with columns
2015-02-10 19:40:12 -08:00
JqueryFan 6cc96cf0c3 [Spark-5717] [MLlib] add stop and reorganize import
Trivial. add sc stop and reorganize import
https://issues.apache.org/jira/browse/SPARK-5717

Author: JqueryFan <firing@126.com>
Author: Yuhao Yang <hhbyyh@gmail.com>

Closes #4503 from hhbyyh/scstop and squashes the following commits:

7837a2c [JqueryFan] revert import change
2e85cc1 [Yuhao Yang] add stop and reorganize import
2015-02-10 17:37:32 +00:00
Tathagata Das c15134632e [SPARK-4964][Streaming][Kafka] More updates to Exactly-once Kafka stream
Changes
- Added example
- Added a critical unit test that verifies that offset ranges can be recovered through checkpoints

Might add more changes.

Author: Tathagata Das <tathagata.das1565@gmail.com>

Closes #4384 from tdas/new-kafka-fixes and squashes the following commits:

7c931c3 [Tathagata Das] Small update
3ed9284 [Tathagata Das] updated scala doc
83d0402 [Tathagata Das] Added JavaDirectKafkaWordCount example.
26df23c [Tathagata Das] Updates based on PR comments from Cody
e4abf69 [Tathagata Das] Scala doc improvements and stuff.
bb65232 [Tathagata Das] Fixed test bug and refactored KafkaStreamSuite
50f2b56 [Tathagata Das] Added Java API and added more Scala and Java unit tests. Also updated docs.
e73589c [Tathagata Das] Minor changes.
4986784 [Tathagata Das] Added unit test to kafka offset recovery
6a91cab [Tathagata Das] Added example
2015-02-09 22:45:48 -08:00
Xiangrui Meng 855d12ac0a [SPARK-5539][MLLIB] LDA guide
This is the LDA user guide from jkbradley with Java and Scala code example.

Author: Xiangrui Meng <meng@databricks.com>
Author: Joseph K. Bradley <joseph@databricks.com>

Closes #4465 from mengxr/lda-guide and squashes the following commits:

6dcb7d1 [Xiangrui Meng] update java example in the user guide
76169ff [Xiangrui Meng] update java example
36c3ae2 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into lda-guide
c2a1efe [Joseph K. Bradley] Added LDA programming guide, plus Java example (which is in the guide and probably should be removed).
2015-02-08 23:40:36 -08:00
Xiangrui Meng 65181b7512 [HOTFIX][MLLIB] fix a compilation error with java 6
Author: Xiangrui Meng <meng@databricks.com>

Closes #4442 from mengxr/java6-fix and squashes the following commits:

2098500 [Xiangrui Meng] fix a compilation error with java 6
2015-02-06 13:52:35 -08:00
Joseph K. Bradley dc0c4490a1 [SPARK-4789] [SPARK-4942] [SPARK-5031] [mllib] Standardize ML Prediction APIs
This is part (1a) of the updates from the design doc in [https://docs.google.com/document/d/1BH9el33kBX8JiDdgUJXdLW14CA2qhTCWIG46eXZVoJs]

**UPDATE**: Most of the APIs are being kept private[spark] to allow further discussion.  Here is a list of changes which are public:
* new output columns: rawPrediction, probabilities
  * The “score” column is now called “rawPrediction”
* Classifiers now provide numClasses
* Params.get and .set are now protected instead of private[ml].
* ParamMap now has a size method.
* new classes: LinearRegression, LinearRegressionModel
* LogisticRegression now has an intercept.

### Sketch of APIs (most of which are private[spark] for now)

Abstract classes for learning algorithms (+ corresponding Model abstractions):
* Classifier (+ ClassificationModel)
* ProbabilisticClassifier (+ ProbabilisticClassificationModel)
* Regressor (+ RegressionModel)
* Predictor (+ PredictionModel)
* *For all of these*:
 * There is no strongly typed training-time API.
 * There is a strongly typed test-time (prediction) API which helps developers implement new algorithms.

Concrete classes: learning algorithms
* LinearRegression
* LogisticRegression (updated to use new abstract classes)
 * Also, removed "score" in favor of "probability" output column.  Changed BinaryClassificationEvaluator to match. (SPARK-5031)

Other updates:
* params.scala: Changed Params.set/get to be protected instead of private[ml]
 * This was needed for the example of defining a class from outside of the MLlib namespace.
* VectorUDT: Will later change from private[spark] to public.
 * This is needed for outside users to write their own validateAndTransformSchema() methods using vectors.
 * Also, added equals() method.f
* SPARK-4942 : ML Transformers should allow output cols to be turned on,off
 * Update validateAndTransformSchema
 * Update transform
* (Updated examples, test suites according to other changes)

New examples:
* DeveloperApiExample.scala (example of defining algorithm from outside of the MLlib namespace)
 * Added Java version too

Test Suites:
* LinearRegressionSuite
* LogisticRegressionSuite
* + Java versions of above suites

CC: mengxr  etrain  shivaram

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #3637 from jkbradley/ml-api-part1 and squashes the following commits:

405bfb8 [Joseph K. Bradley] Last edits based on code review.  Small cleanups
fec348a [Joseph K. Bradley] Added JavaDeveloperApiExample.java and fixed other issues: Made developer API private[spark] for now. Added constructors Java can understand to specialized Param types.
8316d5e [Joseph K. Bradley] fixes after rebasing on master
fc62406 [Joseph K. Bradley] fixed test suites after last commit
bcb9549 [Joseph K. Bradley] Fixed issues after rebasing from master (after move from SchemaRDD to DataFrame)
9872424 [Joseph K. Bradley] fixed JavaLinearRegressionSuite.java Java sql api
f542997 [Joseph K. Bradley] Added MIMA excludes for VectorUDT (now public), and added DeveloperApi annotation to it
216d199 [Joseph K. Bradley] fixed after sql datatypes PR got merged
f549e34 [Joseph K. Bradley] Updates based on code review.  Major ones are: * Created weakly typed Predictor.train() method which is called by fit() so that developers do not have to call schema validation or copy parameters. * Made Predictor.featuresDataType have a default value of VectorUDT.   * NOTE: This could be dangerous since the FeaturesType type parameter cannot have a default value.
343e7bd [Joseph K. Bradley] added blanket mima exclude for ml package
82f340b [Joseph K. Bradley] Fixed bug in LogisticRegression (introduced in this PR).  Fixed Java suites
0a16da9 [Joseph K. Bradley] Fixed Linear/Logistic RegressionSuites
c3c8da5 [Joseph K. Bradley] small cleanup
934f97b [Joseph K. Bradley] Fixed bugs from previous commit.
1c61723 [Joseph K. Bradley] * Made ProbabilisticClassificationModel into a subclass of ClassificationModel.  Also introduced ProbabilisticClassifier.  * This was to support output column “probabilityCol” in transform().
4e2f711 [Joseph K. Bradley] rat fix
bc654e1 [Joseph K. Bradley] Added spark.ml LinearRegressionSuite
8d13233 [Joseph K. Bradley] Added methods: * Classifier: batch predictRaw() * Predictor: train() without paramMap ProbabilisticClassificationModel.predictProbabilities() * Java versions of all above batch methods + others
1680905 [Joseph K. Bradley] Added JavaLabeledPointSuite.java for spark.ml, and added constructor to LabeledPoint which defaults weight to 1.0
adbe50a [Joseph K. Bradley] * fixed LinearRegression train() to use embedded paramMap * added Predictor.predict(RDD[Vector]) method * updated Linear/LogisticRegressionSuites
58802e3 [Joseph K. Bradley] added train() to Predictor subclasses which does not take a ParamMap.
57d54ab [Joseph K. Bradley] * Changed semantics of Predictor.train() to merge the given paramMap with the embedded paramMap. * remove threshold_internal from logreg * Added Predictor.copy() * Extended LogisticRegressionSuite
e433872 [Joseph K. Bradley] Updated docs.  Added LabeledPointSuite to spark.ml
54b7b31 [Joseph K. Bradley] Fixed issue with logreg threshold being set correctly
0617d61 [Joseph K. Bradley] Fixed bug from last commit (sorting paramMap by parameter names in toString).  Fixed bug in persisting logreg data.  Added threshold_internal to logreg for faster test-time prediction (avoiding map lookup).
601e792 [Joseph K. Bradley] Modified ParamMap to sort parameters in toString.  Cleaned up classes in class hierarchy, before implementing tests and examples.
d705e87 [Joseph K. Bradley] Added LinearRegression and Regressor back from ml-api branch
52f4fde [Joseph K. Bradley] removing everything except for simple class hierarchy for classification
d35bb5d [Joseph K. Bradley] fixed compilation issues, but have not added tests yet
bfade12 [Joseph K. Bradley] Added lots of classes for new ML API:
2015-02-05 23:43:47 -08:00
Xiangrui Meng 6b88825a25 [SPARK-5604][MLLIB] remove checkpointDir from trees
This is the second part of SPARK-5604, which removes checkpointDir from tree strategies. Note that this is a break change. I will mention it in the migration guide.

Author: Xiangrui Meng <meng@databricks.com>

Closes #4407 from mengxr/SPARK-5604-1 and squashes the following commits:

13a276d [Xiangrui Meng] remove checkpointDir from trees
2015-02-05 23:32:09 -08:00
Xiangrui Meng c19152cd2a [SPARK-5604[MLLIB] remove checkpointDir from LDA
`checkpointDir` is a Spark global configuration. Users should set it outside LDA. This PR also hides some methods under `private[clustering] object LDA`, so they don't show up in the generated Java doc (SPARK-5610).

jkbradley

Author: Xiangrui Meng <meng@databricks.com>

Closes #4390 from mengxr/SPARK-5604 and squashes the following commits:

a34bb39 [Xiangrui Meng] remove checkpointDir from LDA
2015-02-05 15:07:33 -08:00
Reynold Xin 7d789e117d [SPARK-5612][SQL] Move DataFrame implicit functions into SQLContext.implicits.
Author: Reynold Xin <rxin@databricks.com>

Closes #4386 from rxin/df-implicits and squashes the following commits:

9d96606 [Reynold Xin] style fix
edd296b [Reynold Xin] ReplSuite
1c946ab [Reynold Xin] [SPARK-5612][SQL] Move DataFrame implicit functions into SQLContext.implicits.
2015-02-04 23:44:34 -08:00
Joseph K. Bradley 980764f3c0 [SPARK-1405] [mllib] Latent Dirichlet Allocation (LDA) using EM
**This PR introduces an API + simple implementation for Latent Dirichlet Allocation (LDA).**

The [design doc for this PR](https://docs.google.com/document/d/1kSsDqTeZMEB94Bs4GTd0mvdAmduvZSSkpoSfn-seAzo) has been updated since I initially posted it.  In particular, see the API and Planning for the Future sections.

* Settle on a public API which may eventually include:
  * more inference algorithms
  * more options / functionality
* Have an initial easy-to-understand implementation which others may improve.
* This is NOT intended to support every topic model out there.  However, if there are suggestions for making this extensible or pluggable in the future, that could be nice, as long as it does not complicate the API or implementation too much.
* This may not be very scalable currently.  It will be important to check and improve accuracy.  For correctness of the implementation, please check against the Asuncion et al. (2009) paper in the design doc.

**Dependency: This makes MLlib depend on GraphX.**

Files and classes:
* LDA.scala (441 lines):
  * class LDA (main estimator class)
  * LDA.Document  (text + document ID)
* LDAModel.scala (266 lines)
  * abstract class LDAModel
  * class LocalLDAModel
  * class DistributedLDAModel
* LDAExample.scala (245 lines): script to run LDA + a simple (private) Tokenizer
* LDASuite.scala (144 lines)

Data/model representation and algorithm:
* Data/model: Uses GraphX, with term vertices + document vertices
* Algorithm: EM, following [Asuncion, Welling, Smyth, and Teh.  "On Smoothing and Inference for Topic Models."  UAI, 2009.](http://arxiv-web3.library.cornell.edu/abs/1205.2662v1)
* For more details, please see the description in the “DEVELOPERS NOTE” in LDA.scala

Please refer to the JIRA for more discussion + the [design doc for this PR](https://docs.google.com/document/d/1kSsDqTeZMEB94Bs4GTd0mvdAmduvZSSkpoSfn-seAzo)

Here, I list the main changes AFTER the design doc was posted.

Design decisions:
* logLikelihood() computes the log likelihood of the data and the current point estimate of parameters.  This is different from the likelihood of the data given the hyperparameters, which would be harder to compute.  I’d describe the current approach as more frequentist, whereas the harder approach would be more Bayesian.
* The current API takes Documents as token count vectors.  I believe there should be an extended API taking RDD[String] or RDD[Array[String]] in a future PR.  I have sketched this out in the design doc (as well as handier versions of getTopics returning Strings).
* Hyperparameters should be set differently for different inference/learning algorithms.  See Asuncion et al. (2009) in the design doc for a good demonstration.  I encourage good behavior via defaults and warning messages.

Items planned for future PRs:
* perplexity
* API taking Strings

* Should LDA be called LatentDirichletAllocation (and LDAModel be LatentDirichletAllocationModel)?
  * Pro: We may someday want LinearDiscriminantAnalysis.
  * Con: Very long names

* Should LDA reside in clustering?  Or do we want a sub-package?
  * mllib.topicmodel
  * mllib.clustering.topicmodel

* Does the API seem reasonable and extensible?

* Unit tests:
  * Should there be a test which checks a clustering results?  E.g., train on a small, fake dataset with 2 very distinct topics/clusters, and ensure LDA finds those 2 topics/clusters.  Does that sound useful or too flaky?

This has not been tested much for scaling.  I have run it on a laptop for 200 iterations on a 5MB dataset with 1000 terms and 5 topics.  Running it for 500 iterations made it fail because of GC problems.  I'm running larger scale tests & will put results here, but future PRs may need to improve the scaling.

* dlwh  for the initial implementation
  * + jegonzal  for some code in the initial implementation
* The many contributors towards topic model implementations in Spark which were referenced as a basis for this PR: akopich witgo yinxusen dlwh EntilZha jegonzal  IlyaKozlov
  * Note: The plan is to include this full list in the authors if this PR gets merged.  Please notify me if you prefer otherwise.

CC: mengxr

Authors:
  Joseph K. Bradley <joseph@databricks.com>
  Joseph Gonzalez <joseph.e.gonzalez@gmail.com>
  David Hall <david.lw.hall@gmail.com>
  Guoqiang Li <witgo@qq.com>
  Xiangrui Meng <meng@databricks.com>
  Pedro Rodriguez <pedro@snowgeek.org>
  Avanesov Valeriy <acopich@gmail.com>
  Xusen Yin <yinxusen@gmail.com>

Closes #2388
Closes #4047 from jkbradley/davidhall-lda and squashes the following commits:

77e8814 [Joseph K. Bradley] small doc fix
5c74345 [Joseph K. Bradley] cleaned up doc based on code review
589728b [Joseph K. Bradley] Updates per code review.  Main change was in LDAExample for faster vocab computation.  Also updated PeriodicGraphCheckpointerSuite.scala to clean up checkpoint files at end
e3980d2 [Joseph K. Bradley] cleaned up PeriodicGraphCheckpointerSuite.scala
74487e5 [Joseph K. Bradley] Merge remote-tracking branch 'upstream/master' into davidhall-lda
4ae2a7d [Joseph K. Bradley] removed duplicate graphx dependency in mllib/pom.xml
e391474 [Joseph K. Bradley] Removed LDATiming.  Added PeriodicGraphCheckpointerSuite.scala.  Small LDA cleanups.
e8d8acf [Joseph K. Bradley] Added catch for BreakIterator exception.  Improved preprocessing to reduce passes over data
1a231b4 [Joseph K. Bradley] fixed scalastyle
91aadfe [Joseph K. Bradley] Added Java-friendly run method to LDA. Added Java test suite for LDA. Changed LDAModel.describeTopics to return Java-friendly type
b75472d [Joseph K. Bradley] merged improvements from LDATiming into LDAExample.  Will remove LDATiming after done testing
993ca56 [Joseph K. Bradley] * Removed Document type in favor of (Long, Vector) * Changed doc ID restriction to be: id must be nonnegative and unique in the doc (instead of 0,1,2,...) * Add checks for valid ranges of eta, alpha * Rename “LearningState” to “EMOptimizer” * Renamed params: termSmoothing -> topicConcentration, topicSmoothing -> docConcentration   * Also added aliases alpha, beta
cb5a319 [Joseph K. Bradley] Added checkpointing to LDA * new class PeriodicGraphCheckpointer * params checkpointDir, checkpointInterval to LDA
43c1c40 [Joseph K. Bradley] small cleanup
0b90393 [Joseph K. Bradley] renamed LDA LearningState.collectTopicTotals to globalTopicTotals
77a2c85 [Joseph K. Bradley] Moved auto term,topic smoothing computation to get*Smoothing methods.  Changed word to term in some places.  Updated LDAExample to use default smoothing amounts.
fb1e7b5 [Xiangrui Meng] minor
08d59a3 [Xiangrui Meng] reset spacing
9fe0b95 [Xiangrui Meng] optimize aggregateMessages
cec0a9c [Xiangrui Meng] * -> *=
6cb11b0 [Xiangrui Meng] optimize computePTopic
9eb3d02 [Xiangrui Meng] + -> +=
892530c [Xiangrui Meng] use axpy
45cc7f2 [Xiangrui Meng] mapPart -> flatMap
ce53be9 [Joseph K. Bradley] fixed example name
75749e7 [Joseph K. Bradley] scala style fix
9f2a492 [Joseph K. Bradley] Unit tests and fixes for LDA, now ready for PR
377ebd9 [Joseph K. Bradley] separated LDA models into own file.  more cleanups before PR
2d40006 [Joseph K. Bradley] cleanups before PR
2891e89 [Joseph K. Bradley] Prepped LDA main class for PR, but some cleanups remain
0cb7187 [Joseph K. Bradley] Added 3 files from dlwh LDA implementation
2015-02-02 23:57:37 -08:00
FlytxtRnD 50a1a874e1 [SPARK-5012][MLLib][PySpark]Python API for Gaussian Mixture Model
Python API for the Gaussian Mixture Model clustering algorithm in MLLib.

Author: FlytxtRnD <meethu.mathew@flytxt.com>

Closes #4059 from FlytxtRnD/PythonGmmWrapper and squashes the following commits:

c973ab3 [FlytxtRnD] Merge branch 'PythonGmmWrapper', remote-tracking branch 'upstream/master' into PythonGmmWrapper
339b09c [FlytxtRnD] Added MultivariateGaussian namedtuple  and Arraybuffer in trainGaussianMixture
fa0a142 [FlytxtRnD] New line added
d5b36ab [FlytxtRnD] Changed argument names to lowercase
ac134f1 [FlytxtRnD] Merge branch 'PythonGmmWrapper' of https://github.com/FlytxtRnD/spark into PythonGmmWrapper
6671ea1 [FlytxtRnD] Added mllib/stat/distribution.py
3aee84b [FlytxtRnD] Fixed style issues
2e9f12a [FlytxtRnD] Added mllib/stat/distribution.py and fixed style issues
b22532c [FlytxtRnD] Merge branch 'PythonGmmWrapper', remote-tracking branch 'upstream/master' into PythonGmmWrapper
2e14d82 [FlytxtRnD] Incorporate MultivariateGaussian instances in GaussianMixtureModel
05767c7 [FlytxtRnD] Merge branch 'PythonGmmWrapper', remote-tracking branch 'upstream/master' into PythonGmmWrapper
3464d19 [FlytxtRnD] Merge branch 'PythonGmmWrapper', remote-tracking branch 'upstream/master' into PythonGmmWrapper
c1d4c71 [FlytxtRnD] Merge branch 'PythonGmmWrapper', remote-tracking branch 'origin/PythonGmmWrapper' into PythonGmmWrapper
426d130 [FlytxtRnD] Added random seed parameter
332bad1 [FlytxtRnD] Merge branch 'PythonGmmWrapper', remote-tracking branch 'upstream/master' into PythonGmmWrapper
f82750b [FlytxtRnD] Fixed style issues
5c83825 [FlytxtRnD] Split input file with space delimiter
fda60f3 [FlytxtRnD] Python API for Gaussian Mixture Model
2015-02-02 23:04:55 -08:00
freeman eb0da6c4bd [SPARK-4979][MLLIB] Streaming logisitic regression
This adds support for streaming logistic regression with stochastic gradient descent, in the same manner as the existing implementation of streaming linear regression. It is a relatively simple addition because most of the work is already done by the abstract class `StreamingLinearAlgorithm` and existing algorithms and models from MLlib.

The PR includes
- Streaming Logistic Regression algorithm
- Unit tests for accuracy, streaming convergence, and streaming prediction
- An example use

cc mengxr tdas

Author: freeman <the.freeman.lab@gmail.com>

Closes #4306 from freeman-lab/streaming-logisitic-regression and squashes the following commits:

5c2c70b [freeman] Use Option on model
5cca2bc [freeman] Merge remote-tracking branch 'upstream/master' into streaming-logisitic-regression
275f8bd [freeman] Make private to mllib
3926e4e [freeman] Line formatting
5ee8694 [freeman] Experimental tag for docs
2fc68ac [freeman] Fix example formatting
85320b1 [freeman] Fixed line length
d88f717 [freeman] Remove stray comment
59d7ecb [freeman] Add streaming logistic regression
e78fe28 [freeman] Add streaming logistic regression example
321cc66 [freeman] Set private and protected within mllib
2015-02-02 22:42:15 -08:00
Davies Liu 0561c45449 [SPARK-5154] [PySpark] [Streaming] Kafka streaming support in Python
This PR brings the Python API for Spark Streaming Kafka data source.

```
    class KafkaUtils(__builtin__.object)
     |  Static methods defined here:
     |
     |  createStream(ssc, zkQuorum, groupId, topics, storageLevel=StorageLevel(True, True, False, False,
2), keyDecoder=<function utf8_decoder>, valueDecoder=<function utf8_decoder>)
     |      Create an input stream that pulls messages from a Kafka Broker.
     |
     |      :param ssc:  StreamingContext object
     |      :param zkQuorum:  Zookeeper quorum (hostname:port,hostname:port,..).
     |      :param groupId:  The group id for this consumer.
     |      :param topics:  Dict of (topic_name -> numPartitions) to consume.
     |                      Each partition is consumed in its own thread.
     |      :param storageLevel:  RDD storage level.
     |      :param keyDecoder:  A function used to decode key
     |      :param valueDecoder:  A function used to decode value
     |      :return: A DStream object
```
run the example:

```
bin/spark-submit --driver-class-path external/kafka-assembly/target/scala-*/spark-streaming-kafka-assembly-*.jar examples/src/main/python/streaming/kafka_wordcount.py localhost:2181 test
```

Author: Davies Liu <davies@databricks.com>
Author: Tathagata Das <tdas@databricks.com>

Closes #3715 from davies/kafka and squashes the following commits:

d93bfe0 [Davies Liu] Update make-distribution.sh
4280d04 [Davies Liu] address comments
e6d0427 [Davies Liu] Merge branch 'master' of github.com:apache/spark into kafka
f257071 [Davies Liu] add tests for null in RDD
23b039a [Davies Liu] address comments
9af51c4 [Davies Liu] Merge branch 'kafka' of github.com:davies/spark into kafka
a74da87 [Davies Liu] address comments
dc1eed0 [Davies Liu] Update kafka_wordcount.py
31e2317 [Davies Liu] Update kafka_wordcount.py
370ba61 [Davies Liu] Update kafka.py
97386b3 [Davies Liu] address comment
2c567a5 [Davies Liu] update logging and comment
33730d1 [Davies Liu] Merge branch 'master' of github.com:apache/spark into kafka
adeeb38 [Davies Liu] Merge pull request #3 from tdas/kafka-python-api
aea8953 [Tathagata Das] Kafka-assembly for Python API
eea16a7 [Davies Liu] refactor
f6ce899 [Davies Liu] add example and fix bugs
98c8d17 [Davies Liu] fix python style
5697a01 [Davies Liu] bypass decoder in scala
048dbe6 [Davies Liu] fix python style
75d485e [Davies Liu] add mqtt
07923c4 [Davies Liu] support kafka in Python
2015-02-02 19:16:27 -08:00
Jacek Lewandowski 5a5526164b SPARK-5425: Use synchronised methods in system properties to create SparkConf
SPARK-5425: Fixed usages of system properties

This patch fixes few problems caused by the fact that the Scala wrapper over system properties is not thread-safe and is basically invalid because it doesn't take into account the default values which could have been set in the properties object. The problem is fixed by modifying `Utils.getSystemProperties` method so that it uses `stringPropertyNames` method of the `Properties` class, which is thread-safe (internally it creates a defensive copy in a synchronized method) and returns keys of the properties which were set explicitly and which are defined as defaults.
The other related problem, which is fixed here. was in `ResetSystemProperties` mix-in. It created a copy of the system properties in the wrong way.

This patch also introduces a test case for thread-safeness of SparkConf creation.

Refer to the discussion in https://github.com/apache/spark/pull/4220 for more details.

Author: Jacek Lewandowski <lewandowski.jacek@gmail.com>

Closes #4222 from jacek-lewandowski/SPARK-5425-1.3 and squashes the following commits:

03da61b [Jacek Lewandowski] SPARK-5425: Modified Utils.getSystemProperties to return a map of all system properties - explicit + defaults
8faf2ea [Jacek Lewandowski] SPARK-5425: Use SerializationUtils to save properties in ResetSystemProperties trait
71aa572 [Jacek Lewandowski] SPARK-5425: Use synchronised methods in system properties to create SparkConf
2015-02-02 14:07:19 -08:00
Travis Galoppo 986977340d SPARK-5400 [MLlib] Changed name of GaussianMixtureEM to GaussianMixture
Decoupling the model and the algorithm

Author: Travis Galoppo <tjg2107@columbia.edu>

Closes #4290 from tgaloppo/spark-5400 and squashes the following commits:

9c1534c [Travis Galoppo] Fixed invokation instructions in comments
d848076 [Travis Galoppo] SPARK-5400 Changed name of GaussianMixtureEM to GaussianMixture to separate model from algorithm
2015-01-30 15:32:25 -08:00
Kazuki Taniguchi bc1fc9b60d [SPARK-5094][MLlib] Add Python API for Gradient Boosted Trees
This PR is implementing the Gradient Boosted Trees for Python API.

Author: Kazuki Taniguchi <kazuki.t.1018@gmail.com>

Closes #3951 from kazk1018/gbt_for_py and squashes the following commits:

620d247 [Kazuki Taniguchi] [SPARK-5094][MLlib] Add Python API for Gradient Boosted Trees
2015-01-30 00:39:44 -08:00
Reynold Xin 715632232d [SPARK-5445][SQL] Consolidate Java and Scala DSL static methods.
Turns out Scala does generate static methods for ones defined in a companion object. Finally no need to separate api.java.dsl and api.scala.dsl.

Author: Reynold Xin <rxin@databricks.com>

Closes #4276 from rxin/dsl and squashes the following commits:

30aa611 [Reynold Xin] Add all files.
1a9d215 [Reynold Xin] [SPARK-5445][SQL] Consolidate Java and Scala DSL static methods.
2015-01-29 15:13:09 -08:00
Reynold Xin 5b9760de8d [SPARK-5445][SQL] Made DataFrame dsl usable in Java
Also removed the literal implicit transformation since it is pretty scary for API design. Instead, created a new lit method for creating literals. This doesn't break anything from a compatibility perspective because Literal was added two days ago.

Author: Reynold Xin <rxin@databricks.com>

Closes #4241 from rxin/df-docupdate and squashes the following commits:

c0f4810 [Reynold Xin] Fix Python merge conflict.
094c7d7 [Reynold Xin] Minor style fix. Reset Python tests.
3c89f4a [Reynold Xin] Package.
dfe6962 [Reynold Xin] Updated Python aggregate.
5dd4265 [Reynold Xin] Made dsl Java callable.
14b3c27 [Reynold Xin] Fix literal expression for symbols.
68b31cb [Reynold Xin] Literal.
4cfeb78 [Reynold Xin] [SPARK-5097][SQL] Address DataFrame code review feedback.
2015-01-28 19:10:32 -08:00
Xiangrui Meng e80dc1c5a8 [SPARK-4586][MLLIB] Python API for ML pipeline and parameters
This PR adds Python API for ML pipeline and parameters. The design doc can be found on the JIRA page. It includes transformers and an estimator to demo the simple text classification example code.

TODO:
- [x] handle parameters in LRModel
- [x] unit tests
- [x] missing some docs

CC: davies jkbradley

Author: Xiangrui Meng <meng@databricks.com>
Author: Davies Liu <davies@databricks.com>

Closes #4151 from mengxr/SPARK-4586 and squashes the following commits:

415268e [Xiangrui Meng] remove inherit_doc from __init__
edbd6fe [Xiangrui Meng] move Identifiable to ml.util
44c2405 [Xiangrui Meng] Merge pull request #2 from davies/ml
dd1256b [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-4586
14ae7e2 [Davies Liu] fix docs
54ca7df [Davies Liu] fix tests
78638df [Davies Liu] Merge branch 'SPARK-4586' of github.com:mengxr/spark into ml
fc59a02 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-4586
1dca16a [Davies Liu] refactor
090b3a3 [Davies Liu] Merge branch 'master' of github.com:apache/spark into ml
0882513 [Xiangrui Meng] update doc style
a4f4dbf [Xiangrui Meng] add unit test for LR
7521d1c [Xiangrui Meng] add unit tests to HashingTF and Tokenizer
ba0ba1e [Xiangrui Meng] add unit tests for pipeline
0586c7b [Xiangrui Meng] add more comments to the example
5153cff [Xiangrui Meng] simplify java models
036ca04 [Xiangrui Meng] gen numFeatures
46fa147 [Xiangrui Meng] update mllib/pom.xml to include python files in the assembly
1dcc17e [Xiangrui Meng] update code gen and make param appear in the doc
f66ba0c [Xiangrui Meng] make params a property
d5efd34 [Xiangrui Meng] update doc conf and move embedded param map to instance attribute
f4d0fe6 [Xiangrui Meng] use LabeledDocument and Document in example
05e3e40 [Xiangrui Meng] update example
d3e8dbe [Xiangrui Meng] more docs optimize pipeline.fit impl
56de571 [Xiangrui Meng] fix style
d0c5bb8 [Xiangrui Meng] a working copy
bce72f4 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-4586
17ecfb9 [Xiangrui Meng] code gen for shared params
d9ea77c [Xiangrui Meng] update doc
c18dca1 [Xiangrui Meng] make the example working
dadd84e [Xiangrui Meng] add base classes and docs
a3015cf [Xiangrui Meng] add Estimator and Transformer
46eea43 [Xiangrui Meng] a pipeline in python
33b68e0 [Xiangrui Meng] a working LR
2015-01-28 17:14:23 -08:00
Reynold Xin c8e934ef3c [SPARK-5447][SQL] Replaced reference to SchemaRDD with DataFrame.
and

[SPARK-5448][SQL] Make CacheManager a concrete class and field in SQLContext

Author: Reynold Xin <rxin@databricks.com>

Closes #4242 from rxin/sqlCleanup and squashes the following commits:

e351cb2 [Reynold Xin] Fixed toDataFrame.
6545c42 [Reynold Xin] More changes.
728c017 [Reynold Xin] [SPARK-5447][SQL] Replaced reference to SchemaRDD with DataFrame.
2015-01-28 12:10:01 -08:00
Marcelo Vanzin 37a5e272f8 [SPARK-4809] Rework Guava library shading.
The current way of shading Guava is a little problematic. Code that
depends on "spark-core" does not see the transitive dependency, yet
classes in "spark-core" actually depend on Guava. So it's a little
tricky to run unit tests that use spark-core classes, since you need
a compatible version of Guava in your dependencies when running the
tests. This can become a little tricky, and is kind of a bad user
experience.

This change modifies the way Guava is shaded so that it's applied
uniformly across the Spark build. This means Guava is shaded inside
spark-core itself, so that the dependency issues above are solved.
Aside from that, all Spark sub-modules have their Guava references
relocated, so that they refer to the relocated classes now packaged
inside spark-core. Before, this was only done by the time the assembly
was built, so projects that did not end up inside the assembly (such
as streaming backends) could still reference the original location
of Guava classes.

The Guava classes are added to the "first" artifact Spark generates
(network-common), so that all downstream modules have the needed
classes available. Since "network-common" is a dependency of spark-core,
all Spark apps should get the relocated classes automatically.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #3658 from vanzin/SPARK-4809 and squashes the following commits:

3c93e42 [Marcelo Vanzin] Shade Guava in the network-common artifact.
5d69ec9 [Marcelo Vanzin] Merge branch 'master' into SPARK-4809
b3104fc [Marcelo Vanzin] Add comment.
941848f [Marcelo Vanzin] Merge branch 'master' into SPARK-4809
f78c48a [Marcelo Vanzin] Merge branch 'master' into SPARK-4809
8053dd4 [Marcelo Vanzin] Merge branch 'master' into SPARK-4809
107d7da [Marcelo Vanzin] Add fix for SPARK-5052 (PR #3874).
40b8723 [Marcelo Vanzin] Merge branch 'master' into SPARK-4809
4a4ed42 [Marcelo Vanzin] [SPARK-4809] Rework Guava library shading.
2015-01-28 00:29:29 -08:00
Reynold Xin 119f45d61d [SPARK-5097][SQL] DataFrame
This pull request redesigns the existing Spark SQL dsl, which already provides data frame like functionalities.

TODOs:
With the exception of Python support, other tasks can be done in separate, follow-up PRs.
- [ ] Audit of the API
- [ ] Documentation
- [ ] More test cases to cover the new API
- [x] Python support
- [ ] Type alias SchemaRDD

Author: Reynold Xin <rxin@databricks.com>
Author: Davies Liu <davies@databricks.com>

Closes #4173 from rxin/df1 and squashes the following commits:

0a1a73b [Reynold Xin] Merge branch 'df1' of github.com:rxin/spark into df1
23b4427 [Reynold Xin] Mima.
828f70d [Reynold Xin] Merge pull request #7 from davies/df
257b9e6 [Davies Liu] add repartition
6bf2b73 [Davies Liu] fix collect with UDT and tests
e971078 [Reynold Xin] Missing quotes.
b9306b4 [Reynold Xin] Remove removeColumn/updateColumn for now.
a728bf2 [Reynold Xin] Example rename.
e8aa3d3 [Reynold Xin] groupby -> groupBy.
9662c9e [Davies Liu] improve DataFrame Python API
4ae51ea [Davies Liu] python API for dataframe
1e5e454 [Reynold Xin] Fixed a bug with symbol conversion.
2ca74db [Reynold Xin] Couple minor fixes.
ea98ea1 [Reynold Xin] Documentation & literal expressions.
2b22684 [Reynold Xin] Got rid of IntelliJ problems.
02bbfbc [Reynold Xin] Tightening imports.
ffbce66 [Reynold Xin] Fixed compilation error.
59b6d8b [Reynold Xin] Style violation.
b85edfb [Reynold Xin] ALS.
8c37f0a [Reynold Xin] Made MLlib and examples compile
6d53134 [Reynold Xin] Hive module.
d35efd5 [Reynold Xin] Fixed compilation error.
ce4a5d2 [Reynold Xin] Fixed test cases in SQL except ParquetIOSuite.
66d5ef1 [Reynold Xin] SQLContext minor patch.
c9bcdc0 [Reynold Xin] Checkpoint: SQL module compiles!
2015-01-27 16:08:24 -08:00
Xiangrui Meng ea74365b7c [SPARK-3541][MLLIB] New ALS implementation with improved storage
This PR adds a new ALS implementation to `spark.ml` using the pipeline API, which should be able to scale to billions of ratings. Compared with the ALS under `spark.mllib`, the new implementation

1. uses the same algorithm,
2. uses float type for ratings,
3. uses primitive arrays to avoid GC,
4. sorts and compresses ratings on each block so that we can solve least squares subproblems one by one using only one normal equation instance.

The following figure shows performance comparison on copies of the Amazon Reviews dataset using a 16-node (m3.2xlarge) EC2 cluster (the same setup as in http://databricks.com/blog/2014/07/23/scalable-collaborative-filtering-with-spark-mllib.html):
![als-wip](https://cloud.githubusercontent.com/assets/829644/5659447/4c4ff8e0-96c7-11e4-87a9-73c1c63d07f3.png)

I keep the `spark.mllib`'s ALS untouched for easy comparison. If the new implementation works well, I'm going to match the features of the ALS under `spark.mllib` and then make it a wrapper of the new implementation, in a separate PR.

TODO:
- [X] Add unit tests for implicit preferences.

Author: Xiangrui Meng <meng@databricks.com>

Closes #3720 from mengxr/SPARK-3541 and squashes the following commits:

1b9e852 [Xiangrui Meng] fix compile
5129be9 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into SPARK-3541
dd0d0e8 [Xiangrui Meng] simplify test code
c627de3 [Xiangrui Meng] add tests for implicit feedback
b84f41c [Xiangrui Meng] address comments
a76da7b [Xiangrui Meng] update ALS tests
2a8deb3 [Xiangrui Meng] add some ALS tests
857e876 [Xiangrui Meng] add tests for rating block and encoded block
d3c1ac4 [Xiangrui Meng] rename some classes for better code readability add more doc and comments
213d163 [Xiangrui Meng] org imports
771baf3 [Xiangrui Meng] chol doc update
ca9ad9d [Xiangrui Meng] add unit tests for chol
b4fd17c [Xiangrui Meng] add unit tests for NormalEquation
d0f99d3 [Xiangrui Meng] add tests for LocalIndexEncoder
80b8e61 [Xiangrui Meng] fix imports
4937fd4 [Xiangrui Meng] update ALS example
56c253c [Xiangrui Meng] rename product to item
bce8692 [Xiangrui Meng] doc for parameters and project the output columns
3f2d81a [Xiangrui Meng] add doc
1efaecf [Xiangrui Meng] add example code
8ae86b5 [Xiangrui Meng] add a working copy of the new ALS implementation
2015-01-22 22:09:13 -08:00
jerryshao 3c3fa632e6 [SPARK-5233][Streaming] Fix error replaying of WAL introduced bug
Because of lacking of `BlockAllocationEvent` in WAL recovery, the dangled event will mix into the new batch, which will lead to the wrong result. Details can be seen in [SPARK-5233](https://issues.apache.org/jira/browse/SPARK-5233).

Author: jerryshao <saisai.shao@intel.com>

Closes #4032 from jerryshao/SPARK-5233 and squashes the following commits:

f0b0c0b [jerryshao] Further address the comments
a237c75 [jerryshao] Address the comments
e356258 [jerryshao] Fix bug in unit test
558bdc3 [jerryshao] Correctly replay the WAL log when recovering from failure
2015-01-22 21:58:53 -08:00
Travis Galoppo 23e25543be SPARK-5019 [MLlib] - GaussianMixtureModel exposes instances of MultivariateGauss...
This PR modifies GaussianMixtureModel to expose instances of MutlivariateGaussian rather than separate mean and covariance arrays.

Author: Travis Galoppo <tjg2107@columbia.edu>

Closes #4088 from tgaloppo/spark-5019 and squashes the following commits:

3ef6c7f [Travis Galoppo] In GaussianMixtureModel: Changed name of weight, gaussian to weights, gaussians.  Other sources modified accordingly.
091e8da [Travis Galoppo] SPARK-5019 - GaussianMixtureModel exposes instances of MultivariateGaussian rather than mean/covariance matrices
2015-01-20 12:58:11 -08:00
Reynold Xin 61b427d4b1 [SPARK-5193][SQL] Remove Spark SQL Java-specific API.
After the following patches, the main (Scala) API is now usable for Java users directly.

https://github.com/apache/spark/pull/4056
https://github.com/apache/spark/pull/4054
https://github.com/apache/spark/pull/4049
https://github.com/apache/spark/pull/4030
https://github.com/apache/spark/pull/3965
https://github.com/apache/spark/pull/3958

Author: Reynold Xin <rxin@databricks.com>

Closes #4065 from rxin/sql-java-api and squashes the following commits:

b1fd860 [Reynold Xin] Fix Mima
6d86578 [Reynold Xin] Ok one more attempt in fixing Python...
e8f1455 [Reynold Xin] Fix Python again...
3e53f91 [Reynold Xin] Fixed Python.
83735da [Reynold Xin] Fix BigDecimal test.
e9f1de3 [Reynold Xin] Use scala BigDecimal.
500d2c4 [Reynold Xin] Fix Decimal.
ba3bfa2 [Reynold Xin] Updated javadoc for RowFactory.
c4ae1c5 [Reynold Xin] [SPARK-5193][SQL] Remove Spark SQL Java-specific API.
2015-01-16 21:09:06 -08:00
Yuhao Yang 76389c5b99 [SPARK-5234][ml]examples for ml don't have sparkContext.stop
JIRA issue: https://issues.apache.org/jira/browse/SPARK-5234

simply add the call.

Author: Yuhao Yang <yuhao@yuhaodevbox.sh.intel.com>

Closes #4044 from hhbyyh/addscStop and squashes the following commits:

c1f75ac [Yuhao Yang] add SparkContext.stop to 3 ml examples
2015-01-14 11:53:43 -08:00
Sean Owen aff49a3ee1 SPARK-5172 [BUILD] spark-examples-***.jar shades a wrong Hadoop distribution
In addition to the `hadoop-2.x` profiles in the parent POM, there is actually another set of profiles in `examples` that has to be activated differently to get the right Hadoop 1 vs 2 flavor of HBase. This wasn't actually used in making Hadoop 2 distributions, hence the problem.

To reduce complexity, I suggest merging them with the parent POM profiles, which is possible now.

You'll see this changes appears to update the HBase version, but actually, the default 0.94 version was not being used. HBase is only used in examples, and the examples POM always chose one profile or the other that updated the version to 0.98.x anyway.

Author: Sean Owen <sowen@cloudera.com>

Closes #3992 from srowen/SPARK-5172 and squashes the following commits:

17830d9 [Sean Owen] Control hbase hadoop1/2 flavor in the parent POM with existing hadoop-2.x profiles
2015-01-12 12:15:34 -08:00
huangzhaowei f38ef6586c [SPARK-4033][Examples]Input of the SparkPi too big causes the emption exception
If input of the SparkPi args is larger than the 25000, the integer 'n' inside the code will be overflow, and may be a negative number.
And it causes  the (0 until n) Seq as an empty seq, then doing the action 'reduce'  will throw the UnsupportedOperationException("empty collection").

The max size of the input of sc.parallelize is Int.MaxValue - 1, not the Int.MaxValue.

Author: huangzhaowei <carlmartinmax@gmail.com>

Closes #2874 from SaintBacchus/SparkPi and squashes the following commits:

62d7cd7 [huangzhaowei] Add a commit to explain the modify
4cdc388 [huangzhaowei] Update SparkPi.scala
9a2fb7b [huangzhaowei] Input of the SparkPi is too big
2015-01-11 16:32:47 -08:00
Marcelo Vanzin 48cecf673c [SPARK-4048] Enhance and extend hadoop-provided profile.
This change does a few things to make the hadoop-provided profile more useful:

- Create new profiles for other libraries / services that might be provided by the infrastructure
- Simplify and fix the poms so that the profiles are only activated while building assemblies.
- Fix tests so that they're able to run when the profiles are activated
- Add a new env variable to be used by distributions that use these profiles to provide the runtime
  classpath for Spark jobs and daemons.

Author: Marcelo Vanzin <vanzin@cloudera.com>

Closes #2982 from vanzin/SPARK-4048 and squashes the following commits:

82eb688 [Marcelo Vanzin] Add a comment.
eb228c0 [Marcelo Vanzin] Fix borked merge.
4e38f4e [Marcelo Vanzin] Merge branch 'master' into SPARK-4048
9ef79a3 [Marcelo Vanzin] Alternative way to propagate test classpath to child processes.
371ebee [Marcelo Vanzin] Review feedback.
52f366d [Marcelo Vanzin] Merge branch 'master' into SPARK-4048
83099fc [Marcelo Vanzin] Merge branch 'master' into SPARK-4048
7377e7b [Marcelo Vanzin] Merge branch 'master' into SPARK-4048
322f882 [Marcelo Vanzin] Fix merge fail.
f24e9e7 [Marcelo Vanzin] Merge branch 'master' into SPARK-4048
8b00b6a [Marcelo Vanzin] Merge branch 'master' into SPARK-4048
9640503 [Marcelo Vanzin] Cleanup child process log message.
115fde5 [Marcelo Vanzin] Simplify a comment (and make it consistent with another pom).
e3ab2da [Marcelo Vanzin] Fix hive-thriftserver profile.
7820d58 [Marcelo Vanzin] Fix CliSuite with provided profiles.
1be73d4 [Marcelo Vanzin] Restore flume-provided profile.
d1399ed [Marcelo Vanzin] Restore jetty dependency.
82a54b9 [Marcelo Vanzin] Remove unused profile.
5c54a25 [Marcelo Vanzin] Fix HiveThriftServer2Suite with *-provided profiles.
1fc4d0b [Marcelo Vanzin] Update dependencies for hive-thriftserver.
f7b3bbe [Marcelo Vanzin] Add snappy to hadoop-provided list.
9e4e001 [Marcelo Vanzin] Remove duplicate hive profile.
d928d62 [Marcelo Vanzin] Redirect child stderr to parent's log.
4d67469 [Marcelo Vanzin] Propagate SPARK_DIST_CLASSPATH on Yarn.
417d90e [Marcelo Vanzin] Introduce "SPARK_DIST_CLASSPATH".
2f95f0d [Marcelo Vanzin] Propagate classpath to child processes during testing.
1adf91c [Marcelo Vanzin] Re-enable maven-install-plugin for a few projects.
284dda6 [Marcelo Vanzin] Rework the "hadoop-provided" profile, add new ones.
2015-01-08 17:15:13 -08:00
Sean Owen 4cba6eb420 SPARK-4159 [CORE] Maven build doesn't run JUnit test suites
This PR:

- Reenables `surefire`, and copies config from `scalatest` (which is itself an old fork of `surefire`, so similar)
- Tells `surefire` to test only Java tests
- Enables `surefire` and `scalatest` for all children, and in turn eliminates some duplication.

For me this causes the Scala and Java tests to be run once each, it seems, as desired. It doesn't affect the SBT build but works for Maven. I still need to verify that all of the Scala tests and Java tests are being run.

Author: Sean Owen <sowen@cloudera.com>

Closes #3651 from srowen/SPARK-4159 and squashes the following commits:

2e8a0af [Sean Owen] Remove specialized SPARK_HOME setting for REPL, YARN tests as it appears to be obsolete
12e4558 [Sean Owen] Append to unit-test.log instead of overwriting, so that both surefire and scalatest output is preserved. Also standardize/correct comments a bit.
e6f8601 [Sean Owen] Reenable Java tests by reenabling surefire with config cloned from scalatest; centralize test config in the parent
2015-01-06 12:02:08 -08:00
Josh Rosen 352ed6bbe3 [SPARK-1010] Clean up uses of System.setProperty in unit tests
Several of our tests call System.setProperty (or test code which implicitly sets system properties) and don't always reset/clear the modified properties, which can create ordering dependencies between tests and cause hard-to-diagnose failures.

This patch removes most uses of System.setProperty from our tests, since in most cases we can use SparkConf to set these configurations (there are a few exceptions, including the tests of SparkConf itself).

For the cases where we continue to use System.setProperty, this patch introduces a `ResetSystemProperties` ScalaTest mixin class which snapshots the system properties before individual tests and to automatically restores them on test completion / failure.  See the block comment at the top of the ResetSystemProperties class for more details.

Author: Josh Rosen <joshrosen@databricks.com>

Closes #3739 from JoshRosen/cleanup-system-properties-in-tests and squashes the following commits:

0236d66 [Josh Rosen] Replace setProperty uses in two example programs / tools
3888fe3 [Josh Rosen] Remove setProperty use in LocalJavaStreamingContext
4f4031d [Josh Rosen] Add note on why SparkSubmitSuite needs ResetSystemProperties
4742a5b [Josh Rosen] Clarify ResetSystemProperties trait inheritance ordering.
0eaf0b6 [Josh Rosen] Remove setProperty call in TaskResultGetterSuite.
7a3d224 [Josh Rosen] Fix trait ordering
3fdb554 [Josh Rosen] Remove setProperty call in TaskSchedulerImplSuite
bee20df [Josh Rosen] Remove setProperty calls in SparkContextSchedulerCreationSuite
655587c [Josh Rosen] Remove setProperty calls in JobCancellationSuite
3f2f955 [Josh Rosen] Remove System.setProperty calls in DistributedSuite
cfe9cce [Josh Rosen] Remove use of system properties in SparkContextSuite
8783ab0 [Josh Rosen] Remove TestUtils.setSystemProperty, since it is subsumed by the ResetSystemProperties trait.
633a84a [Josh Rosen] Remove use of system properties in FileServerSuite
25bfce2 [Josh Rosen] Use ResetSystemProperties in UtilsSuite
1d1aa5a [Josh Rosen] Use ResetSystemProperties in SizeEstimatorSuite
dd9492b [Josh Rosen] Use ResetSystemProperties in AkkaUtilsSuite
b0daff2 [Josh Rosen] Use ResetSystemProperties in BlockManagerSuite
e9ded62 [Josh Rosen] Use ResetSystemProperties in TaskSchedulerImplSuite
5b3cb54 [Josh Rosen] Use ResetSystemProperties in SparkListenerSuite
0995c4b [Josh Rosen] Use ResetSystemProperties in SparkContextSchedulerCreationSuite
c83ded8 [Josh Rosen] Use ResetSystemProperties in SparkConfSuite
51aa870 [Josh Rosen] Use withSystemProperty in ShuffleSuite
60a63a1 [Josh Rosen] Use ResetSystemProperties in JobCancellationSuite
14a92e4 [Josh Rosen] Use withSystemProperty in FileServerSuite
628f46c [Josh Rosen] Use ResetSystemProperties in DistributedSuite
9e3e0dd [Josh Rosen] Add ResetSystemProperties test fixture mixin; use it in SparkSubmitSuite.
4dcea38 [Josh Rosen] Move withSystemProperty to TestUtils class.
2014-12-30 18:12:20 -08:00
Travis Galoppo 6cf6fdf3ff SPARK-4156 [MLLIB] EM algorithm for GMMs
Implementation of Expectation-Maximization for Gaussian Mixture Models.

This is my maiden contribution to Apache Spark, so I apologize now if I have done anything incorrectly; having said that, this work is my own, and I offer it to the project under the project's open source license.

Author: Travis Galoppo <tjg2107@columbia.edu>
Author: Travis Galoppo <travis@localhost.localdomain>
Author: tgaloppo <tjg2107@columbia.edu>
Author: FlytxtRnD <meethu.mathew@flytxt.com>

Closes #3022 from tgaloppo/master and squashes the following commits:

aaa8f25 [Travis Galoppo] MLUtils: changed privacy of EPSILON from [util] to [mllib]
709e4bf [Travis Galoppo] fixed usage line to include optional maxIterations parameter
acf1fba [Travis Galoppo] Fixed parameter comment in GaussianMixtureModel Made maximum iterations an optional parameter to DenseGmmEM
9b2fc2a [Travis Galoppo] Style improvements Changed ExpectationSum to a private class
b97fe00 [Travis Galoppo] Minor fixes and tweaks.
1de73f3 [Travis Galoppo] Removed redundant array from array creation
578c2d1 [Travis Galoppo] Removed unused import
227ad66 [Travis Galoppo] Moved prediction methods into model class.
308c8ad [Travis Galoppo] Numerous changes to improve code
cff73e0 [Travis Galoppo] Replaced accumulators with RDD.aggregate
20ebca1 [Travis Galoppo] Removed unusued code
42b2142 [Travis Galoppo] Added functionality to allow setting of GMM starting point. Added two cluster test to testing suite.
8b633f3 [Travis Galoppo] Style issue
9be2534 [Travis Galoppo] Style issue
d695034 [Travis Galoppo] Fixed style issues
c3b8ce0 [Travis Galoppo] Merge branch 'master' of https://github.com/tgaloppo/spark   Adds predict() method
2df336b [Travis Galoppo] Fixed style issue
b99ecc4 [tgaloppo] Merge pull request #1 from FlytxtRnD/predictBranch
f407b4c [FlytxtRnD] Added predict() to return the cluster labels and membership values
97044cf [Travis Galoppo] Fixed style issues
dc9c742 [Travis Galoppo] Moved MultivariateGaussian utility class
e7d413b [Travis Galoppo] Moved multivariate Gaussian utility class to mllib/stat/impl Improved comments
9770261 [Travis Galoppo] Corrected a variety of style and naming issues.
8aaa17d [Travis Galoppo] Added additional train() method to companion object for cluster count and tolerance parameters.
676e523 [Travis Galoppo] Fixed to no longer ignore delta value provided on command line
e6ea805 [Travis Galoppo] Merged with master branch; update test suite with latest context changes. Improved cluster initialization strategy.
86fb382 [Travis Galoppo] Merge remote-tracking branch 'upstream/master'
719d8cc [Travis Galoppo] Added scala test suite with basic test
c1a8e16 [Travis Galoppo] Made GaussianMixtureModel class serializable Modified sum function for better performance
5c96c57 [Travis Galoppo] Merge remote-tracking branch 'upstream/master'
c15405c [Travis Galoppo] SPARK-4156
2014-12-29 15:29:15 -08:00
zsxwing f9ed2b6641 [SPARK-4608][Streaming] Reorganize StreamingContext implicit to improve API convenience
There is only one implicit function `toPairDStreamFunctions` in `StreamingContext`. This PR did similar reorganization like [SPARK-4397](https://issues.apache.org/jira/browse/SPARK-4397).

Compiled the following codes with Spark Streaming 1.1.0 and ran it with this PR. Everything is fine.
```Scala
import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.StreamingContext._

object StreamingApp {

  def main(args: Array[String]) {
    val conf = new SparkConf().setMaster("local[2]").setAppName("FileWordCount")
    val ssc = new StreamingContext(conf, Seconds(10))
    val lines = ssc.textFileStream("/some/path")
    val words = lines.flatMap(_.split(" "))
    val pairs = words.map(word => (word, 1))
    val wordCounts = pairs.reduceByKey(_ + _)
    wordCounts.print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```

Author: zsxwing <zsxwing@gmail.com>

Closes #3464 from zsxwing/SPARK-4608 and squashes the following commits:

aa6d44a [zsxwing] Fix a copy-paste error
f74c190 [zsxwing] Merge branch 'master' into SPARK-4608
e6f9cc9 [zsxwing] Update the docs
27833bb [zsxwing] Remove `import StreamingContext._`
c15162c [zsxwing] Reorganize StreamingContext implicit to improve API convenience
2014-12-25 19:46:05 -08:00
Takeshi Yamamuro 9c251c555f [SPARK-4932] Add help comments in Analytics
Trivial modifications for usability.

Author: Takeshi Yamamuro <linguin.m.s@gmail.com>

Closes #3775 from maropu/AddHelpCommentInAnalytics and squashes the following commits:

fbea8f5 [Takeshi Yamamuro] Add help comments in Analytics
2014-12-23 12:39:41 -08:00
carlmartin 1d9788e42e [Minor] Improve some code in BroadcastTest for short
Using
    val arr1 = (0 until num).toArray
instead of
    val arr1 = new Array[Int](num)
    for (i <- 0 until arr1.length) {
      arr1(i) = i
    }
for short.

Author: carlmartin <carlmartinmax@gmail.com>

Closes #3750 from SaintBacchus/BroadcastTest and squashes the following commits:

43adb70 [carlmartin] Improve some code in BroadcastTest for short
2014-12-22 12:13:53 -08:00
Ernest a7ed6f3cc5 [SPARK-4880] remove spark.locality.wait in Analytics
spark.locality.wait set to 100000 in examples/graphx/Analytics.scala.
Should be left to the user.

Author: Ernest <earneyzxl@gmail.com>

Closes #3730 from Earne/SPARK-4880 and squashes the following commits:

d79ed04 [Ernest] remove spark.locality.wait in Analytics
2014-12-18 15:42:26 -08:00
Kostas Sakellis d6a972b3e4 [SPARK-4774] [SQL] Makes HiveFromSpark more portable
HiveFromSpark read the kv1.txt file from SPARK_HOME/examples/src/main/resources/kv1.txt which assumed
you had a source tree checked out. Now we copy the kv1.txt file to a temporary file and delete it when
the jvm shuts down. This allows us to run this example outside of a spark source tree.

Author: Kostas Sakellis <kostas@cloudera.com>

Closes #3628 from ksakellis/kostas-spark-4774 and squashes the following commits:

6770f83 [Kostas Sakellis] [SPARK-4774] [SQL] Makes HiveFromSpark more portable
2014-12-08 15:44:18 -08:00
Xiangrui Meng 7e758d7092 [FIX][DOC] Fix broken links in ml-guide.md
and some minor changes in ScalaDoc.

Author: Xiangrui Meng <meng@databricks.com>

Closes #3601 from mengxr/SPARK-4575-fix and squashes the following commits:

c559768 [Xiangrui Meng] minor code update
ce94da8 [Xiangrui Meng] Java Bean -> JavaBean
0b5c182 [Xiangrui Meng] fix links in ml-guide
2014-12-04 20:16:35 +08:00
Joseph K. Bradley 469a6e5f3b [SPARK-4575] [mllib] [docs] spark.ml pipelines doc + bug fixes
Documentation:
* Added ml-guide.md, linked from mllib-guide.md
* Updated mllib-guide.md with small section pointing to ml-guide.md

Examples:
* CrossValidatorExample
* SimpleParamsExample
* (I copied these + the SimpleTextClassificationPipeline example into the ml-guide.md)

Bug fixes:
* PipelineModel: did not use ParamMaps correctly
* UnaryTransformer: issues with TypeTag serialization (Thanks to mengxr for that fix!)

CC: mengxr shivaram  etrain  Documentation for Pipelines: I know the docs are not complete, but the goal is to have enough to let interested people get started using spark.ml and to add more docs once the package is more established/complete.

Author: Joseph K. Bradley <joseph@databricks.com>
Author: jkbradley <joseph.kurata.bradley@gmail.com>
Author: Xiangrui Meng <meng@databricks.com>

Closes #3588 from jkbradley/ml-package-docs and squashes the following commits:

d393b5c [Joseph K. Bradley] fixed bug in Pipeline (typo from last commit).  updated examples for CV and Params for spark.ml
c38469c [Joseph K. Bradley] Updated ml-guide with CV examples
99f88c2 [Joseph K. Bradley] Fixed bug in PipelineModel.transform* with usage of params.  Updated CrossValidatorExample to use more training examples so it is less likely to get a 0-size fold.
ea34dc6 [jkbradley] Merge pull request #4 from mengxr/ml-package-docs
3b83ec0 [Xiangrui Meng] replace TypeTag with explicit datatype
41ad9b1 [Joseph K. Bradley] Added examples for spark.ml: SimpleParamsExample + Java version, CrossValidatorExample + Java version.  CrossValidatorExample not working yet.  Added programming guide for spark.ml, but need to add CrossValidatorExample to it once CrossValidatorExample works.
2014-12-04 17:00:06 +08:00
Joseph K. Bradley 657a88835d [SPARK-4580] [SPARK-4610] [mllib] [docs] Documentation for tree ensembles + DecisionTree API fix
Major changes:
* Added programming guide sections for tree ensembles
* Added examples for tree ensembles
* Updated DecisionTree programming guide with more info on parameters
* **API change**: Standardized the tree parameter for the number of classes (for classification)

Minor changes:
* Updated decision tree documentation
* Updated existing tree and tree ensemble examples
 * Use train/test split, and compute test error instead of training error.
 * Fixed decision_tree_runner.py to actually use the number of classes it computes from data. (small bug fix)

Note: I know this is a lot of lines, but most is covered by:
* Programming guide sections for gradient boosting and random forests.  (The changes are probably best viewed by generating the docs locally.)
* New examples (which were copied from the programming guide)
* The "numClasses" renaming

I have run all examples and relevant unit tests.

CC: mengxr manishamde codedeft

Author: Joseph K. Bradley <joseph@databricks.com>
Author: Joseph K. Bradley <joseph.kurata.bradley@gmail.com>

Closes #3461 from jkbradley/ensemble-docs and squashes the following commits:

70a75f3 [Joseph K. Bradley] updated forest vs boosting comparison
d1de753 [Joseph K. Bradley] Added note about toString and toDebugString for DecisionTree to migration guide
8e87f8f [Joseph K. Bradley] Combined GBT and RandomForest guides into one ensembles guide
6fab846 [Joseph K. Bradley] small fixes based on review
b9f8576 [Joseph K. Bradley] updated decision tree doc
375204c [Joseph K. Bradley] fixed python style
2b60b6e [Joseph K. Bradley] merged Java RandomForest examples into 1 file.  added header.  Fixed small bug in same example in the programming guide.
706d332 [Joseph K. Bradley] updated python DT runner to print full model if it is small
c76c823 [Joseph K. Bradley] added migration guide for mllib
abe5ed7 [Joseph K. Bradley] added examples for random forest in Java and Python to examples folder
07fc11d [Joseph K. Bradley] Renamed numClassesForClassification to numClasses everywhere in trees and ensembles. This is a breaking API change, but it was necessary to correct an API inconsistency in Spark 1.1 (where Python DecisionTree used numClasses but Scala used numClassesForClassification).
cdfdfbc [Joseph K. Bradley] added examples for GBT
6372a2b [Joseph K. Bradley] updated decision tree examples to use random split.  tested all of them.
ad3e695 [Joseph K. Bradley] added gbt and random forest to programming guide.  still need to update their examples
2014-12-04 09:57:50 +08:00
Joseph K. Bradley 4ac2151154 [SPARK-4710] [mllib] Eliminate MLlib compilation warnings
Renamed StreamingKMeans to StreamingKMeansExample to avoid warning about name conflict with StreamingKMeans class.

Added import to DecisionTreeRunner to eliminate warning.

CC: mengxr

Author: Joseph K. Bradley <joseph@databricks.com>

Closes #3568 from jkbradley/ml-compilation-warnings and squashes the following commits:

64d6bc4 [Joseph K. Bradley] Updated DecisionTreeRunner.scala and StreamingKMeans.scala to eliminate compilation warnings, including renaming StreamingKMeans to StreamingKMeansExample.
2014-12-03 18:50:03 +08:00
wangfei 7b79957879 [SQL] Minor fix for doc and comment
Author: wangfei <wangfei1@huawei.com>

Closes #3533 from scwf/sql-doc1 and squashes the following commits:

962910b [wangfei] doc and comment fix
2014-12-01 14:02:02 -08:00
Sean Owen 5d7fe178b3 SPARK-4170 [CORE] Closure problems when running Scala app that "extends App"
Warn against subclassing scala.App, and remove one instance of this in examples

Author: Sean Owen <sowen@cloudera.com>

Closes #3497 from srowen/SPARK-4170 and squashes the following commits:

4a6131f [Sean Owen] Restore multiline string formatting
a8ca895 [Sean Owen] Warn against subclassing scala.App, and remove one instance of this in examples
2014-11-27 09:03:17 -08:00
q00251598 a51118a34a [SPARK-4535][Streaming] Fix the error in comments
change `NetworkInputDStream` to `ReceiverInputDStream`
change `ReceiverInputTracker` to `ReceiverTracker`

Author: q00251598 <qiyadong@huawei.com>

Closes #3400 from watermen/fix-comments and squashes the following commits:

75d795c [q00251598] change 'NetworkInputDStream' to 'ReceiverInputDStream' && change 'ReceiverInputTracker' to 'ReceiverTracker'
2014-11-25 04:01:56 -08:00
scwf b384119304 [SQL] Fix path in HiveFromSpark
It require us to run ```HiveFromSpark``` in specified dir because ```HiveFromSpark``` use relative path, this leads to ```run-example``` error(http://apache-spark-developers-list.1001551.n3.nabble.com/src-main-resources-kv1-txt-not-found-in-example-of-HiveFromSpark-td9100.html).

Author: scwf <wangfei1@huawei.com>

Closes #3415 from scwf/HiveFromSpark and squashes the following commits:

ed3d6c9 [scwf] revert no need change
b00e20c [scwf] fix path usring spark_home
dbd321b [scwf] fix path in hivefromspark
2014-11-24 12:49:08 -08:00
Xiangrui Meng 15cacc8124 [SPARK-4486][MLLIB] Improve GradientBoosting APIs and doc
There are some inconsistencies in the gradient boosting APIs. The target is a general boosting meta-algorithm, but the implementation is attached to trees. This was partially due to the delay of SPARK-1856. But for the 1.2 release, we should make the APIs consistent.

1. WeightedEnsembleModel -> private[tree] TreeEnsembleModel and renamed members accordingly.
1. GradientBoosting -> GradientBoostedTrees
1. Add RandomForestModel and GradientBoostedTreesModel and hide CombiningStrategy
1. Slightly refactored TreeEnsembleModel (Vote takes weights into consideration.)
1. Remove `trainClassifier` and `trainRegressor` from `GradientBoostedTrees` because they are the same as `train`
1. Rename class `train` method to `run` because it hides the static methods with the same name in Java. Deprecated `DecisionTree.train` class method.
1. Simplify BoostingStrategy and make sure the input strategy is not modified. Users should put algo and numClasses in treeStrategy. We create ensembleStrategy inside boosting.
1. Fix a bug in GradientBoostedTreesSuite with AbsoluteError
1. doc updates

manishamde jkbradley

Author: Xiangrui Meng <meng@databricks.com>

Closes #3374 from mengxr/SPARK-4486 and squashes the following commits:

7097251 [Xiangrui Meng] address joseph's comments
98dea09 [Xiangrui Meng] address manish's comments
4aae3b7 [Xiangrui Meng] add RandomForestModel and GradientBoostedTreesModel, hide CombiningStrategy
ea4c467 [Xiangrui Meng] fix unit tests
751da4e [Xiangrui Meng] rename class method train -> run
19030a5 [Xiangrui Meng] update boosting public APIs
2014-11-20 00:48:59 -08:00